Pandas - DataFrame
Filter Rows
$ pip install numexpr
>>> from numpy.random import randn
>>> import pandas as pd
>>> df = pd.DataFrame(randn(10, 2), columns=list('ab'))
>>> df
a b
0 -0.428884 -0.567039
1 0.228477 -0.143088
2 -0.317275 -1.050987
3 0.374664 -0.548143
4 2.144432 0.446891
5 -1.796953 -0.761425
6 -2.076286 -1.745856
7 0.262923 0.479044
8 -1.398169 1.597476
9 0.491541 -0.648251
Filter by .query()
>>> df.query('a > b')
a b
0 -0.428884 -0.567039
1 0.228477 -0.143088
2 -0.317275 -1.050987
3 0.374664 -0.548143
4 2.144432 0.446891
9 0.491541 -0.648251
Filter by comparing columns:
>>> df[df['a'] > df['b']]
a b
0 -0.428884 -0.567039
1 0.228477 -0.143088
2 -0.317275 -1.050987
3 0.374664 -0.548143
4 2.144432 0.446891
9 0.491541 -0.648251
Conversion
Series => DataFrame
Series.reset_index(drop=False)
Series => Series
Series.reset_index(drop=True)
Drop
Drop by column name
df = df.drop('column_name', 1)
# or
df = df.drop('column_name', axis=1)
inplace:
df.drop('column_name', axis=1, inplace=True)
Drop by column number
df.drop(df.columns[[0, 1, 3]], axis=1)
Drop first column:
df.iloc[:, 1:]
Drop first row:
df.drop(0)
Add Column
add constant column:
df['new'] = 0
Insert First Column
df.insert(0, 'mean', df.mean(1))
Rename Column
Option 1: Modify columns
df.columns = ['Leader', 'Time', 'Score']
Option 2: use df.rename()
df.rename(columns={'Leader': 'Commander'}, inplace=True)
Transpose
Specify index before transpose.
df = pd.read_csv("daily_mean.csv")
df = df.set_index("date")
df = df.sort_index()
df = df.transpose()
Iterate Row By Row
Use df.iterrows()
for index, row in df.iterrows():
# do something
Add Another Column
E.g. if missing_rate
is larger than 0.5, set usagetype
to SUPPLEMENTARY
, otherwise ACTIVE
merged.loc[:, 'usagetype'] = merged['missing_rate'].apply(lambda x: 'SUPPLEMENTARY' if x > 0.5 else ‘ACTIVE’)
for each row
merged.loc[:, 'optype'] = merged.apply(lambda x: ..., axis=1)
or use map()
df['new_col'] = df['old_col'].map(lambda x: ...)
Reorder Columns
df.reindex_axis(sorted(df.columns), axis=1)
Replace nan with 0
df.fillna(0)
How to trim all the fields
df = df.apply(lambda x: x.str.strip())
How to delete the last column
df = df.iloc[:, :-1]
How to apply the user-defined function to every row
# User defined function
def foo(row):
return row['A'] + row['B']
# Apply the function to each row (axis=1)
df['bar'] = df.apply(foo, axis=1)
How to check if dataframe has column?
if 'A' in df:
# or
if 'A' in df.columns:
What's equivalent to LIMIT in SQL?
df.iloc[:1000]
ordf.head(1000)
: get first 1000 rows.df.iloc[-1000:]
ordf.tail(1000)
: get last 1000 rows.
categorical to continuous
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html