logo

Pandas - DataFrame

Filter Rows

$ pip install numexpr
>>> from numpy.random import randn
>>> import pandas as pd
>>> df = pd.DataFrame(randn(10, 2), columns=list('ab'))
>>> df
          a         b
0 -0.428884 -0.567039
1  0.228477 -0.143088
2 -0.317275 -1.050987
3  0.374664 -0.548143
4  2.144432  0.446891
5 -1.796953 -0.761425
6 -2.076286 -1.745856
7  0.262923  0.479044
8 -1.398169  1.597476
9  0.491541 -0.648251

Filter by .query()

>>> df.query('a > b')
          a         b
0 -0.428884 -0.567039
1  0.228477 -0.143088
2 -0.317275 -1.050987
3  0.374664 -0.548143
4  2.144432  0.446891
9  0.491541 -0.648251

Filter by comparing columns:

>>> df[df['a'] > df['b']]
          a         b
0 -0.428884 -0.567039
1  0.228477 -0.143088
2 -0.317275 -1.050987
3  0.374664 -0.548143
4  2.144432  0.446891
9  0.491541 -0.648251

Conversion

Series => DataFrame

Series.reset_index(drop=False)

Series => Series

Series.reset_index(drop=True)

Drop

Drop by column name

df = df.drop('column_name', 1)

# or
df = df.drop('column_name', axis=1)

inplace:

df.drop('column_name', axis=1, inplace=True)

Drop by column number

df.drop(df.columns[[0, 1, 3]], axis=1)

Drop first column:

df.iloc[:, 1:]

Drop first row:

df.drop(0)

Add Column

add constant column:

df['new'] = 0

Insert First Column

df.insert(0, 'mean', df.mean(1))

Rename Column

Option 1: Modify columns

df.columns = ['Leader', 'Time', 'Score']

Option 2: use df.rename()

df.rename(columns={'Leader': 'Commander'}, inplace=True)

Transpose

Specify index before transpose.

df = pd.read_csv("daily_mean.csv")

df = df.set_index("date")

df = df.sort_index()

df = df.transpose()

Iterate Row By Row

Use df.iterrows()

for index, row in df.iterrows():
    # do something

Add Another Column

E.g. if missing_rate is larger than 0.5, set usagetype to SUPPLEMENTARY, otherwise ACTIVE

merged.loc[:, 'usagetype'] = merged['missing_rate'].apply(lambda x: 'SUPPLEMENTARY' if x > 0.5 else ‘ACTIVE’)

for each row

merged.loc[:, 'optype'] = merged.apply(lambda x: ..., axis=1)

or use map()

df['new_col'] = df['old_col'].map(lambda x: ...)

Reorder Columns

df.reindex_axis(sorted(df.columns), axis=1)

Replace nan with 0

df.fillna(0)

How to trim all the fields

df = df.apply(lambda x: x.str.strip())

How to delete the last column

df = df.iloc[:, :-1]

How to apply the user-defined function to every row

# User defined function
def foo(row):
    return row['A'] + row['B']

# Apply the function to each row (axis=1)
df['bar'] = df.apply(foo, axis=1)

How to check if dataframe has column?

if 'A' in df:

# or
if 'A' in df.columns:

What's equivalent to LIMIT in SQL?

  • df.iloc[:1000] or df.head(1000): get first 1000 rows.
  • df.iloc[-1000:] or df.tail(1000): get last 1000 rows.

categorical to continuous

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html