Pandas - Index

What's new in pandas 2.0

  • Arrow / PyArrow: Faster and More Memory-efficient Operations
    • Pandas was built using NumPy data structures for memory management. In 2.0 you can use pyarrow as the backing memory format.
    • PyArrow is a Python library (built on top of Arrow)
    • Arrow: written in C++; an open-source and language-agnostic columnar data format to represent data in memory. It can enable zero-copy sharing of data between processes.
    • Polars (built on the Arrow memory format) is a DataFrame library written in Rust with Python bindings. Its API is similar to pandas, but with better performance and scalability for large datasets.
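  • Copy-on-Write Performance Enhancement: opt-in in 2.0. An object derived from another behaves like a copy, but the data is physically copied only when one of them is modified. This deferral loosely resembles how lazy operations are performed in Spark.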
  • The Index can now hold any NumPy numeric dtype, such as int8, int16, int32, uint8, uint16, uint32, and float32, whereas previously only int64, uint64, and float64 were supported.
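
As a small illustration of the Index and Copy-on-Write points, here is a sketch that assumes pandas 2.x, where the mode.copy_on_write option is opt-in. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Index keeps the NumPy dtype it is given (pandas >= 2.0);
# earlier releases would upcast this to int64.
idx = pd.Index(np.array([1, 2, 3], dtype="int8"))
print(idx.dtype)

# Copy-on-Write is opt-in in pandas 2.x; enable it only inside this block.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3]})
    col = df["a"]        # shares data with df until something is written
    col.iloc[0] = 100    # the write triggers the actual copy
    first_in_df = int(df["a"].iloc[0])   # df itself is left unchanged
    first_in_col = int(col.iloc[0])

print(first_in_df, first_in_col)
```

With Copy-on-Write enabled, the write to col should not leak back into df.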

How to Install / Upgrade Pandas

Check Version

>>> import pandas as pd
>>> pd.__version__

Install or Upgrade

$ pip3 install -U pandas

DataFrame

Get Index

df.index

Get Columns

df.columns
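
A minimal sketch of both attributes on a made-up DataFrame (the column names and index labels are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b", "c"], "score": [0.5, 0.7, 0.9]},
    index=[10, 20, 30],
)

index_labels = df.index.tolist()     # row labels
column_names = df.columns.tolist()   # column labels
print(index_labels)
print(column_names)
```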

Read As Pandas DataFrame

http://pandas.pydata.org/pandas-docs/stable/io.html

df = pd.read_csv("train.csv")

Then convert the DataFrame to a NumPy array:

data = pd.read_csv("train.csv").values

Skip the first column and convert data to float

X = df.values[:, 1:].astype(float)

Extract first column as Y

Y = df.values[:, 0]
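
The same split can be sketched end to end without a file on disk. The CSV text below is a hypothetical stand-in for train.csv, assuming the label sits in the first column. It uses to_numpy(), which the pandas documentation recommends over .values.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: label first, features after it.
csv_text = "label,f1,f2\n1,0.5,2\n0,1.5,4\n"
df = pd.read_csv(io.StringIO(csv_text))

X = df.iloc[:, 1:].to_numpy(dtype=float)   # every column except the first
Y = df.iloc[:, 0].to_numpy()               # first column only
print(X.shape, Y.shape)
```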

Other methods:

  • pd.read_csv
  • pd.read_excel
  • pd.read_hdf
  • pd.read_sql
  • pd.read_json
  • pd.read_msgpack (removed in pandas 1.0)
  • pd.read_html
  • pd.read_gbq (experimental)
  • pd.read_stata
  • pd.read_sas
  • pd.read_clipboard
  • pd.read_pickle

Write

Write From Pandas DataFrame

Write to csv

df.to_csv("data.csv")
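
By default to_csv also writes the row index as the first column. A minimal sketch of a round trip without it, using an in-memory buffer as a stand-in for "data.csv" (the sample data is made up):

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

buffer = io.StringIO()           # in-memory stand-in for "data.csv"
df.to_csv(buffer, index=False)   # index=False leaves out the row labels
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives an equivalent DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))
```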

Other methods:

  • df.to_csv
  • df.to_excel
  • df.to_hdf
  • df.to_sql
  • df.to_json
  • df.to_msgpack (removed in pandas 1.0)
  • df.to_html
  • df.to_gbq (experimental)
  • df.to_stata
  • df.to_clipboard
  • df.to_pickle

Write as JSON

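This is similar to the problem of dumping JSON in NumPy: json.dumps cannot serialize a Series or a NumPy array directly.

>>> import json
>>> import pandas as pd
>>> json.dumps(pd.Series([1,2,3]))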
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: 0    1
1    2
2    3
dtype: int64 is not JSON serializable
>>> json.dumps(pd.Series([1,2,3]).values)
Traceback (most recent call last):
  ...
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: array([1, 2, 3]) is not JSON serializable

Converting to a list first solves the problem:

>>> json.dumps(pd.Series([1,2,3]).values.tolist())
'[1, 2, 3]'
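
As an alternative sketch, pandas objects also carry their own to_json method, which avoids the json module entirely. The orient="values" choice here is just one option, picked because it yields a plain JSON array.

```python
import json
import pandas as pd

s = pd.Series([1, 2, 3])

via_list = json.dumps(s.tolist())         # tolist() yields plain Python ints
via_pandas = s.to_json(orient="values")   # pandas' own serializer
print(via_list)
print(via_pandas)
```

Both strings decode to the same list. The to_json route also works on a DataFrame.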