Pandas - Index

What's new in pandas 2.0

Arrow / PyArrow: Faster and More Memory-efficient Operations
- Pandas was built using NumPy data structures for memory management. In 2.0 you can use pyarrow as the backing memory format.
- PyArrow is the Python library built on top of Arrow.
- Arrow: written in C++; an open-source and language-agnostic columnar data format to represent data in memory. It can enable zero-copy sharing of data between processes.
- Polars is a Rust-based DataFrame library with Python bindings. It is built on the Arrow memory format and provides an API similar to pandas, aimed at better performance and scalability on large datasets.
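
As a rough illustration of the Arrow backing format, the sketch below requests an Arrow-backed integer column. It assumes pandas 2.0 or later; pyarrow is an optional dependency, so the sketch falls back to the nullable "Int64" dtype when it is not installed. The sample values are made up.

```python
import importlib.util

import pandas as pd

# pyarrow is optional; only request the Arrow-backed dtype when it is available.
has_pyarrow = importlib.util.find_spec("pyarrow") is not None
dtype = "int64[pyarrow]" if has_pyarrow else "Int64"

# Both dtypes keep integers as integers even with a missing value present.
s = pd.Series([1, 2, None], dtype=dtype)

print(s.dtype)
print(s.isna().sum())
```
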
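Copy-on-Write (opt-in in 2.0): every DataFrame or Series derived from another one behaves as a copy, but the data is only physically copied when one of the objects is modified. Deferring the copy this way is loosely comparable to how Spark evaluates operations lazily.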
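
A minimal sketch of the Copy-on-Write behaviour, assuming pandas 2.0 or later, where the mode is switched on through the "mode.copy_on_write" option. The column name and values are illustrative.

```python
import pandas as pd

# Opt in to Copy-on-Write before creating any objects (off by default in 2.0).
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2, 3]})
col = df["a"]        # shares memory with df, nothing is copied yet
col.iloc[0] = 100    # the first write triggers the copy

print(df["a"].tolist())   # the parent frame is left unchanged
print(col.tolist())       # only the derived Series sees the new value
```
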
Index dtypes: an Index can now hold any NumPy numeric dtype, including int8, int16, int32, uint8, uint16, uint32, and float32; previously only int64, uint64, and float64 were supported.
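
The dtype change can be checked with a short sketch like the one below. It assumes pandas 2.0 or later; on older versions the same calls upcast to int64 and float64.

```python
import numpy as np
import pandas as pd

# Request a small integer dtype explicitly.
idx = pd.Index([1, 2, 3], dtype="int8")

# An Index built from a float32 array keeps that dtype as well.
small = pd.Index(np.array([1.5, 2.5], dtype="float32"))

print(idx.dtype)
print(small.dtype)
```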

How to Install / Upgrade Pandas

Check Version

>>> import pandas as pd
>>> pd.__version__

Install or Upgrade

$ pip3 install -U pandas

DataFrame

Get Index

df.index

Get Columns

df.columns
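
For illustration, a small sketch with a made-up DataFrame shows what the two attributes return. The row and column labels are assumptions, not taken from any real dataset.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b"], "score": [1.0, 2.0]},
    index=["r1", "r2"],
)

row_labels = df.index.tolist()       # labels of the rows
column_labels = df.columns.tolist()  # labels of the columns

print(row_labels)
print(column_labels)
```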

Read As Pandas DataFrame

http://pandas.pydata.org/pandas-docs/stable/io.html

df = pd.read_csv("train.csv")

Then convert the DataFrame to a NumPy array:

data = pd.read_csv("train.csv").values

Skip the first column and convert data to float

X = df.values[:, 1:].astype(float)

Extract first column as Y

Y = df.values[:, 0]
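
The steps above can be tried without a file on disk. The sketch below uses an in-memory stand-in for train.csv (the column names and values are invented) and to_numpy(), which the pandas documentation recommends over .values.

```python
import io

import pandas as pd

# Stand-in for train.csv: label in the first column, features after it.
csv_text = "label,f1,f2\n0,1.5,2\n1,3.5,4\n"
df = pd.read_csv(io.StringIO(csv_text))

X = df.iloc[:, 1:].to_numpy(dtype=float)  # every column but the first, as floats
Y = df.iloc[:, 0].to_numpy()              # the first column as the target

print(X.shape, Y.shape)
```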

Other methods:

pd.read_csv
pd.read_excel
pd.read_hdf
pd.read_sql
pd.read_json
pd.read_msgpack (removed in pandas 1.0)
pd.read_html
pd.read_gbq (requires the pandas-gbq package; deprecated since pandas 2.2)
pd.read_stata
pd.read_sas
pd.read_clipboard
pd.read_pickle

Write

Write From Pandas DataFrame

Write to csv

df.to_csv("data.csv")

Other methods:

df.to_csv
df.to_excel
df.to_hdf
df.to_sql
df.to_json
df.to_msgpack (removed in pandas 1.0)
df.to_html
df.to_gbq (requires the pandas-gbq package; deprecated since pandas 2.2)
df.to_stata
df.to_clipboard
df.to_pickle
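
A quick way to check a writer is to round-trip the data through the matching reader. The sketch below does this for CSV entirely in memory; the frame contents are made up.

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
# index=False leaves the row labels out of the output.
csv_text = df.to_csv(index=False)

# Read the text back to confirm nothing was lost.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```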

Write as JSON

json.dumps cannot serialize a pandas Series or the NumPy array behind it. This is the same problem as dumping NumPy arrays to JSON:

>>> import json
>>> json.dumps(pd.Series([1,2,3]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: 0    1
1    2
2    3
dtype: int64 is not JSON serializable
>>> json.dumps(pd.Series([1,2,3]).values)
Traceback (most recent call last):
  ...
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: array([1, 2, 3]) is not JSON serializable

Converting to a list first solves the problem:

>>> json.dumps(pd.Series([1,2,3]).values.tolist())
'[1, 2, 3]'
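
When a Series or array is nested inside a larger structure, calling tolist() by hand at every level gets tedious. One possible approach, sketched below, is to pass a default hook to json.dumps; the helper name to_builtin is hypothetical and not part of pandas or the json module.

```python
import json

import numpy as np
import pandas as pd


def to_builtin(obj):
    """Hypothetical helper: turn pandas/NumPy containers into plain lists."""
    if isinstance(obj, (pd.Series, np.ndarray)):
        return obj.tolist()
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


payload = {"values": pd.Series([1, 2, 3]), "weights": np.array([0.5, 1.5])}

# json.dumps calls to_builtin for every object it cannot serialize itself.
text = json.dumps(payload, default=to_builtin)
print(text)
```

For a whole Series or DataFrame, the to_json method listed above is the more direct route.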