Pandas - Index
What's new in pandas 2.0
- Arrow / PyArrow: Faster and More Memory-efficient Operations
- Pandas was built using NumPy data structures for memory management. In 2.0 you can use pyarrow as the backing memory format.
- PyArrow is a Python library (built on top of Arrow)
- Arrow: written in C++; an open-source and language-agnostic columnar data format to represent data in memory. It can enable zero-copy sharing of data between processes.
- Polars (built on the Arrow memory format) is a Rust-based DataFrame library with a Python API. It offers a DataFrame API similar to pandas, with a focus on performance and scalability for large datasets.
- Copy-on-Write Performance Enhancement: with Copy-on-Write enabled (opt-in in 2.0), any object derived from another behaves as a copy, but the actual data copy is deferred until one of them is modified. This lazy copying is loosely similar to how Spark defers operations.
- The Index feature has been expanded to include NumPy numeric dtypes, such as int8, int16, int32, uint8, uint16, uint32, uint64, float32, and float64, whereas previously only int64, uint64, and float64 types were supported.
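A minimal sketch to check the Index dtype behaviour on your own installation. The values are made up for illustration; on pandas 2.0 or later the Index is expected to keep the requested int8 dtype, while older versions upcast to int64.

```python
import pandas as pd

# Request a small integer dtype for the Index explicitly.
idx = pd.Index([1, 2, 3], dtype="int8")

# On pandas 2.0+ the Index should keep int8; older versions upcast to int64.
print(pd.__version__, idx.dtype)
```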
How to Install / Upgrade Pandas
Check Version
>>> import pandas as pd
>>> pd.__version__
Install or Upgrade
$ pip3 install -U pandas
DataFrame
get index
df.index
get columns
df.columns
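A small sketch of both attributes on a toy DataFrame (the column names and row labels here are invented for illustration):

```python
import pandas as pd

# Toy frame with explicit row labels so the index is easy to see.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]}, index=["x", "y", "z"])

print(df.index)    # row labels, an Index object
print(df.columns)  # column labels, also an Index object

# Both can be turned into plain Python lists.
rows = df.index.tolist()
cols = df.columns.tolist()
print(rows, cols)
```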
Read As Pandas DataFrame
http://pandas.pydata.org/pandas-docs/stable/io.html
df = pd.read_csv("train.csv")
Then convert the DataFrame to a NumPy array:
data = pd.read_csv("train.csv").values
Skip the first column and convert data to float
X = df.values[:, 1:].astype(float)
Extract first column as Y
Y = df.values[:, 0]
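The same split can be tried without a file on disk. This sketch assumes a hypothetical CSV whose first column is the label, and uses iloc with to_numpy() as an alternative to .values:

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: the first column is the label.
csv_text = "label,f1,f2\n0,1.5,2\n1,3.5,4\n"
df = pd.read_csv(io.StringIO(csv_text))

X = df.iloc[:, 1:].to_numpy(dtype=float)  # every column except the first, as float
Y = df.iloc[:, 0].to_numpy()              # first column only

print(X.shape, Y.shape)
```

Selecting with iloc before converting avoids building one mixed-dtype array for the whole frame.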
Other methods:
- pd.read_csv
- pd.read_excel
- pd.read_hdf
- pd.read_sql
- pd.read_json
- pd.read_msgpack (experimental; removed in pandas 1.0)
- pd.read_html
- pd.read_gbq (experimental)
- pd.read_stata
- pd.read_sas
- pd.read_clipboard
- pd.read_pickle
Write
Write From Pandas DataFrame
Write to csv
df.to_csv("data.csv")
Other methods:
- df.to_csv
- df.to_excel
- df.to_hdf
- df.to_sql
- df.to_json
- df.to_msgpack (experimental; removed in pandas 1.0)
- df.to_html
- df.to_gbq (experimental)
- df.to_stata
- df.to_clipboard
- df.to_pickle
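A quick way to see a writer and its matching reader together is a round trip through an in-memory buffer. This is only a sketch with made-up data; passing index=False is a choice made here so the row labels are not written as an extra column.

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False omits the row labels
csv_text = buf.getvalue()

# Read the same text back into a new DataFrame.
df2 = pd.read_csv(io.StringIO(csv_text))
print(csv_text)
print(df2)
```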
Write as JSON
This is similar to the problem of dumping NumPy arrays to JSON: the json module cannot serialize a Series or an ndarray directly.
>>> import json
>>> json.dumps(pd.Series([1,2,3]))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: 0    1
1    2
2    3
dtype: int64 is not JSON serializable
>>> json.dumps(pd.Series([1,2,3]).values)
Traceback (most recent call last):
...
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: array([1, 2, 3]) is not JSON serializable
Converting to a list first solves the problem:
>>> json.dumps(pd.Series([1,2,3]).values.tolist())
'[1, 2, 3]'
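As a sketch of two ways around the error: Series.tolist() yields plain Python values that json.dumps accepts, and pandas' own Series.to_json can serialize directly. The orient="values" argument is chosen here so that only the values are written, without the index.

```python
import json
import pandas as pd

s = pd.Series([1, 2, 3])

as_list = json.dumps(s.tolist())        # tolist() yields plain Python ints
as_values = s.to_json(orient="values")  # pandas' own serializer, values only

print(as_list)
print(as_values)
```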