Efficient reading and parsing of large CSV files in Python with Pandas and Arrow

Categories: Python, Pandas

Author: Vincent Grégoire, PhD, CFA

Published: March 1, 2024

CSV files are ubiquitous in empirical finance and data science more generally. They are easy to create and read and are supported by most data processing tools. The pandas.read_csv() function is flexible and easy to use, but it can be slow and memory-intensive for large files if not used properly.

In this post, I will show you how to read and parse large CSV files efficiently in Python using the pyarrow library with pandas. I will also show you how to convert the CSV file to the Parquet format, which is a columnar storage format that is more efficient than CSV for reading and writing data.

Video tutorial

This post is also available as a video tutorial on YouTube.

Apache Arrow and the pyarrow library

According to the Apache Arrow website:

Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system (or programming language) to another.

A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. This data format has a rich data type system (including nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.

Let’s unpack this a bit:

  • Arrow defines a standard way to represent structured data (think table or dataframe) in memory. This is important because:
    • Developers can build tools and libraries using Arrow that are compatible with each other and they do not have to design everything from scratch.
    • It allows different systems and programming languages to share data without having to convert it to a different format.
    • It is designed to be efficient and fast, which is important for large datasets and high-performance computing.
  • The data is stored in columns, which is different from the row-based storage of CSV files and most databases. This is important because:
    • Columnar storage is more efficient for many types of data processing, especially for analytics and data science.
  • Arrow has a rich data type system, which is important because:
    • Data can be represented in a standardized way, which makes it easier for interoperability between different systems and programming languages.
    • Data can be represented in a way that is more efficient and flexible.
    • It can represent complex data structures, such as nested data and user-defined data types.

Arrow provides a set of libraries for different programming languages, including C++, Java, Python, and R. The pyarrow library is the Python implementation of Arrow. It provides a way to work with Arrow data in Python, including converting data between Arrow and other formats, such as CSV and Parquet. A big change that came with pandas 2.0 is that pyarrow can now be used with pandas not only to read and write data, but also as a dtype backend, which means that pandas can use the Arrow in-memory format to store and process data.
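As a quick sketch of what this looks like in practice, here is a small (toy) Arrow table converted to a pandas DataFrame that keeps Arrow as its storage backend:

import pandas as pd
import pyarrow as pa

# A tiny in-memory Arrow table (toy data, for illustration only).
table = pa.table({"permno": [10026, 93436], "ret": [0.0118, -0.0632]})

# types_mapper=pd.ArrowDtype asks pandas to create Arrow-backed columns
# instead of the default numpy-backed ones.
df = table.to_pandas(types_mapper=pd.ArrowDtype)
print(df.dtypes)  # permno: int64[pyarrow], ret: double[pyarrow]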

Of note, Wes McKinney, who created pandas, is also a co-creator of Apache Arrow.

import pandas as pd
import pyarrow as pa
import time

print(f"pandas version: {pd.__version__}")
print(f"pyarrow version: {pa.__version__}")
pandas version: 2.1.4
pyarrow version: 14.0.1

The problem: reading large CSV files in Python

Reading large CSV files in Python with pandas can take time and slow down your workflow, especially if it is done repeatedly.

As an example, I use a large CSV file containing daily stock information such as prices, returns, and trading volumes for publicly listed U.S. stocks from the Center for Research in Security Prices (CRSP) that I obtained from Wharton Research Data Services (WRDS). The file contains 8,883,156 rows and 64 columns of different data types. The original file is a CSV file compressed in the gzip format (.csv.gz) that is 486.4 MB in size. The uncompressed file is 3.91 GB in size.

The timing results presented in this post are obtained on a MacBook Pro with an M3 Max processor and 64 GB of RAM. If done properly, RAM should not be a limiting factor for reading and processing the sample file. My tips will focus on speed first and memory usage second.
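Throughout the post, I time operations with pairs of time.time() calls. If you want to tidy up that pattern, a small context manager like this sketch works (it is not what produced the outputs below):

import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    # Print the wall-clock time taken by the body of the with-block.
    start = time.time()
    try:
        yield
    finally:
        print(f"{label}: {time.time() - start:.4f} seconds")

# Usage:
# with timer("Reading compressed file"):
#     df = pd.read_csv("files/crsp_dsf.csv.gz")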

First, let’s take a look at the first few rows of the file to understand its structure:

df = pd.read_csv("files/crsp_dsf.csv", nrows=100)
df.head()
PERMNO SecInfoStartDt SecInfoEndDt SecurityBegDt SecurityEndDt SecurityHdrFlg HdrCUSIP HdrCUSIP9 CUSIP CUSIP9 ... DlyVol DlyClose DlyLow DlyHigh DlyBid DlyAsk DlyOpen DlyNumTrd DlyMMCnt DlyPrcVol
0 10026 2019-08-05 2021-04-26 1986-02-04 2023-12-29 N 46603210 466032109 46603210 466032109 ... 88291 181.68 181.440 185.30 181.67 181.70 185.30 2300 44 16040708.9
1 10026 2019-08-05 2021-04-26 1986-02-04 2023-12-29 N 46603210 466032109 46603210 466032109 ... 71463 184.91 180.890 185.17 184.90 184.91 180.89 1572 44 13214223.3
2 10026 2019-08-05 2021-04-26 1986-02-04 2023-12-29 N 46603210 466032109 46603210 466032109 ... 70308 185.07 183.140 185.52 184.77 185.07 184.18 1764 44 13011901.6
3 10026 2019-08-05 2021-04-26 1986-02-04 2023-12-29 N 46603210 466032109 46603210 466032109 ... 72267 183.03 182.475 184.63 182.90 183.22 184.40 2032 44 13227029.0
4 10026 2019-08-05 2021-04-26 1986-02-04 2023-12-29 N 46603210 466032109 46603210 466032109 ... 118592 182.04 181.810 183.89 182.04 182.23 182.73 2208 45 21588487.7

5 rows × 64 columns

Tip 1: Keep your CSV files compressed

Reading a large CSV file with pandas is slow, but reading a compressed file is generally no slower. Compressed files take less space on disk and are sometimes faster to read. In my experience, if you are using a modern computer with a fast SSD, reading compressed and uncompressed files is about the same speed, but if you are using a hard drive or other slower storage, reading compressed files will often be faster.

The gzip format is a good choice for compressing CSV files. It is supported by most data processing tools and is fast to read and write. I have found that the xz format achieves an even better compression ratio with similar performance, but it is not as widely supported as gzip.
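Conveniently, pandas infers the compression codec from the file extension on both read and write, so keeping files compressed requires no extra code. A minimal sketch, assuming df is already loaded and the paths are hypothetical:

# The codec is inferred from the extension (.gz -> gzip, .xz -> xz/LZMA).
df.to_csv("files/crsp_dsf.csv.gz", index=False)
df.to_csv("files/crsp_dsf.csv.xz", index=False)
df = pd.read_csv("files/crsp_dsf.csv.gz")  # decompressed transparently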

x = time.time()
df = pd.read_csv("files/crsp_dsf.csv")
print(f"Reading uncompressed file: {time.time() - x:.4f} seconds")
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_81162/1706856631.py:2: DtypeWarning: Columns (6,7,8,9,12,15,22,48) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("files/crsp_dsf.csv")
Reading uncompressed file: 23.8861 seconds
x = time.time()
df = pd.read_csv("files/crsp_dsf.csv.gz")
print(f"Reading compressed file: {time.time() - x:.4f} seconds")
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_81162/3793337616.py:2: DtypeWarning: Columns (6,7,8,9,12,15,22,48) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("files/crsp_dsf.csv.gz")
Reading compressed file: 25.3693 seconds

The times are very similar. Looking at the resulting DataFrame, we can see that the file contains 64 columns of different data types. The first few columns are identifiers such as the stock ticker, the date, and the exchange code. The remaining columns contain information about the stock such as prices, returns, and trading volumes. We can also see that the parsing of the file is incomplete: the dates are all of the object type. We will fix this shortly.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
 #   Column             Dtype  
---  ------             -----  
 0   PERMNO             int64  
 1   SecInfoStartDt     object 
 2   SecInfoEndDt       object 
 3   SecurityBegDt      object 
 4   SecurityEndDt      object 
 5   SecurityHdrFlg     object 
 6   HdrCUSIP           object 
 7   HdrCUSIP9          object 
 8   CUSIP              object 
 9   CUSIP9             object 
 10  PrimaryExch        object 
 11  ConditionalType    object 
 12  ExchangeTier       object 
 13  TradingStatusFlg   object 
 14  SecurityNm         object 
 15  ShareClass         object 
 16  USIncFlg           object 
 17  IssuerType         object 
 18  SecurityType       object 
 19  SecuritySubType    object 
 20  ShareType          object 
 21  SecurityActiveFlg  object 
 22  DelActionType      object 
 23  DelStatusType      object 
 24  DelReasonType      object 
 25  DelPaymentType     object 
 26  Ticker             object 
 27  TradingSymbol      object 
 28  PERMCO             int64  
 29  SICCD              int64  
 30  NAICS              int64  
 31  ICBIndustry        object 
 32  IssuerNm           object 
 33  YYYYMMDD           int64  
 34  DlyCalDt           object 
 35  DlyDelFlg          object 
 36  DlyPrc             float64
 37  DlyPrcFlg          object 
 38  DlyCap             float64
 39  DlyCapFlg          object 
 40  DlyPrevPrc         float64
 41  DlyPrevPrcFlg      object 
 42  DlyPrevDt          object 
 43  DlyPrevCap         float64
 44  DlyPrevCapFlg      object 
 45  DlyRet             float64
 46  DlyRetx            float64
 47  DlyRetI            float64
 48  DlyRetMissFlg      object 
 49  DlyRetDurFlg       object 
 50  DlyOrdDivAmt       float64
 51  DlyNonOrdDivAmt    float64
 52  DlyFacPrc          float64
 53  DlyDistRetFlg      object 
 54  DlyVol             float64
 55  DlyClose           float64
 56  DlyLow             float64
 57  DlyHigh            float64
 58  DlyBid             float64
 59  DlyAsk             float64
 60  DlyOpen            float64
 61  DlyNumTrd          float64
 62  DlyMMCnt           float64
 63  DlyPrcVol          float64
dtypes: float64(20), int64(5), object(39)
memory usage: 4.2+ GB
df
PERMNO SecInfoStartDt SecInfoEndDt SecurityBegDt SecurityEndDt SecurityHdrFlg HdrCUSIP HdrCUSIP9 CUSIP CUSIP9 ... DlyVol DlyClose DlyLow DlyHigh DlyBid DlyAsk DlyOpen DlyNumTrd DlyMMCnt DlyPrcVol
0 10026 2019-08-05 2021-04-26 1986-02-04 2023-12-29 N 46603210 466032109 46603210 466032109 ... 88291.0 181.68 181.440 185.30 181.67 181.70 185.30 2300.0 44.0 1.604071e+07
1 10026 2019-08-05 2021-04-26 1986-02-04 2023-12-29 N 46603210 466032109 46603210 466032109 ... 71463.0 184.91 180.890 185.17 184.90 184.91 180.89 1572.0 44.0 1.321422e+07
2 10026 2019-08-05 2021-04-26 1986-02-04 2023-12-29 N 46603210 466032109 46603210 466032109 ... 70308.0 185.07 183.140 185.52 184.77 185.07 184.18 1764.0 44.0 1.301190e+07
3 10026 2019-08-05 2021-04-26 1986-02-04 2023-12-29 N 46603210 466032109 46603210 466032109 ... 72267.0 183.03 182.475 184.63 182.90 183.22 184.40 2032.0 44.0 1.322703e+07
4 10026 2019-08-05 2021-04-26 1986-02-04 2023-12-29 N 46603210 466032109 46603210 466032109 ... 118592.0 182.04 181.810 183.89 182.04 182.23 182.73 2208.0 45.0 2.158849e+07
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8883151 93436 2023-10-01 2023-12-29 2010-06-29 2023-12-29 Y 88160R10 88160R101 88160R10 88160R101 ... 93148541.0 252.54 251.370 258.22 252.53 252.54 256.76 972876.0 54.0 2.352373e+10
8883152 93436 2023-10-01 2023-12-29 2010-06-29 2023-12-29 Y 88160R10 88160R101 88160R10 88160R101 ... 86700724.0 256.61 252.910 257.97 256.66 256.71 254.49 954492.0 54.0 2.224827e+10
8883153 93436 2023-10-01 2023-12-29 2010-06-29 2023-12-29 Y 88160R10 88160R101 88160R10 88160R101 ... 106250779.0 261.44 257.520 263.34 261.52 261.56 258.35 1046971.0 54.0 2.777820e+10
8883154 93436 2023-10-01 2023-12-29 2010-06-29 2023-12-29 Y 88160R10 88160R101 88160R10 88160R101 ... 113250680.0 253.18 252.710 265.13 253.08 253.19 263.66 1166971.0 54.0 2.867281e+10
8883155 93436 2023-10-01 2023-12-29 2010-06-29 2023-12-29 Y 88160R10 88160R101 88160R10 88160R101 ... 100321201.0 248.48 247.430 255.19 248.48 248.49 255.10 1062240.0 54.0 2.492781e+10

8883156 rows × 64 columns
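Incidentally, if you have already loaded the file, you can parse the date columns after the fact with pd.to_datetime(); a sketch, assuming the dates are ISO-formatted as above:

# Post-hoc parsing of object-typed date columns (sketch).
for col in ["SecInfoStartDt", "SecInfoEndDt", "DlyCalDt"]:
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")

The cleaner approach, used in the next section, is to parse the dates at read time with the parse_dates parameter.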

Tip 2: Use the pyarrow library with pandas

The c engine was historically the fastest engine for reading CSV files in pandas. It is the default engine and is written in C. The python engine, written in Python, is the slowest; it is used when the user specifies parameters that are not supported by the c engine. Pandas 2.0 added support for the pyarrow engine, a new engine that leverages the pyarrow library. It can be faster and more memory-efficient than the c engine because it supports multi-threading. Let’s compare the performance of the c engine and the pyarrow engine for reading the sample file (I tried with the python engine, but I got tired of waiting…). Note that the pyarrow engine should be fastest when paired with the pyarrow dtype backend, so we will compare the performance of the two engines with the default, numpy_nullable, and pyarrow dtype backends.

date_cols = [
    "SecInfoStartDt",
    "SecInfoEndDt",
    "SecurityBegDt",
    "SecurityEndDt",
    "YYYYMMDD",
    "DlyCalDt",
    "DlyPrevDt",
]
x = time.time()
df = pd.read_csv("files/crsp_dsf.csv.gz", parse_dates=date_cols, engine="c")
time_c_default = time.time() - x
df.info()
print(f"Reading using the c engine and default backend: {time_c_default:.4f} seconds")
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_81162/3729689134.py:2: DtypeWarning: Columns (6,7,8,9,12,15,22,48) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("files/crsp_dsf.csv.gz", parse_dates=date_cols, engine="c")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
 #   Column             Dtype         
---  ------             -----         
 0   PERMNO             int64         
 1   SecInfoStartDt     datetime64[ns]
 2   SecInfoEndDt       datetime64[ns]
 3   SecurityBegDt      datetime64[ns]
 4   SecurityEndDt      datetime64[ns]
 5   SecurityHdrFlg     object        
 6   HdrCUSIP           object        
 7   HdrCUSIP9          object        
 8   CUSIP              object        
 9   CUSIP9             object        
 10  PrimaryExch        object        
 11  ConditionalType    object        
 12  ExchangeTier       object        
 13  TradingStatusFlg   object        
 14  SecurityNm         object        
 15  ShareClass         object        
 16  USIncFlg           object        
 17  IssuerType         object        
 18  SecurityType       object        
 19  SecuritySubType    object        
 20  ShareType          object        
 21  SecurityActiveFlg  object        
 22  DelActionType      object        
 23  DelStatusType      object        
 24  DelReasonType      object        
 25  DelPaymentType     object        
 26  Ticker             object        
 27  TradingSymbol      object        
 28  PERMCO             int64         
 29  SICCD              int64         
 30  NAICS              int64         
 31  ICBIndustry        object        
 32  IssuerNm           object        
 33  YYYYMMDD           datetime64[ns]
 34  DlyCalDt           datetime64[ns]
 35  DlyDelFlg          object        
 36  DlyPrc             float64       
 37  DlyPrcFlg          object        
 38  DlyCap             float64       
 39  DlyCapFlg          object        
 40  DlyPrevPrc         float64       
 41  DlyPrevPrcFlg      object        
 42  DlyPrevDt          datetime64[ns]
 43  DlyPrevCap         float64       
 44  DlyPrevCapFlg      object        
 45  DlyRet             float64       
 46  DlyRetx            float64       
 47  DlyRetI            float64       
 48  DlyRetMissFlg      object        
 49  DlyRetDurFlg       object        
 50  DlyOrdDivAmt       float64       
 51  DlyNonOrdDivAmt    float64       
 52  DlyFacPrc          float64       
 53  DlyDistRetFlg      object        
 54  DlyVol             float64       
 55  DlyClose           float64       
 56  DlyLow             float64       
 57  DlyHigh            float64       
 58  DlyBid             float64       
 59  DlyAsk             float64       
 60  DlyOpen            float64       
 61  DlyNumTrd          float64       
 62  DlyMMCnt           float64       
 63  DlyPrcVol          float64       
dtypes: datetime64[ns](7), float64(20), int64(4), object(33)
memory usage: 4.2+ GB
Reading using the c engine and default backend: 28.3934 seconds
x = time.time()
df = pd.read_csv(
    "files/crsp_dsf.csv.gz",
    parse_dates=date_cols,
    engine="c",
    dtype_backend="numpy_nullable",
)
time_c_np = time.time() - x
df.info()
print(f"Reading using the c engine and numpy_nullable backend: {time_c_np:.4f} seconds")
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_81162/1999027842.py:2: DtypeWarning: Columns (6,7,8,9,12,15,22,48) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
 #   Column             Dtype         
---  ------             -----         
 0   PERMNO             Int64         
 1   SecInfoStartDt     datetime64[ns]
 2   SecInfoEndDt       datetime64[ns]
 3   SecurityBegDt      datetime64[ns]
 4   SecurityEndDt      datetime64[ns]
 5   SecurityHdrFlg     string        
 6   HdrCUSIP           object        
 7   HdrCUSIP9          object        
 8   CUSIP              object        
 9   CUSIP9             object        
 10  PrimaryExch        string        
 11  ConditionalType    string        
 12  ExchangeTier       object        
 13  TradingStatusFlg   string        
 14  SecurityNm         string        
 15  ShareClass         object        
 16  USIncFlg           string        
 17  IssuerType         string        
 18  SecurityType       string        
 19  SecuritySubType    string        
 20  ShareType          string        
 21  SecurityActiveFlg  string        
 22  DelActionType      object        
 23  DelStatusType      string        
 24  DelReasonType      string        
 25  DelPaymentType     string        
 26  Ticker             string        
 27  TradingSymbol      string        
 28  PERMCO             Int64         
 29  SICCD              Int64         
 30  NAICS              Int64         
 31  ICBIndustry        string        
 32  IssuerNm           string        
 33  YYYYMMDD           datetime64[ns]
 34  DlyCalDt           datetime64[ns]
 35  DlyDelFlg          string        
 36  DlyPrc             Float64       
 37  DlyPrcFlg          string        
 38  DlyCap             Float64       
 39  DlyCapFlg          string        
 40  DlyPrevPrc         Float64       
 41  DlyPrevPrcFlg      string        
 42  DlyPrevDt          object        
 43  DlyPrevCap         Float64       
 44  DlyPrevCapFlg      string        
 45  DlyRet             Float64       
 46  DlyRetx            Float64       
 47  DlyRetI            Float64       
 48  DlyRetMissFlg      object        
 49  DlyRetDurFlg       string        
 50  DlyOrdDivAmt       Float64       
 51  DlyNonOrdDivAmt    Float64       
 52  DlyFacPrc          Float64       
 53  DlyDistRetFlg      string        
 54  DlyVol             Int64         
 55  DlyClose           Float64       
 56  DlyLow             Float64       
 57  DlyHigh            Float64       
 58  DlyBid             Float64       
 59  DlyAsk             Float64       
 60  DlyOpen            Float64       
 61  DlyNumTrd          Int64         
 62  DlyMMCnt           Int64         
 63  DlyPrcVol          Float64       
dtypes: Float64(17), Int64(7), datetime64[ns](6), object(9), string(25)
memory usage: 4.4+ GB
Reading using the c engine and numpy_nullable backend: 34.7816 seconds
x = time.time()
df = pd.read_csv(
    "files/crsp_dsf.csv.gz", parse_dates=date_cols, engine="c", dtype_backend="pyarrow"
)
time_c_arrow = time.time() - x
df.info()
print(f"Reading using the c engine and pyarrow backend: {time_c_arrow:.4f} seconds")
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_81162/2542695645.py:2: DtypeWarning: Columns (6,7,8,9) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
 #   Column             Dtype          
---  ------             -----          
 0   PERMNO             int64[pyarrow] 
 1   SecInfoStartDt     datetime64[ns] 
 2   SecInfoEndDt       datetime64[ns] 
 3   SecurityBegDt      datetime64[ns] 
 4   SecurityEndDt      datetime64[ns] 
 5   SecurityHdrFlg     string[pyarrow]
 6   HdrCUSIP           object         
 7   HdrCUSIP9          object         
 8   CUSIP              object         
 9   CUSIP9             object         
 10  PrimaryExch        string[pyarrow]
 11  ConditionalType    string[pyarrow]
 12  ExchangeTier       string[pyarrow]
 13  TradingStatusFlg   string[pyarrow]
 14  SecurityNm         string[pyarrow]
 15  ShareClass         string[pyarrow]
 16  USIncFlg           string[pyarrow]
 17  IssuerType         string[pyarrow]
 18  SecurityType       string[pyarrow]
 19  SecuritySubType    string[pyarrow]
 20  ShareType          string[pyarrow]
 21  SecurityActiveFlg  string[pyarrow]
 22  DelActionType      string[pyarrow]
 23  DelStatusType      string[pyarrow]
 24  DelReasonType      string[pyarrow]
 25  DelPaymentType     string[pyarrow]
 26  Ticker             string[pyarrow]
 27  TradingSymbol      string[pyarrow]
 28  PERMCO             int64[pyarrow] 
 29  SICCD              int64[pyarrow] 
 30  NAICS              int64[pyarrow] 
 31  ICBIndustry        string[pyarrow]
 32  IssuerNm           string[pyarrow]
 33  YYYYMMDD           datetime64[ns] 
 34  DlyCalDt           datetime64[ns] 
 35  DlyDelFlg          string[pyarrow]
 36  DlyPrc             double[pyarrow]
 37  DlyPrcFlg          string[pyarrow]
 38  DlyCap             double[pyarrow]
 39  DlyCapFlg          string[pyarrow]
 40  DlyPrevPrc         double[pyarrow]
 41  DlyPrevPrcFlg      string[pyarrow]
 42  DlyPrevDt          object         
 43  DlyPrevCap         double[pyarrow]
 44  DlyPrevCapFlg      string[pyarrow]
 45  DlyRet             double[pyarrow]
 46  DlyRetx            double[pyarrow]
 47  DlyRetI            double[pyarrow]
 48  DlyRetMissFlg      string[pyarrow]
 49  DlyRetDurFlg       string[pyarrow]
 50  DlyOrdDivAmt       double[pyarrow]
 51  DlyNonOrdDivAmt    double[pyarrow]
 52  DlyFacPrc          double[pyarrow]
 53  DlyDistRetFlg      string[pyarrow]
 54  DlyVol             int64[pyarrow] 
 55  DlyClose           double[pyarrow]
 56  DlyLow             double[pyarrow]
 57  DlyHigh            double[pyarrow]
 58  DlyBid             double[pyarrow]
 59  DlyAsk             double[pyarrow]
 60  DlyOpen            double[pyarrow]
 61  DlyNumTrd          int64[pyarrow] 
 62  DlyMMCnt           int64[pyarrow] 
 63  DlyPrcVol          double[pyarrow]
dtypes: datetime64[ns](6), double[pyarrow](17), int64[pyarrow](7), object(5), string[pyarrow](29)
memory usage: 4.2+ GB
Reading using the c engine and pyarrow backend: 51.4582 seconds
x = time.time()
df = pd.read_csv("files/crsp_dsf.csv.gz", parse_dates=date_cols, engine="pyarrow")
time_arrow_default = time.time() - x
df.info()
print(
    f"Reading using the pyarrow engine and default backend: {time_arrow_default:.4f} seconds"
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
 #   Column             Dtype         
---  ------             -----         
 0   PERMNO             int64         
 1   SecInfoStartDt     datetime64[ns]
 2   SecInfoEndDt       datetime64[ns]
 3   SecurityBegDt      datetime64[ns]
 4   SecurityEndDt      datetime64[ns]
 5   SecurityHdrFlg     object        
 6   HdrCUSIP           object        
 7   HdrCUSIP9          object        
 8   CUSIP              object        
 9   CUSIP9             object        
 10  PrimaryExch        object        
 11  ConditionalType    object        
 12  ExchangeTier       object        
 13  TradingStatusFlg   object        
 14  SecurityNm         object        
 15  ShareClass         object        
 16  USIncFlg           object        
 17  IssuerType         object        
 18  SecurityType       object        
 19  SecuritySubType    object        
 20  ShareType          object        
 21  SecurityActiveFlg  object        
 22  DelActionType      object        
 23  DelStatusType      object        
 24  DelReasonType      object        
 25  DelPaymentType     object        
 26  Ticker             object        
 27  TradingSymbol      object        
 28  PERMCO             int64         
 29  SICCD              int64         
 30  NAICS              int64         
 31  ICBIndustry        object        
 32  IssuerNm           object        
 33  YYYYMMDD           datetime64[ns]
 34  DlyCalDt           datetime64[ns]
 35  DlyDelFlg          object        
 36  DlyPrc             float64       
 37  DlyPrcFlg          object        
 38  DlyCap             float64       
 39  DlyCapFlg          object        
 40  DlyPrevPrc         float64       
 41  DlyPrevPrcFlg      object        
 42  DlyPrevDt          object        
 43  DlyPrevCap         float64       
 44  DlyPrevCapFlg      object        
 45  DlyRet             float64       
 46  DlyRetx            float64       
 47  DlyRetI            float64       
 48  DlyRetMissFlg      object        
 49  DlyRetDurFlg       object        
 50  DlyOrdDivAmt       float64       
 51  DlyNonOrdDivAmt    float64       
 52  DlyFacPrc          float64       
 53  DlyDistRetFlg      object        
 54  DlyVol             float64       
 55  DlyClose           float64       
 56  DlyLow             float64       
 57  DlyHigh            float64       
 58  DlyBid             float64       
 59  DlyAsk             float64       
 60  DlyOpen            float64       
 61  DlyNumTrd          float64       
 62  DlyMMCnt           float64       
 63  DlyPrcVol          float64       
dtypes: datetime64[ns](6), float64(20), int64(4), object(34)
memory usage: 4.2+ GB
Reading using the pyarrow engine and default backend: 21.9848 seconds
x = time.time()
df = pd.read_csv(
    "files/crsp_dsf.csv.gz",
    parse_dates=date_cols,
    engine="pyarrow",
    dtype_backend="numpy_nullable",
)
time_arrow_np = time.time() - x
df.info()
print(
    f"Reading using the pyarrow engine and numpy_nullable backend: {time_arrow_np:.4f} seconds"
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
 #   Column             Dtype         
---  ------             -----         
 0   PERMNO             Int64         
 1   SecInfoStartDt     datetime64[ns]
 2   SecInfoEndDt       datetime64[ns]
 3   SecurityBegDt      datetime64[ns]
 4   SecurityEndDt      datetime64[ns]
 5   SecurityHdrFlg     string        
 6   HdrCUSIP           string        
 7   HdrCUSIP9          string        
 8   CUSIP              string        
 9   CUSIP9             string        
 10  PrimaryExch        string        
 11  ConditionalType    string        
 12  ExchangeTier       string        
 13  TradingStatusFlg   string        
 14  SecurityNm         string        
 15  ShareClass         string        
 16  USIncFlg           string        
 17  IssuerType         string        
 18  SecurityType       string        
 19  SecuritySubType    string        
 20  ShareType          string        
 21  SecurityActiveFlg  string        
 22  DelActionType      string        
 23  DelStatusType      string        
 24  DelReasonType      string        
 25  DelPaymentType     string        
 26  Ticker             string        
 27  TradingSymbol      string        
 28  PERMCO             Int64         
 29  SICCD              Int64         
 30  NAICS              Int64         
 31  ICBIndustry        string        
 32  IssuerNm           string        
 33  YYYYMMDD           datetime64[ns]
 34  DlyCalDt           datetime64[ns]
 35  DlyDelFlg          string        
 36  DlyPrc             Float64       
 37  DlyPrcFlg          string        
 38  DlyCap             Float64       
 39  DlyCapFlg          string        
 40  DlyPrevPrc         Float64       
 41  DlyPrevPrcFlg      string        
 42  DlyPrevDt          object        
 43  DlyPrevCap         Float64       
 44  DlyPrevCapFlg      string        
 45  DlyRet             Float64       
 46  DlyRetx            Float64       
 47  DlyRetI            Float64       
 48  DlyRetMissFlg      string        
 49  DlyRetDurFlg       string        
 50  DlyOrdDivAmt       Float64       
 51  DlyNonOrdDivAmt    Float64       
 52  DlyFacPrc          Float64       
 53  DlyDistRetFlg      string        
 54  DlyVol             Int64         
 55  DlyClose           Float64       
 56  DlyLow             Float64       
 57  DlyHigh            Float64       
 58  DlyBid             Float64       
 59  DlyAsk             Float64       
 60  DlyOpen            Float64       
 61  DlyNumTrd          Int64         
 62  DlyMMCnt           Int64         
 63  DlyPrcVol          Float64       
dtypes: Float64(17), Int64(7), datetime64[ns](6), object(1), string(33)
memory usage: 4.4+ GB
Reading using the pyarrow engine and numpy_nullable backend: 33.7553 seconds
x = time.time()
df = pd.read_csv(
    "files/crsp_dsf.csv.gz",
    parse_dates=date_cols,
    engine="pyarrow",
    dtype_backend="pyarrow",
)
time_arrow_arrow = time.time() - x
df.info()
print(
    f"Reading using the pyarrow engine and pyarrow backend: {time_arrow_arrow:.4f} seconds"
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
 #   Column             Dtype               
---  ------             -----               
 0   PERMNO             int64[pyarrow]      
 1   SecInfoStartDt     date32[day][pyarrow]
 2   SecInfoEndDt       date32[day][pyarrow]
 3   SecurityBegDt      date32[day][pyarrow]
 4   SecurityEndDt      date32[day][pyarrow]
 5   SecurityHdrFlg     string[pyarrow]     
 6   HdrCUSIP           string[pyarrow]     
 7   HdrCUSIP9          string[pyarrow]     
 8   CUSIP              string[pyarrow]     
 9   CUSIP9             string[pyarrow]     
 10  PrimaryExch        string[pyarrow]     
 11  ConditionalType    string[pyarrow]     
 12  ExchangeTier       string[pyarrow]     
 13  TradingStatusFlg   string[pyarrow]     
 14  SecurityNm         string[pyarrow]     
 15  ShareClass         string[pyarrow]     
 16  USIncFlg           string[pyarrow]     
 17  IssuerType         string[pyarrow]     
 18  SecurityType       string[pyarrow]     
 19  SecuritySubType    string[pyarrow]     
 20  ShareType          string[pyarrow]     
 21  SecurityActiveFlg  string[pyarrow]     
 22  DelActionType      string[pyarrow]     
 23  DelStatusType      string[pyarrow]     
 24  DelReasonType      string[pyarrow]     
 25  DelPaymentType     string[pyarrow]     
 26  Ticker             string[pyarrow]     
 27  TradingSymbol      string[pyarrow]     
 28  PERMCO             int64[pyarrow]      
 29  SICCD              int64[pyarrow]      
 30  NAICS              int64[pyarrow]      
 31  ICBIndustry        string[pyarrow]     
 32  IssuerNm           string[pyarrow]     
 33  YYYYMMDD           datetime64[ns]      
 34  DlyCalDt           date32[day][pyarrow]
 35  DlyDelFlg          string[pyarrow]     
 36  DlyPrc             double[pyarrow]     
 37  DlyPrcFlg          string[pyarrow]     
 38  DlyCap             double[pyarrow]     
 39  DlyCapFlg          string[pyarrow]     
 40  DlyPrevPrc         double[pyarrow]     
 41  DlyPrevPrcFlg      string[pyarrow]     
 42  DlyPrevDt          date32[day][pyarrow]
 43  DlyPrevCap         double[pyarrow]     
 44  DlyPrevCapFlg      string[pyarrow]     
 45  DlyRet             double[pyarrow]     
 46  DlyRetx            double[pyarrow]     
 47  DlyRetI            double[pyarrow]     
 48  DlyRetMissFlg      string[pyarrow]     
 49  DlyRetDurFlg       string[pyarrow]     
 50  DlyOrdDivAmt       double[pyarrow]     
 51  DlyNonOrdDivAmt    double[pyarrow]     
 52  DlyFacPrc          double[pyarrow]     
 53  DlyDistRetFlg      string[pyarrow]     
 54  DlyVol             int64[pyarrow]      
 55  DlyClose           double[pyarrow]     
 56  DlyLow             double[pyarrow]     
 57  DlyHigh            double[pyarrow]     
 58  DlyBid             double[pyarrow]     
 59  DlyAsk             double[pyarrow]     
 60  DlyOpen            double[pyarrow]     
 61  DlyNumTrd          int64[pyarrow]      
 62  DlyMMCnt           int64[pyarrow]      
 63  DlyPrcVol          double[pyarrow]     
dtypes: date32[day][pyarrow](6), datetime64[ns](1), double[pyarrow](17), int64[pyarrow](7), string[pyarrow](33)
memory usage: 4.2 GB
Reading using the pyarrow engine and pyarrow backend: 6.4519 seconds

Tip 3: Store the DataFrame in the Parquet format

An even better way to avoid unnecessary parsing is to use a more efficient file format. CSV files are easy to create and read, but they are not the most efficient format. Because they store the data as text, pandas has to parse the data every time we read the file. The Apache Parquet file format is more efficient because it stores the data in a binary format that is optimized for efficient reading. This means that pandas does not need to parse the data again every time we read the file. In addition, Parquet files are compressed by default, so they take less space on disk.

Parquet files are columnar, which means that the data is stored by column. This is especially efficient if you only need to read a subset of the columns. Another efficient columnar file format supported by pandas is Apache ORC, but it is not as widely supported as Parquet.
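If you want to try ORC, the round trip looks much like Parquet. A sketch (DataFrame.to_orc() requires pandas 1.5+ with pyarrow and is not available on all platforms; the path is hypothetical):

df.to_orc("files/crsp_dsf.orc", engine="pyarrow")
df = pd.read_orc("files/crsp_dsf.orc", dtype_backend="pyarrow")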

Parquet files are supported by pandas through the pyarrow library, so you will need to install it if you want to use Parquet files.

In the example below, both writing and reading the Parquet file are fast. Reading the Parquet file takes well under a second, roughly nine times faster than the fastest CSV read above (about 6.5 seconds with the pyarrow engine and backend), and the Parquet file also takes up less space on disk than the compressed CSV file.

x = time.time()
df.to_parquet("files/crsp_dsf.parquet", engine="pyarrow", index=False)
time_write_parquet = time.time() - x
print(
    f"Writing to Parquet using the pyarrow engine and pyarrow backend: {time_write_parquet} seconds"
)
Writing to Parquet using the pyarrow engine and pyarrow backend: 7.636731147766113 seconds
x = time.time()
df = pd.read_parquet(
    "files/crsp_dsf.parquet", engine="pyarrow", dtype_backend="pyarrow"
)
time_parquet_arrow_arrow = time.time() - x
df.info()
print(
    f"Reading Parquet file using the pyarrow engine and pyarrow backend: {time_parquet_arrow_arrow} seconds"
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
 #   Column             Dtype                 
---  ------             -----                 
 0   PERMNO             int64[pyarrow]        
 1   SecInfoStartDt     date32[day][pyarrow]  
 2   SecInfoEndDt       date32[day][pyarrow]  
 3   SecurityBegDt      date32[day][pyarrow]  
 4   SecurityEndDt      date32[day][pyarrow]  
 5   SecurityHdrFlg     string[pyarrow]       
 6   HdrCUSIP           string[pyarrow]       
 7   HdrCUSIP9          string[pyarrow]       
 8   CUSIP              string[pyarrow]       
 9   CUSIP9             string[pyarrow]       
 10  PrimaryExch        string[pyarrow]       
 11  ConditionalType    string[pyarrow]       
 12  ExchangeTier       string[pyarrow]       
 13  TradingStatusFlg   string[pyarrow]       
 14  SecurityNm         string[pyarrow]       
 15  ShareClass         string[pyarrow]       
 16  USIncFlg           string[pyarrow]       
 17  IssuerType         string[pyarrow]       
 18  SecurityType       string[pyarrow]       
 19  SecuritySubType    string[pyarrow]       
 20  ShareType          string[pyarrow]       
 21  SecurityActiveFlg  string[pyarrow]       
 22  DelActionType      string[pyarrow]       
 23  DelStatusType      string[pyarrow]       
 24  DelReasonType      string[pyarrow]       
 25  DelPaymentType     string[pyarrow]       
 26  Ticker             string[pyarrow]       
 27  TradingSymbol      string[pyarrow]       
 28  PERMCO             int64[pyarrow]        
 29  SICCD              int64[pyarrow]        
 30  NAICS              int64[pyarrow]        
 31  ICBIndustry        string[pyarrow]       
 32  IssuerNm           string[pyarrow]       
 33  YYYYMMDD           timestamp[ns][pyarrow]
 34  DlyCalDt           date32[day][pyarrow]  
 35  DlyDelFlg          string[pyarrow]       
 36  DlyPrc             double[pyarrow]       
 37  DlyPrcFlg          string[pyarrow]       
 38  DlyCap             double[pyarrow]       
 39  DlyCapFlg          string[pyarrow]       
 40  DlyPrevPrc         double[pyarrow]       
 41  DlyPrevPrcFlg      string[pyarrow]       
 42  DlyPrevDt          date32[day][pyarrow]  
 43  DlyPrevCap         double[pyarrow]       
 44  DlyPrevCapFlg      string[pyarrow]       
 45  DlyRet             double[pyarrow]       
 46  DlyRetx            double[pyarrow]       
 47  DlyRetI            double[pyarrow]       
 48  DlyRetMissFlg      string[pyarrow]       
 49  DlyRetDurFlg       string[pyarrow]       
 50  DlyOrdDivAmt       double[pyarrow]       
 51  DlyNonOrdDivAmt    double[pyarrow]       
 52  DlyFacPrc          double[pyarrow]       
 53  DlyDistRetFlg      string[pyarrow]       
 54  DlyVol             int64[pyarrow]        
 55  DlyClose           double[pyarrow]       
 56  DlyLow             double[pyarrow]       
 57  DlyHigh            double[pyarrow]       
 58  DlyBid             double[pyarrow]       
 59  DlyAsk             double[pyarrow]       
 60  DlyOpen            double[pyarrow]       
 61  DlyNumTrd          int64[pyarrow]        
 62  DlyMMCnt           int64[pyarrow]        
 63  DlyPrcVol          double[pyarrow]       
dtypes: date32[day][pyarrow](6), double[pyarrow](17), int64[pyarrow](7), string[pyarrow](33), timestamp[ns][pyarrow](1)
memory usage: 4.2 GB
Reading Parquet file using the pyarrow engine and pyarrow backend: 0.7189159393310547 seconds

Tip 4: Only read the columns you need

The next tip is to read only the columns you need. This is especially important when working with large files, as it saves both memory and time. We can specify the columns we want with the usecols parameter of the read_csv function or the columns parameter of the read_parquet function. Note that for CSV files the time saved comes mostly from not having to parse the columns we are not interested in; because CSV files are not indexed, we still have to read the entire file. Parquet files are indexed and columnar, so the pyarrow engine does not have to read the entire file if we only need a subset of the columns.

cols = [
    "DlyCalDt",
    "PERMNO",
    "CUSIP",
    "Ticker",
    "PrimaryExch",
    "SecurityType",
    "SICCD",
    "DlyRet",
    "DlyPrc",
]
x = time.time()
df = pd.read_csv(
    "files/crsp_dsf.csv.gz", usecols=cols, parse_dates=["DlyCalDt"], engine="c"
)
time_c_default_subset = time.time() - x
df.info()
print(
    f"Reading a subset of columns using the c engine and default backend: {time_c_default_subset:.4f} seconds"
)
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_81162/3713586357.py:2: DtypeWarning: Columns (8) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 9 columns):
 #   Column        Dtype         
---  ------        -----         
 0   PERMNO        int64         
 1   CUSIP         object        
 2   PrimaryExch   object        
 3   SecurityType  object        
 4   Ticker        object        
 5   SICCD         int64         
 6   DlyCalDt      datetime64[ns]
 7   DlyPrc        float64       
 8   DlyRet        float64       
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 610.0+ MB
Reading a subset of columns using the c engine and default backend: 11.7283 seconds
x = time.time()
df = pd.read_csv(
    "files/crsp_dsf.csv.gz",
    usecols=cols,
    parse_dates=["DlyCalDt"],
    engine="pyarrow",
    dtype_backend="pyarrow",
)
time_arrow_arrow_subset = time.time() - x
df.info()
print(
    f"Reading a subset of columns using the pyarrow engine and pyarrow backend: {time_arrow_arrow_subset:.4f} seconds"
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 9 columns):
 #   Column        Dtype               
---  ------        -----               
 0   DlyCalDt      date32[day][pyarrow]
 1   PERMNO        int64[pyarrow]      
 2   CUSIP         string[pyarrow]     
 3   Ticker        string[pyarrow]     
 4   PrimaryExch   string[pyarrow]     
 5   SecurityType  string[pyarrow]     
 6   SICCD         int64[pyarrow]      
 7   DlyRet        double[pyarrow]     
 8   DlyPrc        double[pyarrow]     
dtypes: date32[day][pyarrow](1), double[pyarrow](2), int64[pyarrow](2), string[pyarrow](4)
memory usage: 583.0 MB
Reading a subset of columns using the pyarrow engine and pyarrow backend: 2.9267 seconds
x = time.time()
df = pd.read_parquet(
    "files/crsp_dsf.parquet", columns=cols, engine="pyarrow", dtype_backend="pyarrow"
)
time_parquet_arrow_arrow_subset = time.time() - x
df.info()
print(
    f"Reading a subset of columns from Parquet file using the pyarrow engine and pyarrow backend: {time_parquet_arrow_arrow_subset:.4f} seconds"
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 9 columns):
 #   Column        Dtype               
---  ------        -----               
 0   DlyCalDt      date32[day][pyarrow]
 1   PERMNO        int64[pyarrow]      
 2   CUSIP         string[pyarrow]     
 3   Ticker        string[pyarrow]     
 4   PrimaryExch   string[pyarrow]     
 5   SecurityType  string[pyarrow]     
 6   SICCD         int64[pyarrow]      
 7   DlyRet        double[pyarrow]     
 8   DlyPrc        double[pyarrow]     
dtypes: date32[day][pyarrow](1), double[pyarrow](2), int64[pyarrow](2), string[pyarrow](4)
memory usage: 588.7 MB
Reading a subset of columns from Parquet file using the pyarrow engine and pyarrow backend: 0.0876 seconds

Tip 5 (Parquet-only): Only read the rows you need

Pandas’ read_parquet() function passes additional arguments to the pyarrow.parquet.read_table() function, so we can use the filters parameter to read only the rows we need. The filters parameter takes a list of tuples, where each tuple contains the column name, the operator, and the value. For example, if we only want to read the rows for January 2023, we can use the following filter:

from datetime import date

filters = [("DlyCalDt", ">=", date(2023, 1, 1)), ("DlyCalDt", "<", date(2023, 2, 1))]

x = time.time()
df = pd.read_parquet(
    "files/crsp_dsf.parquet",
    columns=cols,
    filters=filters,
    engine="pyarrow",
    dtype_backend="pyarrow",
)
time_parquet_arrow_arrow_subset = time.time() - x
df.info()
print(
    f"Reading a subset of rows and columns from Parquet file using the pyarrow engine and pyarrow backend: {time_parquet_arrow_arrow_subset:.4f} seconds"
)
df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189762 entries, 0 to 189761
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype               
---  ------        --------------   -----               
 0   DlyCalDt      189762 non-null  date32[day][pyarrow]
 1   PERMNO        189762 non-null  int64[pyarrow]      
 2   CUSIP         189695 non-null  string[pyarrow]     
 3   Ticker        189655 non-null  string[pyarrow]     
 4   PrimaryExch   189762 non-null  string[pyarrow]     
 5   SecurityType  189695 non-null  string[pyarrow]     
 6   SICCD         189762 non-null  int64[pyarrow]      
 7   DlyRet        189592 non-null  double[pyarrow]     
 8   DlyPrc        189633 non-null  double[pyarrow]     
dtypes: date32[day][pyarrow](1), double[pyarrow](2), int64[pyarrow](2), string[pyarrow](4)
memory usage: 12.6 MB
Reading a subset of rows and columns from Parquet file using the pyarrow engine and pyarrow backend: 0.0680 seconds
DlyCalDt PERMNO CUSIP Ticker PrimaryExch SecurityType SICCD DlyRet DlyPrc
0 2023-01-03 10026 46603210 JJSF Q EQTY 2052 0.011823 151.48
1 2023-01-04 10026 46603210 JJSF Q EQTY 2052 -0.001716 151.22
2 2023-01-05 10026 46603210 JJSF Q EQTY 2052 -0.010713 149.60
3 2023-01-06 10026 46603210 JJSF Q EQTY 2052 0.020321 152.64
4 2023-01-09 10026 46603210 JJSF Q EQTY 2052 -0.019851 149.61
... ... ... ... ... ... ... ... ... ...
189757 2023-01-25 93436 88160R10 TSLA Q EQTY 9999 0.003753 144.43
189758 2023-01-26 93436 88160R10 TSLA Q EQTY 9999 0.109673 160.27
189759 2023-01-27 93436 88160R10 TSLA Q EQTY 9999 0.110002 177.90
189760 2023-01-30 93436 88160R10 TSLA Q EQTY 9999 -0.063182 166.66
189761 2023-01-31 93436 88160R10 TSLA Q EQTY 9999 0.039362 173.22

189762 rows × 9 columns

Note that this is not necessarily faster than reading the entire file and then filtering the rows with pandas, but it can be more memory efficient because we do not have to read the entire file into memory.

x = time.time()
df = pd.read_parquet(
    "files/crsp_dsf.parquet",
    columns=cols,
    engine="pyarrow",
    dtype_backend="pyarrow",
)
df = df[(df["DlyCalDt"] >= date(2023, 1, 1)) & (df["DlyCalDt"] < date(2023, 2, 1))]

time_parquet_arrow_arrow_subset = time.time() - x
df.info()
print(
    f"Reading a subset of columns from Parquet file using the pyarrow engine and pyarrow backend and filtering rows using pandas: {time_parquet_arrow_arrow_subset:.4f} seconds"
)
df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189762 entries, 0 to 189761
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype               
---  ------        --------------   -----               
 0   DlyCalDt      189762 non-null  date32[day][pyarrow]
 1   PERMNO        189762 non-null  int64[pyarrow]      
 2   CUSIP         189695 non-null  string[pyarrow]     
 3   Ticker        189655 non-null  string[pyarrow]     
 4   PrimaryExch   189762 non-null  string[pyarrow]     
 5   SecurityType  189695 non-null  string[pyarrow]     
 6   SICCD         189762 non-null  int64[pyarrow]      
 7   DlyRet        189592 non-null  double[pyarrow]     
 8   DlyPrc        189633 non-null  double[pyarrow]     
dtypes: date32[day][pyarrow](1), double[pyarrow](2), int64[pyarrow](2), string[pyarrow](4)
memory usage: 12.6 MB
Reading a subset of columns from Parquet file using the pyarrow engine and pyarrow backend and filtering rows using pandas: 0.0622 seconds
DlyCalDt PERMNO CUSIP Ticker PrimaryExch SecurityType SICCD DlyRet DlyPrc
0 2023-01-03 10026 46603210 JJSF Q EQTY 2052 0.011823 151.48
1 2023-01-04 10026 46603210 JJSF Q EQTY 2052 -0.001716 151.22
2 2023-01-05 10026 46603210 JJSF Q EQTY 2052 -0.010713 149.60
3 2023-01-06 10026 46603210 JJSF Q EQTY 2052 0.020321 152.64
4 2023-01-09 10026 46603210 JJSF Q EQTY 2052 -0.019851 149.61
... ... ... ... ... ... ... ... ... ...
189757 2023-01-25 93436 88160R10 TSLA Q EQTY 9999 0.003753 144.43
189758 2023-01-26 93436 88160R10 TSLA Q EQTY 9999 0.109673 160.27
189759 2023-01-27 93436 88160R10 TSLA Q EQTY 9999 0.110002 177.90
189760 2023-01-30 93436 88160R10 TSLA Q EQTY 9999 -0.063182 166.66
189761 2023-01-31 93436 88160R10 TSLA Q EQTY 9999 0.039362 173.22

189762 rows × 9 columns

Other benefits of using the pyarrow dtype backend

For casual pandas users, the pyarrow dtype backend provides long-needed features such as the ability to store missing values in integer and boolean columns¹, as well as faster reading and writing of files.

However, it is meant to be more than just a replacement for the numpy dtype backend. The Apache Arrow project is a cross-language development platform for in-memory data that is designed to accelerate the performance of analytical workloads. The pyarrow library is the Python implementation of the Arrow project. One of the goals of the Arrow project is to provide a common in-memory data format for data interchange between different programming languages. This means that a pandas DataFrame with a pyarrow dtype backend can be used as an input to other Arrow-compatible libraries such as Polars and DuckDB, or even in a different programming language such as R or Julia. This lets you use the best tool for the job without having to worry about data interchange and the associated performance overhead.
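For instance, here is a sketch of handing the same DataFrame to Polars and DuckDB, assuming both libraries are installed (the query itself is illustrative):

import duckdb
import polars as pl

# Polars can ingest the pandas DataFrame; with an Arrow-backed frame,
# the conversion can avoid copying the data.
pl_df = pl.from_pandas(df)

# DuckDB can query the pandas DataFrame in place, referring to it by name.
avg_ret = duckdb.sql(
    "SELECT PrimaryExch, AVG(DlyRet) AS mean_ret FROM df GROUP BY PrimaryExch"
).df()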

Caveats

Experimental Features

The dtype_backend parameter of the read_csv function is still labeled as experimental in the pandas documentation, so use it with caution.

Dtypes

The data types (dtypes), i.e., the in-memory representation of the data, are not the same for the default (numpy), numpy_nullable, and pyarrow dtype backends. This means that if your existing code defines the data types of columns, you will need to update it when switching to the pyarrow dtype backend. In most cases, all you need to do is append [pyarrow] to the data type, e.g., int64[pyarrow] instead of int64. Note, however, that the types do not all map one-to-one; for example, the default dtype backend uses the object type to represent strings while the pyarrow dtype backend uses the string[pyarrow] type. You should test your code to make sure it works as expected; have a look at the pandas documentation for more information.
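As a sketch, here is the same column specification in both worlds, using a hypothetical one-column read of the sample file:

# numpy-backed dtype specification:
df_np = pd.read_csv(
    "files/crsp_dsf.csv.gz", usecols=["PERMNO"], dtype={"PERMNO": "int64"}
)

# Arrow-backed: append [pyarrow] to the dtype string.
df_pa = pd.read_csv(
    "files/crsp_dsf.csv.gz", usecols=["PERMNO"], dtype={"PERMNO": "int64[pyarrow]"}
)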

Conclusion

I hope these tips will save you time and memory when reading and parsing large CSV files in Python! As you can see, the Arrow library and the Parquet file format are powerful tools for reading and writing data, and they are worth exploring if you are working with large datasets. This is not surprising: the Parquet format was created as a columnar storage format for the Hadoop big data ecosystem, and the Arrow library was created to provide a standard for in-memory data representation. Arrow is also used by other data processing tools such as DuckDB and Polars, so it is well worth learning about.

Footnotes

  1. Note that the numpy_nullable backend has supported this for a while.
