import pandas as pd
import pyarrow as pa
import time
print(f"pandas version: {pd.__version__}")
print(f"pyarrow version: {pa.__version__}")
pandas version: 2.1.4
pyarrow version: 14.0.1
Vincent Grégoire, PhD, CFA
March 1, 2024
CSV files are ubiquitous in empirical finance and data science more generally. They are easy to create and read and are supported by most data processing tools. The `pandas.read_csv()` function is flexible and easy to use, but it can be slow and memory-intensive for large files if not used properly.
In this post, I will show you how to read and parse large CSV files efficiently in Python using the `pyarrow` library with pandas. I will also show you how to convert the CSV file to the Parquet format, a columnar storage format that is more efficient than CSV for reading and writing data.
This post is also available as a video tutorial on YouTube.
The `pyarrow` library

According to the website,
Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system (or programming language) to another.

A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. This data format has a rich data type system (including nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.
Let’s unpack this a bit:
Arrow provides a set of libraries for different programming languages, including C++, Java, Python, and R. The `pyarrow` library is the Python implementation of Arrow. It provides a way to work with Arrow data in Python, including converting data between Arrow and other formats such as CSV and Parquet. A big change that came with pandas 2.0 is that `pyarrow` can now be used with pandas not only to read and write data, but also as a dtype backend, which means that pandas can use the Arrow in-memory format to store and process data.

Of note, Wes McKinney, who created pandas, is also a co-creator of Apache Arrow.
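As a minimal sketch (assuming pandas 2.0+ with pyarrow installed; the file path is a placeholder), opting into the Arrow-backed dtypes looks like this:

import pandas as pd

# Read a CSV straight into Arrow-backed dtypes; "data.csv" is a placeholder.
df = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")
print(df.dtypes)  # dtypes such as int64[pyarrow] and string[pyarrow]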
Reading large CSV files in Python with pandas can take time and slow down your workflow, especially if it is done repeatedly.
As an example, I use a large CSV file containing daily stock information such as prices, returns, and trading volumes for publicly listed U.S. stocks from the Center for Research in Security Prices (CRSP) that I obtained from Wharton Research Data Services (WRDS). The file contains 8,883,156 rows and 64 columns of different data types. The original file is a CSV file compressed in the gzip format (`.csv.gz`) that is 486.4 MB in size. The uncompressed file is 3.91 GB in size.
The timing results presented in this post were obtained on a MacBook Pro with an M3 Max processor and 64 GB of RAM. If done properly, RAM should not be a limiting factor for reading and processing the sample file. My tips will focus on speed first and memory usage second.
First, let’s take a look at the first few rows of the file to understand its structure:
 | PERMNO | SecInfoStartDt | SecInfoEndDt | SecurityBegDt | SecurityEndDt | SecurityHdrFlg | HdrCUSIP | HdrCUSIP9 | CUSIP | CUSIP9 | ... | DlyVol | DlyClose | DlyLow | DlyHigh | DlyBid | DlyAsk | DlyOpen | DlyNumTrd | DlyMMCnt | DlyPrcVol |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10026 | 2019-08-05 | 2021-04-26 | 1986-02-04 | 2023-12-29 | N | 46603210 | 466032109 | 46603210 | 466032109 | ... | 88291 | 181.68 | 181.440 | 185.30 | 181.67 | 181.70 | 185.30 | 2300 | 44 | 16040708.9 |
1 | 10026 | 2019-08-05 | 2021-04-26 | 1986-02-04 | 2023-12-29 | N | 46603210 | 466032109 | 46603210 | 466032109 | ... | 71463 | 184.91 | 180.890 | 185.17 | 184.90 | 184.91 | 180.89 | 1572 | 44 | 13214223.3 |
2 | 10026 | 2019-08-05 | 2021-04-26 | 1986-02-04 | 2023-12-29 | N | 46603210 | 466032109 | 46603210 | 466032109 | ... | 70308 | 185.07 | 183.140 | 185.52 | 184.77 | 185.07 | 184.18 | 1764 | 44 | 13011901.6 |
3 | 10026 | 2019-08-05 | 2021-04-26 | 1986-02-04 | 2023-12-29 | N | 46603210 | 466032109 | 46603210 | 466032109 | ... | 72267 | 183.03 | 182.475 | 184.63 | 182.90 | 183.22 | 184.40 | 2032 | 44 | 13227029.0 |
4 | 10026 | 2019-08-05 | 2021-04-26 | 1986-02-04 | 2023-12-29 | N | 46603210 | 466032109 | 46603210 | 466032109 | ... | 118592 | 182.04 | 181.810 | 183.89 | 182.04 | 182.23 | 182.73 | 2208 | 45 | 21588487.7 |
5 rows × 64 columns
Reading a large CSV file with pandas is slow, and compressing the file does not make it meaningfully slower. Compressed files take less space on disk and are sometimes faster to read. In my experience, on a modern computer with a fast SSD, reading compressed and uncompressed files takes about the same time, but on a hard drive or other slower storage, reading compressed files is often faster.
The `gzip` format is a good choice for compressing CSV files. It is supported by most data processing tools and is fast to read and write. I have found that the `xz` format achieves an even better compression ratio with broadly similar performance, but it is not as widely supported as `gzip`.
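pandas infers the compression format from the file extension, so switching formats is transparent. As a quick sketch (the `.xz` path is hypothetical and assumes you recompress the file yourself):

import pandas as pd

# Recompress the sample file to xz; pandas infers the compression from the
# extension both when writing and when reading. The .xz path is hypothetical.
df = pd.read_csv("files/crsp_dsf.csv.gz")
df.to_csv("files/crsp_dsf.csv.xz", index=False)
df_xz = pd.read_csv("files/crsp_dsf.csv.xz")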
x = time.time()
df = pd.read_csv("files/crsp_dsf.csv")
print(f"Reading uncompressed file: {time.time() - x:.4f} seconds")
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_81162/1706856631.py:2: DtypeWarning: Columns (6,7,8,9,12,15,22,48) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv("files/crsp_dsf.csv")
Reading uncompressed file: 23.8861 seconds
x = time.time()
df = pd.read_csv("files/crsp_dsf.csv.gz")
print(f"Reading compressed file: {time.time() - x:.4f} seconds")
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_81162/3793337616.py:2: DtypeWarning: Columns (6,7,8,9,12,15,22,48) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv("files/crsp_dsf.csv.gz")
Reading compressed file: 25.3693 seconds
The times are very similar. Looking at the resulting DataFrame, we can see that the file contains 64 columns of different data types. The first few columns are identifiers such as the stock ticker, the date, and the exchange code. The remaining columns contain information about the stock such as prices, returns, and trading volumes. We can also see that the parsing of the file is incomplete, as the dates are all of the `object` type. We will fix this shortly.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
# Column Dtype
--- ------ -----
0 PERMNO int64
1 SecInfoStartDt object
2 SecInfoEndDt object
3 SecurityBegDt object
4 SecurityEndDt object
5 SecurityHdrFlg object
6 HdrCUSIP object
7 HdrCUSIP9 object
8 CUSIP object
9 CUSIP9 object
10 PrimaryExch object
11 ConditionalType object
12 ExchangeTier object
13 TradingStatusFlg object
14 SecurityNm object
15 ShareClass object
16 USIncFlg object
17 IssuerType object
18 SecurityType object
19 SecuritySubType object
20 ShareType object
21 SecurityActiveFlg object
22 DelActionType object
23 DelStatusType object
24 DelReasonType object
25 DelPaymentType object
26 Ticker object
27 TradingSymbol object
28 PERMCO int64
29 SICCD int64
30 NAICS int64
31 ICBIndustry object
32 IssuerNm object
33 YYYYMMDD int64
34 DlyCalDt object
35 DlyDelFlg object
36 DlyPrc float64
37 DlyPrcFlg object
38 DlyCap float64
39 DlyCapFlg object
40 DlyPrevPrc float64
41 DlyPrevPrcFlg object
42 DlyPrevDt object
43 DlyPrevCap float64
44 DlyPrevCapFlg object
45 DlyRet float64
46 DlyRetx float64
47 DlyRetI float64
48 DlyRetMissFlg object
49 DlyRetDurFlg object
50 DlyOrdDivAmt float64
51 DlyNonOrdDivAmt float64
52 DlyFacPrc float64
53 DlyDistRetFlg object
54 DlyVol float64
55 DlyClose float64
56 DlyLow float64
57 DlyHigh float64
58 DlyBid float64
59 DlyAsk float64
60 DlyOpen float64
61 DlyNumTrd float64
62 DlyMMCnt float64
63 DlyPrcVol float64
dtypes: float64(20), int64(5), object(39)
memory usage: 4.2+ GB
 | PERMNO | SecInfoStartDt | SecInfoEndDt | SecurityBegDt | SecurityEndDt | SecurityHdrFlg | HdrCUSIP | HdrCUSIP9 | CUSIP | CUSIP9 | ... | DlyVol | DlyClose | DlyLow | DlyHigh | DlyBid | DlyAsk | DlyOpen | DlyNumTrd | DlyMMCnt | DlyPrcVol |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10026 | 2019-08-05 | 2021-04-26 | 1986-02-04 | 2023-12-29 | N | 46603210 | 466032109 | 46603210 | 466032109 | ... | 88291.0 | 181.68 | 181.440 | 185.30 | 181.67 | 181.70 | 185.30 | 2300.0 | 44.0 | 1.604071e+07 |
1 | 10026 | 2019-08-05 | 2021-04-26 | 1986-02-04 | 2023-12-29 | N | 46603210 | 466032109 | 46603210 | 466032109 | ... | 71463.0 | 184.91 | 180.890 | 185.17 | 184.90 | 184.91 | 180.89 | 1572.0 | 44.0 | 1.321422e+07 |
2 | 10026 | 2019-08-05 | 2021-04-26 | 1986-02-04 | 2023-12-29 | N | 46603210 | 466032109 | 46603210 | 466032109 | ... | 70308.0 | 185.07 | 183.140 | 185.52 | 184.77 | 185.07 | 184.18 | 1764.0 | 44.0 | 1.301190e+07 |
3 | 10026 | 2019-08-05 | 2021-04-26 | 1986-02-04 | 2023-12-29 | N | 46603210 | 466032109 | 46603210 | 466032109 | ... | 72267.0 | 183.03 | 182.475 | 184.63 | 182.90 | 183.22 | 184.40 | 2032.0 | 44.0 | 1.322703e+07 |
4 | 10026 | 2019-08-05 | 2021-04-26 | 1986-02-04 | 2023-12-29 | N | 46603210 | 466032109 | 46603210 | 466032109 | ... | 118592.0 | 182.04 | 181.810 | 183.89 | 182.04 | 182.23 | 182.73 | 2208.0 | 45.0 | 2.158849e+07 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8883151 | 93436 | 2023-10-01 | 2023-12-29 | 2010-06-29 | 2023-12-29 | Y | 88160R10 | 88160R101 | 88160R10 | 88160R101 | ... | 93148541.0 | 252.54 | 251.370 | 258.22 | 252.53 | 252.54 | 256.76 | 972876.0 | 54.0 | 2.352373e+10 |
8883152 | 93436 | 2023-10-01 | 2023-12-29 | 2010-06-29 | 2023-12-29 | Y | 88160R10 | 88160R101 | 88160R10 | 88160R101 | ... | 86700724.0 | 256.61 | 252.910 | 257.97 | 256.66 | 256.71 | 254.49 | 954492.0 | 54.0 | 2.224827e+10 |
8883153 | 93436 | 2023-10-01 | 2023-12-29 | 2010-06-29 | 2023-12-29 | Y | 88160R10 | 88160R101 | 88160R10 | 88160R101 | ... | 106250779.0 | 261.44 | 257.520 | 263.34 | 261.52 | 261.56 | 258.35 | 1046971.0 | 54.0 | 2.777820e+10 |
8883154 | 93436 | 2023-10-01 | 2023-12-29 | 2010-06-29 | 2023-12-29 | Y | 88160R10 | 88160R101 | 88160R10 | 88160R101 | ... | 113250680.0 | 253.18 | 252.710 | 265.13 | 253.08 | 253.19 | 263.66 | 1166971.0 | 54.0 | 2.867281e+10 |
8883155 | 93436 | 2023-10-01 | 2023-12-29 | 2010-06-29 | 2023-12-29 | Y | 88160R10 | 88160R101 | 88160R10 | 88160R101 | ... | 100321201.0 | 248.48 | 247.430 | 255.19 | 248.48 | 248.49 | 255.10 | 1062240.0 | 54.0 | 2.492781e+10 |
8883156 rows × 64 columns
Reading large CSV files using the `pyarrow` library with pandas

The `c` engine was historically the fastest engine for reading CSV files in pandas. It is the default engine and is written in C. The `python` engine, written in Python, is the slowest and is used when the user specifies parameters that are not supported by the `c` engine. Pandas 2.0 added support for the `pyarrow` engine, a new engine that leverages the `pyarrow` library. It can be faster and more memory-efficient than the `c` engine because it supports multi-threading. Let's compare the performance of the `c` engine with the `pyarrow` engine for reading the sample file (I tried with the `python` engine, but I got tired of waiting...). Note that the `pyarrow` engine should be faster when using the `pyarrow` dtype backend, so we will compare the performance of the two engines with the default, the `numpy_nullable`, and the `pyarrow` dtype backends.
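The snippets below pass a `date_cols` list to the `parse_dates` parameter. Its definition is not shown in the code, but based on the seven date columns visible in the `df.info()` output below, it is presumably something like:

# Date columns in the CRSP daily stock file (inferred from the df.info()
# output below; adjust to match your own extract).
date_cols = [
    "SecInfoStartDt",
    "SecInfoEndDt",
    "SecurityBegDt",
    "SecurityEndDt",
    "YYYYMMDD",
    "DlyCalDt",
    "DlyPrevDt",
]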
x = time.time()
df = pd.read_csv("files/crsp_dsf.csv.gz", parse_dates=date_cols, engine="c")
time_c_default = time.time() - x
df.info()
print(f"Reading using the c engine and default backend: {time_c_default:.4f} seconds")
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_81162/3729689134.py:2: DtypeWarning: Columns (6,7,8,9,12,15,22,48) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv("files/crsp_dsf.csv.gz", parse_dates=date_cols, engine="c")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
# Column Dtype
--- ------ -----
0 PERMNO int64
1 SecInfoStartDt datetime64[ns]
2 SecInfoEndDt datetime64[ns]
3 SecurityBegDt datetime64[ns]
4 SecurityEndDt datetime64[ns]
5 SecurityHdrFlg object
6 HdrCUSIP object
7 HdrCUSIP9 object
8 CUSIP object
9 CUSIP9 object
10 PrimaryExch object
11 ConditionalType object
12 ExchangeTier object
13 TradingStatusFlg object
14 SecurityNm object
15 ShareClass object
16 USIncFlg object
17 IssuerType object
18 SecurityType object
19 SecuritySubType object
20 ShareType object
21 SecurityActiveFlg object
22 DelActionType object
23 DelStatusType object
24 DelReasonType object
25 DelPaymentType object
26 Ticker object
27 TradingSymbol object
28 PERMCO int64
29 SICCD int64
30 NAICS int64
31 ICBIndustry object
32 IssuerNm object
33 YYYYMMDD datetime64[ns]
34 DlyCalDt datetime64[ns]
35 DlyDelFlg object
36 DlyPrc float64
37 DlyPrcFlg object
38 DlyCap float64
39 DlyCapFlg object
40 DlyPrevPrc float64
41 DlyPrevPrcFlg object
42 DlyPrevDt datetime64[ns]
43 DlyPrevCap float64
44 DlyPrevCapFlg object
45 DlyRet float64
46 DlyRetx float64
47 DlyRetI float64
48 DlyRetMissFlg object
49 DlyRetDurFlg object
50 DlyOrdDivAmt float64
51 DlyNonOrdDivAmt float64
52 DlyFacPrc float64
53 DlyDistRetFlg object
54 DlyVol float64
55 DlyClose float64
56 DlyLow float64
57 DlyHigh float64
58 DlyBid float64
59 DlyAsk float64
60 DlyOpen float64
61 DlyNumTrd float64
62 DlyMMCnt float64
63 DlyPrcVol float64
dtypes: datetime64[ns](7), float64(20), int64(4), object(33)
memory usage: 4.2+ GB
Reading using the c engine and default backend: 28.3934 seconds
x = time.time()
df = pd.read_csv(
"files/crsp_dsf.csv.gz",
parse_dates=date_cols,
engine="c",
dtype_backend="numpy_nullable",
)
time_c_np = time.time() - x
df.info()
print(f"Reading using the c engine and numpy_nullable backend: {time_c_np:.4f} seconds")
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_81162/1999027842.py:2: DtypeWarning: Columns (6,7,8,9,12,15,22,48) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv(
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
# Column Dtype
--- ------ -----
0 PERMNO Int64
1 SecInfoStartDt datetime64[ns]
2 SecInfoEndDt datetime64[ns]
3 SecurityBegDt datetime64[ns]
4 SecurityEndDt datetime64[ns]
5 SecurityHdrFlg string
6 HdrCUSIP object
7 HdrCUSIP9 object
8 CUSIP object
9 CUSIP9 object
10 PrimaryExch string
11 ConditionalType string
12 ExchangeTier object
13 TradingStatusFlg string
14 SecurityNm string
15 ShareClass object
16 USIncFlg string
17 IssuerType string
18 SecurityType string
19 SecuritySubType string
20 ShareType string
21 SecurityActiveFlg string
22 DelActionType object
23 DelStatusType string
24 DelReasonType string
25 DelPaymentType string
26 Ticker string
27 TradingSymbol string
28 PERMCO Int64
29 SICCD Int64
30 NAICS Int64
31 ICBIndustry string
32 IssuerNm string
33 YYYYMMDD datetime64[ns]
34 DlyCalDt datetime64[ns]
35 DlyDelFlg string
36 DlyPrc Float64
37 DlyPrcFlg string
38 DlyCap Float64
39 DlyCapFlg string
40 DlyPrevPrc Float64
41 DlyPrevPrcFlg string
42 DlyPrevDt object
43 DlyPrevCap Float64
44 DlyPrevCapFlg string
45 DlyRet Float64
46 DlyRetx Float64
47 DlyRetI Float64
48 DlyRetMissFlg object
49 DlyRetDurFlg string
50 DlyOrdDivAmt Float64
51 DlyNonOrdDivAmt Float64
52 DlyFacPrc Float64
53 DlyDistRetFlg string
54 DlyVol Int64
55 DlyClose Float64
56 DlyLow Float64
57 DlyHigh Float64
58 DlyBid Float64
59 DlyAsk Float64
60 DlyOpen Float64
61 DlyNumTrd Int64
62 DlyMMCnt Int64
63 DlyPrcVol Float64
dtypes: Float64(17), Int64(7), datetime64[ns](6), object(9), string(25)
memory usage: 4.4+ GB
Reading using the c engine and numpy_nullable backend: 34.7816 seconds
x = time.time()
df = pd.read_csv(
"files/crsp_dsf.csv.gz", parse_dates=date_cols, engine="c", dtype_backend="pyarrow"
)
time_c_arrow = time.time() - x
df.info()
print(f"Reading using the c engine and pyarrow backend: {time_c_arrow:.4f} seconds")
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_81162/2542695645.py:2: DtypeWarning: Columns (6,7,8,9) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv(
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
# Column Dtype
--- ------ -----
0 PERMNO int64[pyarrow]
1 SecInfoStartDt datetime64[ns]
2 SecInfoEndDt datetime64[ns]
3 SecurityBegDt datetime64[ns]
4 SecurityEndDt datetime64[ns]
5 SecurityHdrFlg string[pyarrow]
6 HdrCUSIP object
7 HdrCUSIP9 object
8 CUSIP object
9 CUSIP9 object
10 PrimaryExch string[pyarrow]
11 ConditionalType string[pyarrow]
12 ExchangeTier string[pyarrow]
13 TradingStatusFlg string[pyarrow]
14 SecurityNm string[pyarrow]
15 ShareClass string[pyarrow]
16 USIncFlg string[pyarrow]
17 IssuerType string[pyarrow]
18 SecurityType string[pyarrow]
19 SecuritySubType string[pyarrow]
20 ShareType string[pyarrow]
21 SecurityActiveFlg string[pyarrow]
22 DelActionType string[pyarrow]
23 DelStatusType string[pyarrow]
24 DelReasonType string[pyarrow]
25 DelPaymentType string[pyarrow]
26 Ticker string[pyarrow]
27 TradingSymbol string[pyarrow]
28 PERMCO int64[pyarrow]
29 SICCD int64[pyarrow]
30 NAICS int64[pyarrow]
31 ICBIndustry string[pyarrow]
32 IssuerNm string[pyarrow]
33 YYYYMMDD datetime64[ns]
34 DlyCalDt datetime64[ns]
35 DlyDelFlg string[pyarrow]
36 DlyPrc double[pyarrow]
37 DlyPrcFlg string[pyarrow]
38 DlyCap double[pyarrow]
39 DlyCapFlg string[pyarrow]
40 DlyPrevPrc double[pyarrow]
41 DlyPrevPrcFlg string[pyarrow]
42 DlyPrevDt object
43 DlyPrevCap double[pyarrow]
44 DlyPrevCapFlg string[pyarrow]
45 DlyRet double[pyarrow]
46 DlyRetx double[pyarrow]
47 DlyRetI double[pyarrow]
48 DlyRetMissFlg string[pyarrow]
49 DlyRetDurFlg string[pyarrow]
50 DlyOrdDivAmt double[pyarrow]
51 DlyNonOrdDivAmt double[pyarrow]
52 DlyFacPrc double[pyarrow]
53 DlyDistRetFlg string[pyarrow]
54 DlyVol int64[pyarrow]
55 DlyClose double[pyarrow]
56 DlyLow double[pyarrow]
57 DlyHigh double[pyarrow]
58 DlyBid double[pyarrow]
59 DlyAsk double[pyarrow]
60 DlyOpen double[pyarrow]
61 DlyNumTrd int64[pyarrow]
62 DlyMMCnt int64[pyarrow]
63 DlyPrcVol double[pyarrow]
dtypes: datetime64[ns](6), double[pyarrow](17), int64[pyarrow](7), object(5), string[pyarrow](29)
memory usage: 4.2+ GB
Reading using the c engine and pyarrow backend: 51.4582 seconds
x = time.time()
df = pd.read_csv("files/crsp_dsf.csv.gz", parse_dates=date_cols, engine="pyarrow")
time_arrow_default = time.time() - x
df.info()
print(
f"Reading using the pyarrow engine and default backend: {time_arrow_default:.4f} seconds"
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
# Column Dtype
--- ------ -----
0 PERMNO int64
1 SecInfoStartDt datetime64[ns]
2 SecInfoEndDt datetime64[ns]
3 SecurityBegDt datetime64[ns]
4 SecurityEndDt datetime64[ns]
5 SecurityHdrFlg object
6 HdrCUSIP object
7 HdrCUSIP9 object
8 CUSIP object
9 CUSIP9 object
10 PrimaryExch object
11 ConditionalType object
12 ExchangeTier object
13 TradingStatusFlg object
14 SecurityNm object
15 ShareClass object
16 USIncFlg object
17 IssuerType object
18 SecurityType object
19 SecuritySubType object
20 ShareType object
21 SecurityActiveFlg object
22 DelActionType object
23 DelStatusType object
24 DelReasonType object
25 DelPaymentType object
26 Ticker object
27 TradingSymbol object
28 PERMCO int64
29 SICCD int64
30 NAICS int64
31 ICBIndustry object
32 IssuerNm object
33 YYYYMMDD datetime64[ns]
34 DlyCalDt datetime64[ns]
35 DlyDelFlg object
36 DlyPrc float64
37 DlyPrcFlg object
38 DlyCap float64
39 DlyCapFlg object
40 DlyPrevPrc float64
41 DlyPrevPrcFlg object
42 DlyPrevDt object
43 DlyPrevCap float64
44 DlyPrevCapFlg object
45 DlyRet float64
46 DlyRetx float64
47 DlyRetI float64
48 DlyRetMissFlg object
49 DlyRetDurFlg object
50 DlyOrdDivAmt float64
51 DlyNonOrdDivAmt float64
52 DlyFacPrc float64
53 DlyDistRetFlg object
54 DlyVol float64
55 DlyClose float64
56 DlyLow float64
57 DlyHigh float64
58 DlyBid float64
59 DlyAsk float64
60 DlyOpen float64
61 DlyNumTrd float64
62 DlyMMCnt float64
63 DlyPrcVol float64
dtypes: datetime64[ns](6), float64(20), int64(4), object(34)
memory usage: 4.2+ GB
Reading using the pyarrow engine and default backend: 21.9848 seconds
x = time.time()
df = pd.read_csv(
"files/crsp_dsf.csv.gz",
parse_dates=date_cols,
engine="pyarrow",
dtype_backend="numpy_nullable",
)
time_arrow_np = time.time() - x
df.info()
print(
f"Reading using the pyarrow engine and numpy_nullable backend: {time_arrow_np:.4f} seconds"
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
# Column Dtype
--- ------ -----
0 PERMNO Int64
1 SecInfoStartDt datetime64[ns]
2 SecInfoEndDt datetime64[ns]
3 SecurityBegDt datetime64[ns]
4 SecurityEndDt datetime64[ns]
5 SecurityHdrFlg string
6 HdrCUSIP string
7 HdrCUSIP9 string
8 CUSIP string
9 CUSIP9 string
10 PrimaryExch string
11 ConditionalType string
12 ExchangeTier string
13 TradingStatusFlg string
14 SecurityNm string
15 ShareClass string
16 USIncFlg string
17 IssuerType string
18 SecurityType string
19 SecuritySubType string
20 ShareType string
21 SecurityActiveFlg string
22 DelActionType string
23 DelStatusType string
24 DelReasonType string
25 DelPaymentType string
26 Ticker string
27 TradingSymbol string
28 PERMCO Int64
29 SICCD Int64
30 NAICS Int64
31 ICBIndustry string
32 IssuerNm string
33 YYYYMMDD datetime64[ns]
34 DlyCalDt datetime64[ns]
35 DlyDelFlg string
36 DlyPrc Float64
37 DlyPrcFlg string
38 DlyCap Float64
39 DlyCapFlg string
40 DlyPrevPrc Float64
41 DlyPrevPrcFlg string
42 DlyPrevDt object
43 DlyPrevCap Float64
44 DlyPrevCapFlg string
45 DlyRet Float64
46 DlyRetx Float64
47 DlyRetI Float64
48 DlyRetMissFlg string
49 DlyRetDurFlg string
50 DlyOrdDivAmt Float64
51 DlyNonOrdDivAmt Float64
52 DlyFacPrc Float64
53 DlyDistRetFlg string
54 DlyVol Int64
55 DlyClose Float64
56 DlyLow Float64
57 DlyHigh Float64
58 DlyBid Float64
59 DlyAsk Float64
60 DlyOpen Float64
61 DlyNumTrd Int64
62 DlyMMCnt Int64
63 DlyPrcVol Float64
dtypes: Float64(17), Int64(7), datetime64[ns](6), object(1), string(33)
memory usage: 4.4+ GB
Reading using the pyarrow engine and numpy_nullable backend: 33.7553 seconds
x = time.time()
df = pd.read_csv(
"files/crsp_dsf.csv.gz",
parse_dates=date_cols,
engine="pyarrow",
dtype_backend="pyarrow",
)
time_arrow_arrow = time.time() - x
df.info()
print(
f"Reading using the pyarrow engine and pyarrow backend: {time_arrow_arrow:.4f} seconds"
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
# Column Dtype
--- ------ -----
0 PERMNO int64[pyarrow]
1 SecInfoStartDt date32[day][pyarrow]
2 SecInfoEndDt date32[day][pyarrow]
3 SecurityBegDt date32[day][pyarrow]
4 SecurityEndDt date32[day][pyarrow]
5 SecurityHdrFlg string[pyarrow]
6 HdrCUSIP string[pyarrow]
7 HdrCUSIP9 string[pyarrow]
8 CUSIP string[pyarrow]
9 CUSIP9 string[pyarrow]
10 PrimaryExch string[pyarrow]
11 ConditionalType string[pyarrow]
12 ExchangeTier string[pyarrow]
13 TradingStatusFlg string[pyarrow]
14 SecurityNm string[pyarrow]
15 ShareClass string[pyarrow]
16 USIncFlg string[pyarrow]
17 IssuerType string[pyarrow]
18 SecurityType string[pyarrow]
19 SecuritySubType string[pyarrow]
20 ShareType string[pyarrow]
21 SecurityActiveFlg string[pyarrow]
22 DelActionType string[pyarrow]
23 DelStatusType string[pyarrow]
24 DelReasonType string[pyarrow]
25 DelPaymentType string[pyarrow]
26 Ticker string[pyarrow]
27 TradingSymbol string[pyarrow]
28 PERMCO int64[pyarrow]
29 SICCD int64[pyarrow]
30 NAICS int64[pyarrow]
31 ICBIndustry string[pyarrow]
32 IssuerNm string[pyarrow]
33 YYYYMMDD datetime64[ns]
34 DlyCalDt date32[day][pyarrow]
35 DlyDelFlg string[pyarrow]
36 DlyPrc double[pyarrow]
37 DlyPrcFlg string[pyarrow]
38 DlyCap double[pyarrow]
39 DlyCapFlg string[pyarrow]
40 DlyPrevPrc double[pyarrow]
41 DlyPrevPrcFlg string[pyarrow]
42 DlyPrevDt date32[day][pyarrow]
43 DlyPrevCap double[pyarrow]
44 DlyPrevCapFlg string[pyarrow]
45 DlyRet double[pyarrow]
46 DlyRetx double[pyarrow]
47 DlyRetI double[pyarrow]
48 DlyRetMissFlg string[pyarrow]
49 DlyRetDurFlg string[pyarrow]
50 DlyOrdDivAmt double[pyarrow]
51 DlyNonOrdDivAmt double[pyarrow]
52 DlyFacPrc double[pyarrow]
53 DlyDistRetFlg string[pyarrow]
54 DlyVol int64[pyarrow]
55 DlyClose double[pyarrow]
56 DlyLow double[pyarrow]
57 DlyHigh double[pyarrow]
58 DlyBid double[pyarrow]
59 DlyAsk double[pyarrow]
60 DlyOpen double[pyarrow]
61 DlyNumTrd int64[pyarrow]
62 DlyMMCnt int64[pyarrow]
63 DlyPrcVol double[pyarrow]
dtypes: date32[day][pyarrow](6), datetime64[ns](1), double[pyarrow](17), int64[pyarrow](7), string[pyarrow](33)
memory usage: 4.2 GB
Reading using the pyarrow engine and pyarrow backend: 6.4519 seconds
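To summarize the timings measured above for reading the compressed CSV file:

Engine | dtype backend | Time (s)
---|---|---
c | default (numpy) | 28.39
c | numpy_nullable | 34.78
c | pyarrow | 51.46
pyarrow | default (numpy) | 21.98
pyarrow | numpy_nullable | 33.76
pyarrow | pyarrow | 6.45

The `pyarrow` engine paired with the `pyarrow` dtype backend is by far the fastest combination, while mixing the `c` engine with the `pyarrow` backend is actually the slowest.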
An even better way to avoid unnecessary parsing is to use a more efficient file format. CSV files are easy to create and read, but they are not the most efficient format. Because they store the data as text, pandas has to parse the data every time we read the file. The Apache Parquet file format is more efficient because it stores the data in a binary format that is optimized for efficient reading. This means that pandas does not need to parse the data again every time we read the file. In addition, Parquet files are compressed by default, so they take less space on disk.
Parquet files are columnar, which means that the data is stored by column. This is especially efficient if you only need to read a subset of the columns. Another efficient columnar file format that is supported by pandas is Apache ORC. However, it is not as widely supported as Parquet.
Parquet files are supported by pandas through the `pyarrow` library, so you will need to install it if you want to use Parquet files.
In the example below, writing the Parquet file takes under eight seconds, and reading it back takes well under a second, roughly nine times faster than the fastest CSV read above. The Parquet file also takes less space on disk than the compressed CSV file.
x = time.time()
df.to_parquet("files/crsp_dsf.parquet", engine="pyarrow", index=False)
time_write_parquet = time.time() - x
print(
f"Writing to Parquet using the pyarrow engine and pyarrow backend: {time_write_parquet} seconds"
)
Writing to Parquet using the pyarrow engine and pyarrow backend: 7.636731147766113 seconds
x = time.time()
df = pd.read_parquet(
"files/crsp_dsf.parquet", engine="pyarrow", dtype_backend="pyarrow"
)
time_parquet_arrow_arrow = time.time() - x
df.info()
print(
f"Reading Parquet file using the pyarrow engine and pyarrow backend: {time_parquet_arrow_arrow} seconds"
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 64 columns):
# Column Dtype
--- ------ -----
0 PERMNO int64[pyarrow]
1 SecInfoStartDt date32[day][pyarrow]
2 SecInfoEndDt date32[day][pyarrow]
3 SecurityBegDt date32[day][pyarrow]
4 SecurityEndDt date32[day][pyarrow]
5 SecurityHdrFlg string[pyarrow]
6 HdrCUSIP string[pyarrow]
7 HdrCUSIP9 string[pyarrow]
8 CUSIP string[pyarrow]
9 CUSIP9 string[pyarrow]
10 PrimaryExch string[pyarrow]
11 ConditionalType string[pyarrow]
12 ExchangeTier string[pyarrow]
13 TradingStatusFlg string[pyarrow]
14 SecurityNm string[pyarrow]
15 ShareClass string[pyarrow]
16 USIncFlg string[pyarrow]
17 IssuerType string[pyarrow]
18 SecurityType string[pyarrow]
19 SecuritySubType string[pyarrow]
20 ShareType string[pyarrow]
21 SecurityActiveFlg string[pyarrow]
22 DelActionType string[pyarrow]
23 DelStatusType string[pyarrow]
24 DelReasonType string[pyarrow]
25 DelPaymentType string[pyarrow]
26 Ticker string[pyarrow]
27 TradingSymbol string[pyarrow]
28 PERMCO int64[pyarrow]
29 SICCD int64[pyarrow]
30 NAICS int64[pyarrow]
31 ICBIndustry string[pyarrow]
32 IssuerNm string[pyarrow]
33 YYYYMMDD timestamp[ns][pyarrow]
34 DlyCalDt date32[day][pyarrow]
35 DlyDelFlg string[pyarrow]
36 DlyPrc double[pyarrow]
37 DlyPrcFlg string[pyarrow]
38 DlyCap double[pyarrow]
39 DlyCapFlg string[pyarrow]
40 DlyPrevPrc double[pyarrow]
41 DlyPrevPrcFlg string[pyarrow]
42 DlyPrevDt date32[day][pyarrow]
43 DlyPrevCap double[pyarrow]
44 DlyPrevCapFlg string[pyarrow]
45 DlyRet double[pyarrow]
46 DlyRetx double[pyarrow]
47 DlyRetI double[pyarrow]
48 DlyRetMissFlg string[pyarrow]
49 DlyRetDurFlg string[pyarrow]
50 DlyOrdDivAmt double[pyarrow]
51 DlyNonOrdDivAmt double[pyarrow]
52 DlyFacPrc double[pyarrow]
53 DlyDistRetFlg string[pyarrow]
54 DlyVol int64[pyarrow]
55 DlyClose double[pyarrow]
56 DlyLow double[pyarrow]
57 DlyHigh double[pyarrow]
58 DlyBid double[pyarrow]
59 DlyAsk double[pyarrow]
60 DlyOpen double[pyarrow]
61 DlyNumTrd int64[pyarrow]
62 DlyMMCnt int64[pyarrow]
63 DlyPrcVol double[pyarrow]
dtypes: date32[day][pyarrow](6), double[pyarrow](17), int64[pyarrow](7), string[pyarrow](33), timestamp[ns][pyarrow](1)
memory usage: 4.2 GB
Reading Parquet file using the pyarrow engine and pyarrow backend: 0.7189159393310547 seconds
The second tip is to read only the columns you need. This is especially important when working with large files, as it saves both memory and time. We can specify the columns we want with the `usecols` parameter of the `read_csv` function or the `columns` parameter of the `read_parquet` function. Note that for CSV files the time saved mostly comes from not having to parse the columns we are not interested in; because CSV files are not indexed, we still have to read the entire file. Parquet files are indexed and columnar, so the `pyarrow` engine does not have to read the entire file if we only need a subset of the columns.
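As with `date_cols`, the `cols` list is used below without being shown. Judging from the columns of the resulting DataFrames, it is presumably something like:

# Column subset used in the examples below (inferred from the df.info()
# output that follows).
cols = [
    "DlyCalDt",
    "PERMNO",
    "CUSIP",
    "Ticker",
    "PrimaryExch",
    "SecurityType",
    "SICCD",
    "DlyRet",
    "DlyPrc",
]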
x = time.time()
df = pd.read_csv(
"files/crsp_dsf.csv.gz", usecols=cols, parse_dates=["DlyCalDt"], engine="c"
)
time_c_default_subset = time.time() - x
df.info()
print(
f"Reading a subset of columns using the c engine and default backend: {time_c_default_subset:.4f} seconds"
)
/var/folders/jr/cn9h86ld68qb5rtvs9gsb1vr0000gn/T/ipykernel_81162/3713586357.py:2: DtypeWarning: Columns (8) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv(
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 9 columns):
# Column Dtype
--- ------ -----
0 PERMNO int64
1 CUSIP object
2 PrimaryExch object
3 SecurityType object
4 Ticker object
5 SICCD int64
6 DlyCalDt datetime64[ns]
7 DlyPrc float64
8 DlyRet float64
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 610.0+ MB
Reading a subset of columns using the c engine and default backend: 11.7283 seconds
x = time.time()
df = pd.read_csv(
"files/crsp_dsf.csv.gz",
usecols=cols,
parse_dates=["DlyCalDt"],
engine="pyarrow",
dtype_backend="pyarrow",
)
time_arrow_arrow_subset = time.time() - x
df.info()
print(
f"Reading a subset of columns using the pyarrow engine and pyarrow backend: {time_arrow_arrow_subset:.4f} seconds"
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 9 columns):
# Column Dtype
--- ------ -----
0 DlyCalDt date32[day][pyarrow]
1 PERMNO int64[pyarrow]
2 CUSIP string[pyarrow]
3 Ticker string[pyarrow]
4 PrimaryExch string[pyarrow]
5 SecurityType string[pyarrow]
6 SICCD int64[pyarrow]
7 DlyRet double[pyarrow]
8 DlyPrc double[pyarrow]
dtypes: date32[day][pyarrow](1), double[pyarrow](2), int64[pyarrow](2), string[pyarrow](4)
memory usage: 583.0 MB
Reading a subset of columns using the pyarrow engine and pyarrow backend: 2.9267 seconds
x = time.time()
df = pd.read_parquet(
"files/crsp_dsf.parquet", columns=cols, engine="pyarrow", dtype_backend="pyarrow"
)
time_parquet_arrow_arrow_subset = time.time() - x
df.info()
print(
f"Reading a subset of columns from Parquet file using the pyarrow engine and pyarrow backend: {time_parquet_arrow_arrow_subset:.4f} seconds"
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8883156 entries, 0 to 8883155
Data columns (total 9 columns):
# Column Dtype
--- ------ -----
0 DlyCalDt date32[day][pyarrow]
1 PERMNO int64[pyarrow]
2 CUSIP string[pyarrow]
3 Ticker string[pyarrow]
4 PrimaryExch string[pyarrow]
5 SecurityType string[pyarrow]
6 SICCD int64[pyarrow]
7 DlyRet double[pyarrow]
8 DlyPrc double[pyarrow]
dtypes: date32[day][pyarrow](1), double[pyarrow](2), int64[pyarrow](2), string[pyarrow](4)
memory usage: 588.7 MB
Reading a subset of columns from Parquet file using the pyarrow engine and pyarrow backend: 0.0876 seconds
Pandas' `read_parquet()` function passes all extra arguments to the `pyarrow.parquet.read_table()` function, so we can use the `filters` parameter to read only the rows we need. The `filters` parameter takes a list of tuples, where each tuple contains the column name, the operator, and the value. For example, if we only want to read the rows for January 2023, we can use the following filter:
from datetime import date
filters = [("DlyCalDt", ">=", date(2023, 1, 1)), ("DlyCalDt", "<", date(2023, 2, 1))]
x = time.time()
df = pd.read_parquet(
"files/crsp_dsf.parquet",
columns=cols,
filters=filters,
engine="pyarrow",
dtype_backend="pyarrow",
)
time_parquet_arrow_arrow_subset = time.time() - x
df.info()
print(
f"Reading a subset of rows and columns from Parquet file using the pyarrow engine and pyarrow backend: {time_parquet_arrow_arrow_subset:.4f} seconds"
)
df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189762 entries, 0 to 189761
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DlyCalDt 189762 non-null date32[day][pyarrow]
1 PERMNO 189762 non-null int64[pyarrow]
2 CUSIP 189695 non-null string[pyarrow]
3 Ticker 189655 non-null string[pyarrow]
4 PrimaryExch 189762 non-null string[pyarrow]
5 SecurityType 189695 non-null string[pyarrow]
6 SICCD 189762 non-null int64[pyarrow]
7 DlyRet 189592 non-null double[pyarrow]
8 DlyPrc 189633 non-null double[pyarrow]
dtypes: date32[day][pyarrow](1), double[pyarrow](2), int64[pyarrow](2), string[pyarrow](4)
memory usage: 12.6 MB
Reading a subset of rows and columns from Parquet file using the pyarrow engine and pyarrow backend: 0.0680 seconds
 | DlyCalDt | PERMNO | CUSIP | Ticker | PrimaryExch | SecurityType | SICCD | DlyRet | DlyPrc |
---|---|---|---|---|---|---|---|---|---|
0 | 2023-01-03 | 10026 | 46603210 | JJSF | Q | EQTY | 2052 | 0.011823 | 151.48 |
1 | 2023-01-04 | 10026 | 46603210 | JJSF | Q | EQTY | 2052 | -0.001716 | 151.22 |
2 | 2023-01-05 | 10026 | 46603210 | JJSF | Q | EQTY | 2052 | -0.010713 | 149.60 |
3 | 2023-01-06 | 10026 | 46603210 | JJSF | Q | EQTY | 2052 | 0.020321 | 152.64 |
4 | 2023-01-09 | 10026 | 46603210 | JJSF | Q | EQTY | 2052 | -0.019851 | 149.61 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
189757 | 2023-01-25 | 93436 | 88160R10 | TSLA | Q | EQTY | 9999 | 0.003753 | 144.43 |
189758 | 2023-01-26 | 93436 | 88160R10 | TSLA | Q | EQTY | 9999 | 0.109673 | 160.27 |
189759 | 2023-01-27 | 93436 | 88160R10 | TSLA | Q | EQTY | 9999 | 0.110002 | 177.90 |
189760 | 2023-01-30 | 93436 | 88160R10 | TSLA | Q | EQTY | 9999 | -0.063182 | 166.66 |
189761 | 2023-01-31 | 93436 | 88160R10 | TSLA | Q | EQTY | 9999 | 0.039362 | 173.22 |
189762 rows × 9 columns
Note that this is not necessarily faster than reading the entire file and then filtering the rows with pandas, but it can be more memory efficient because we do not have to read the entire file into memory.
x = time.time()
df = pd.read_parquet(
"files/crsp_dsf.parquet",
columns=cols,
engine="pyarrow",
dtype_backend="pyarrow",
)
df = df[(df["DlyCalDt"] >= date(2023, 1, 1)) & (df["DlyCalDt"] < date(2023, 2, 1))]
time_parquet_arrow_arrow_subset = time.time() - x
df.info()
print(
f"Reading a subset of columns from Parquet file using the pyarrow engine and pyarrow backend and filtering rows using pandas: {time_parquet_arrow_arrow_subset:.4f} seconds"
)
df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189762 entries, 0 to 189761
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DlyCalDt 189762 non-null date32[day][pyarrow]
1 PERMNO 189762 non-null int64[pyarrow]
2 CUSIP 189695 non-null string[pyarrow]
3 Ticker 189655 non-null string[pyarrow]
4 PrimaryExch 189762 non-null string[pyarrow]
5 SecurityType 189695 non-null string[pyarrow]
6 SICCD 189762 non-null int64[pyarrow]
7 DlyRet 189592 non-null double[pyarrow]
8 DlyPrc 189633 non-null double[pyarrow]
dtypes: date32[day][pyarrow](1), double[pyarrow](2), int64[pyarrow](2), string[pyarrow](4)
memory usage: 12.6 MB
Reading a subset of columns from Parquet file using the pyarrow engine and pyarrow backend and filtering rows using pandas: 0.0622 seconds
 | DlyCalDt | PERMNO | CUSIP | Ticker | PrimaryExch | SecurityType | SICCD | DlyRet | DlyPrc |
---|---|---|---|---|---|---|---|---|---|
0 | 2023-01-03 | 10026 | 46603210 | JJSF | Q | EQTY | 2052 | 0.011823 | 151.48 |
1 | 2023-01-04 | 10026 | 46603210 | JJSF | Q | EQTY | 2052 | -0.001716 | 151.22 |
2 | 2023-01-05 | 10026 | 46603210 | JJSF | Q | EQTY | 2052 | -0.010713 | 149.60 |
3 | 2023-01-06 | 10026 | 46603210 | JJSF | Q | EQTY | 2052 | 0.020321 | 152.64 |
4 | 2023-01-09 | 10026 | 46603210 | JJSF | Q | EQTY | 2052 | -0.019851 | 149.61 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
189757 | 2023-01-25 | 93436 | 88160R10 | TSLA | Q | EQTY | 9999 | 0.003753 | 144.43 |
189758 | 2023-01-26 | 93436 | 88160R10 | TSLA | Q | EQTY | 9999 | 0.109673 | 160.27 |
189759 | 2023-01-27 | 93436 | 88160R10 | TSLA | Q | EQTY | 9999 | 0.110002 | 177.90 |
189760 | 2023-01-30 | 93436 | 88160R10 | TSLA | Q | EQTY | 9999 | -0.063182 | 166.66 |
189761 | 2023-01-31 | 93436 | 88160R10 | TSLA | Q | EQTY | 9999 | 0.039362 | 173.22 |
189762 rows × 9 columns
For casual pandas users, the `pyarrow` dtype backend provides long-needed features, such as the ability to store missing values in integer and boolean columns¹ and faster reading and writing of files.
However, it is meant to be more than just a replacement for the `numpy` dtype backend. The Apache Arrow project is a cross-language development platform for in-memory data that is designed to accelerate the performance of analytical workloads, and the `pyarrow` library is its Python implementation. One of the goals of the Arrow project is to provide a common in-memory data format for data interchange between different programming languages. This means that a pandas DataFrame with a `pyarrow` dtype backend can be used as an input to other Arrow-compatible libraries such as Polars and DuckDB, or even in a different programming language such as R or Julia. This lets you use the best tool for the job without having to worry about data interchange and the associated performance overhead.
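As an illustration, here is a sketch of handing an Arrow-backed DataFrame to DuckDB (this assumes the `duckdb` package is installed; it is not used elsewhere in this post):

import duckdb
import pandas as pd

# DuckDB can query a pandas DataFrame in place; with Arrow-backed dtypes
# the data does not have to be converted to NumPy arrays first.
df = pd.read_parquet("files/crsp_dsf.parquet", dtype_backend="pyarrow")
print(duckdb.sql("SELECT AVG(DlyRet) AS avg_ret FROM df").df())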
The `dtype_backend` parameter of the `read_csv` function is still labeled as experimental in the pandas documentation, so use it with caution.
The data types (`dtypes`), i.e. the in-memory representation of the data, are not the same for the default (aka `numpy`), the `numpy_nullable`, and the `pyarrow` dtype backends. This means that if your existing code defines the data types of the columns, you will need to update it if you switch to the `pyarrow` dtype backend. In most cases, all you will need to do is append `[pyarrow]` to the data type, e.g. `int64[pyarrow]` instead of `int64`. However, the types do not all map one-to-one; for example, the default dtype backend uses the `object` type to represent strings, while the `pyarrow` dtype backend uses the `string[pyarrow]` type. You should test your code to make sure it works as expected, and have a look at the pandas documentation for more information.
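For example, a `dtype` specification might change like this when switching backends (a sketch using column names from the CRSP file above):

# Default (numpy) backend dtypes...
dtypes_numpy = {"PERMNO": "int64", "Ticker": "object", "DlyRet": "float64"}

# ...and their pyarrow-backend equivalents. Note that strings map to
# string[pyarrow], not object.
dtypes_arrow = {
    "PERMNO": "int64[pyarrow]",
    "Ticker": "string[pyarrow]",
    "DlyRet": "float64[pyarrow]",
}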
I hope these tips will save you time and memory when reading and parsing large CSV files in Python! As you can see, the Arrow library and the Parquet file format are powerful tools for reading and writing data, and they are worth exploring if you are working with large datasets. This is not surprising: the Parquet format was created as a columnar storage format for the Hadoop big data ecosystem, and the Arrow library was created to provide a standard for in-memory data representation. Arrow is also used by other data processing tools such as DuckDB and Polars, so it is worth learning about as well.
1. Note that the `numpy_nullable` backend has supported this for a while.