What is Parquet and Why You Should Use It

If you work with data, chances are you have spent a fair bit of your career reading and writing CSV files. CSV is everywhere, and for good reason: it is simple, it is human-readable, and every tool under the sun knows how to open it. But once your datasets start growing past a few hundred megabytes, CSV starts to feel like the wrong tool for the job. This is where Apache Parquet comes in.

The easiest way to think about Parquet is as a CSV file that is not human-readable. You give up the ability to open the file in a text editor (or in Excel), and in exchange you get a format that perfectly encodes your data in binary. No more guessing whether a column is an integer or a string, no more worrying about how dates are formatted, no more silent loss of precision on floating-point numbers. The data on disk is the data you get back when you read the file, with the correct types and nothing lost in translation.
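
To make the type problem concrete, here is a minimal sketch using only Python's standard library. The row and its column names (id, price, day) are made up for illustration. It writes one typed row to an in-memory CSV and reads it back, at which point every value is a plain string and the reader has to guess the types again.

```python
import csv
import io
from datetime import date

# One hypothetical row holding an integer, a float, and a date.
rows = [{"id": 7, "price": 0.1 + 0.2, "day": date(2024, 1, 31)}]

# Write the row to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "price", "day"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the csv module returns every value as a string.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

print(restored[0])
print({name: type(value).__name__ for name, value in restored[0].items()})
```

Libraries such as pandas try to infer the types for you when reading CSV, but that inference is a guess. A Parquet file records the type of each column in its schema, so no guess is needed.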

The first thing I do when I start a project is convert all our raw data sources to Parquet so that we can do all our analysis using only Parquet from that point on.

Why Parquet is faster (and lighter) than CSV

Parquet is a columnar storage format with built-in compression. Those two properties are what make it so much better than CSV for analytical work.

Because the data is stored column by column rather than row by row, software can read only the columns it actually needs. If your dataset has 200 columns and your analysis only touches 5 of them, Parquet lets you skip the other 195 entirely. With CSV, you have to read and parse the whole row to get to the columns you want.

The bigger speedup, though, comes from skipping the text-parsing step altogether. With a CSV file, every value on disk is a sequence of characters that has to be parsed and converted to the right in-memory type (an integer, a float, a timestamp, and so on). That parsing is slow, and on large files it often dominates the total read time. Parquet stores each column as typed binary values, so reading the file is mostly a matter of decompressing and decoding column chunks into memory, which is far cheaper than parsing text. Writing is similarly fast because numbers never have to be formatted as text on the way out.
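
The difference between the two routes can be illustrated with a toy example in plain Python. This is not Parquet's actual encoding, only a sketch of text parsing versus fixed-width binary storage using the standard array module, with made-up values.

```python
from array import array

values = [0.1, 2.5, 1e-9, 123456.789]

# Text route (CSV-like): format each number as characters, then parse it back.
as_text = ",".join(repr(v) for v in values)
parsed = [float(token) for token in as_text.split(",")]

# Binary route: store each float64 as 8 raw bytes, no formatting or parsing.
as_bytes = array("d", values).tobytes()
decoded = array("d")
decoded.frombytes(as_bytes)

print(len(as_text), "characters as text")
print(len(as_bytes), "bytes as binary")
print(parsed == values, list(decoded) == values)
```

Both routes recover the same numbers here, but the text route has to scan and interpret every character, while the binary route just reinterprets fixed-width chunks of bytes. Real Parquet readers add decoding and decompression on top of that, which is still much cheaper than parsing text.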

A nice side benefit is that Parquet typically needs less active RAM to read and write data. With CSV, the library has to hold both the raw text and the parsed in-memory representation at various points during the parse. Parquet sidesteps that whole process, so peak memory usage tends to be noticeably lower—a real lifesaver when you are working with datasets that are close to the limits of your machine.

On top of all that, most Parquet writers compress the data by default, so the files take less space on disk. It is not unusual to see a Parquet file that is 5 to 10 times smaller than the equivalent uncompressed CSV.

Reading and writing Parquet in Python

The two libraries I use most often for tabular data in Python are pandas and Polars, and both have first-class support for Parquet.

In pandas, the functions you need are pd.read_parquet and DataFrame.to_parquet. They are drop-in replacements for read_csv and to_csv:

import pandas as pd

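# Read a Parquet file into a DataFrame; column types come from the file
df = pd.read_parquet("data.parquet")

# Write the DataFrame back out as Parquet
df.to_parquet("output.parquet")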

Under the hood, pandas uses pyarrow (or fastparquet as a fallback) to do the work, so you will need to have one of them installed. pyarrow is the one I recommend.

In Polars, the equivalents are pl.read_parquet and DataFrame.write_parquet. Polars also offers pl.scan_parquet, which returns a LazyFrame instead of a DataFrame. This is one of my favorite features of Polars: you can describe an entire query (filters, joins, aggregations) against a Parquet file, and Polars will figure out which columns and row groups it actually needs to read before touching the disk. For large files, the speedup is dramatic.

import polars as pl

# Eager: read the whole file into memory
df = pl.read_parquet("data.parquet")

# Lazy: build a query plan and only read what's needed
lf = pl.scan_parquet("data.parquet")
result = (
    lf.filter(pl.col("year") == 2024)
    .select(["ticker", "ret"])
    .collect()
)

If you are working in R, the arrow package is the standard choice. It provides read_parquet() and write_parquet(), plus open_dataset() for working with larger collections of files. The nanoparquet package is a lightweight, dependency-free alternative if you want something smaller.

Stata users are not left out either: recent versions of Stata can read Parquet files natively. See the Stata documentation on importing Parquet files for details. SAS does not support Parquet natively, but the SAS documentation describes a few workarounds you can use to get Parquet data into and out of SAS.

Hive partitioning

Once you start storing your datasets as Parquet, the next thing worth knowing is Hive partitioning, which is supported by essentially every tool that reads Parquet.

The idea is simple. Instead of one giant file, you split your data into many smaller Parquet files organized in a directory tree where the directory names encode the values of one or more columns. For example, a dataset partitioned by year and market might look like this:

dataset/
├── year=2023/
│   ├── market=BTC/
│   │   └── part-0.parquet
│   └── market=ETH/
│       └── part-0.parquet
└── year=2024/
    ├── market=BTC/
    │   └── part-0.parquet
    └── market=ETH/
        └── part-0.parquet

Any Hive-aware tool will treat that whole directory tree as one large dataset and will use the directory names to skip files it does not need. If you ask for year == 2024 and market == "BTC", the tool reads exactly one file. The year and market columns are reconstructed from the directory names, so you do not pay to store them in the files themselves.
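
The pruning logic is simple enough to sketch in a few lines of standard-library Python. The helper names partition_values and prune are hypothetical and exist only for this illustration. Real readers do the same thing internally, with proper type handling, before opening any file.

```python
from pathlib import PurePosixPath

# The files from the directory tree above, as relative paths.
files = [
    "dataset/year=2023/market=BTC/part-0.parquet",
    "dataset/year=2023/market=ETH/part-0.parquet",
    "dataset/year=2024/market=BTC/part-0.parquet",
    "dataset/year=2024/market=ETH/part-0.parquet",
]


def partition_values(path: str) -> dict[str, str]:
    """Extract key=value pairs from the directory components of a path."""
    parts = PurePosixPath(path).parent.parts
    return dict(part.split("=", 1) for part in parts if "=" in part)


def prune(paths: list[str], **wanted: str) -> list[str]:
    """Keep only the files whose partition values match every requested value."""
    return [
        path
        for path in paths
        if all(partition_values(path).get(key) == value for key, value in wanted.items())
    ]


selected = prune(files, year="2024", market="BTC")
print(selected)
```

Note that the sketch compares values as strings. Actual Hive-aware readers infer a type for each partition column, so year comes back as an integer rather than as text.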

pandas writes Hive-partitioned datasets with to_parquet(..., partition_cols=...) and reads them back when you point read_parquet at the directory. Polars does the same with write_parquet(..., partition_by=...) and scan_parquet on a directory. The R arrow package (open_dataset()), DuckDB, and most cloud data tools can work with these datasets as well.

This is how my coauthors and I distributed the data for our latest paper, “Who Wins and Who Loses In Prediction Markets? Evidence from Polymarket.” The full dataset is available on Hugging Face, partitioned by date, so you can point any Parquet-aware tool at the directory and start querying without having to download or load anything you do not need.

Wrapping up

If you are still defaulting to CSV for your research data, give Parquet a try on your next project. The switch is essentially a one-line change in your code, and you get faster reads and writes, smaller files, lower memory use, and—because the types are encoded in the file—fewer of those annoying “wait, why is this column a string now?” moments. Once you add Hive partitioning to the mix, you can work with datasets that would be impossibly slow to handle as a single CSV file. There are very few reasons left to use CSV for anything beyond quick interchange with humans or with tools that genuinely cannot read anything else.

Reuse

CC BY-NC-SA 4.0