Pandas read_parquet vs read_csv: speed comparison and best practices
Loading large datasets in pandas often reveals a stark speed gap between read_parquet and read_csv. Parquet files store data in a compressed, columnar binary format, while CSVs are plain text that must be parsed field by field. In production ETL jobs the difference can roughly halve processing time.
# Example showing the issue
import pandas as pd
import time
csv_path = 'data/sample.csv'
parquet_path = 'data/sample.parquet'
# CSV benchmark
start = time.time()
df_csv = pd.read_csv(csv_path)
print(f'CSV rows: {len(df_csv)}, time: {time.time() - start:.3f}s')
# Parquet benchmark
start = time.time()
df_parquet = pd.read_parquet(parquet_path)
print(f'Parquet rows: {len(df_parquet)}, time: {time.time() - start:.3f}s')
# Output typically shows Parquet loading ~2× faster
Parquet, as defined by the Apache Parquet format, stores data column-wise with built-in compression, so pandas (via the pyarrow engine) can read only the columns it needs and decompress them in native code. CSV is row-wise plain text, requiring full parsing and type conversion of every field. Related factors, illustrated in the sketch after this list:
- Columnar layout reduces I/O volume
- Binary encoding avoids Python string parsing
- Optional predicate pushdown skips irrelevant rows
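A minimal sketch of how this plays out in practice, assuming the same sample file and hypothetical column names; the filters argument requires the pyarrow engine and a reasonably recent pandas/pyarrow:
import pandas as pd
# Read only the columns you need; other columns are never decompressed
df_cols = pd.read_parquet('data/sample.parquet', columns=['id', 'value'])
# Predicate pushdown: row groups whose statistics cannot match are skipped
df_filtered = pd.read_parquet(
    'data/sample.parquet',
    engine='pyarrow',
    filters=[('value', '>', 0)],
)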
To diagnose this in your code:
# Simple benchmark using %timeit in a notebook
%timeit -n 5 pd.read_csv('data/sample.csv')
%timeit -n 5 pd.read_parquet('data/sample.parquet')
# Compare the reported timings; a consistently large gap confirms Parquet's advantage on this dataset
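Outside a notebook, the standard-library timeit module gives a comparable measurement; a rough sketch assuming the same sample files exist:
import timeit
import pandas as pd
csv_best = min(timeit.repeat(lambda: pd.read_csv('data/sample.csv'), number=1, repeat=5))
parquet_best = min(timeit.repeat(lambda: pd.read_parquet('data/sample.parquet'), number=1, repeat=5))
print(f'CSV best of 5: {csv_best:.3f}s, Parquet best of 5: {parquet_best:.3f}s')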
Fixing the Issue
If your pipeline consumes data generated by modern ETL jobs, store it as Parquet and load with pandas.read_parquet. The columnar format cuts I/O and parsing overhead.
# Prefer Parquet for large, column‑oriented datasets
import pandas as pd
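# Reading directly from S3 requires s3fs (via fsspec); a local path works the same way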
df = pd.read_parquet('s3://bucket/dataset.parquet', engine='pyarrow')
When you must ingest CSVs (e.g., legacy exports), speed can still be improved by:
- Specifying dtypes to avoid type inference overhead
- Using the C engine (engine='c'), which is the default but can be forced explicitly
- Enabling memory_map=True for large files
- Converting once to Parquet for downstream runs
# Explicit dtypes skip type inference; memory_map reads the file via memory mapping to cut I/O overhead
df = pd.read_csv('data/legacy.csv', dtype={'id': 'int64', 'value': 'float32'}, memory_map=True)
The gotcha is that converting CSV to Parquet adds an upfront cost; the payoff appears only on repeated reads. In production we usually add a step that writes incoming CSV batches to Parquet, then all downstream jobs switch to read_parquet.
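A minimal sketch of that ingestion step, assuming a hypothetical data/incoming directory of CSV batches and a data/parquet staging area:
from pathlib import Path
import pandas as pd
incoming = Path('data/incoming')   # hypothetical landing zone for CSV batches
staged = Path('data/parquet')      # hypothetical Parquet staging area
staged.mkdir(parents=True, exist_ok=True)
for csv_file in incoming.glob('*.csv'):
    batch = pd.read_csv(csv_file, dtype={'id': 'int64', 'value': 'float32'})
    batch.to_parquet(staged / f'{csv_file.stem}.parquet', index=False)
# Downstream jobs read the staged Parquet files instead of the raw CSVs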
For maximum safety, validate that the schema matches expectations after each load:
expected_cols = {'id', 'value', 'timestamp'}
assert set(df.columns) == expected_cols, 'Schema mismatch detected'
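Column names alone do not catch silent type drift (for example an id column arriving as strings), so it can be worth checking dtypes as well; the expected types below are an assumption:
expected_dtypes = {'id': 'int64', 'value': 'float32'}  # hypothetical schema
for col, dtype in expected_dtypes.items():
    assert str(df[col].dtype) == dtype, f'Unexpected dtype for {col}: {df[col].dtype}'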
What Doesn’t Work
❌ Loading the CSV and calling df.to_parquet() only to re-read it within the same run: the conversion adds an extra write and read without any repeated-read payoff.
❌ Using .apply(lambda x: ...) to cast columns after read_csv: defeats the purpose of dtype optimization and is slow (see the sketch after this list).
❌ Forcing engine='python' on read_csv: the pure‑Python parser is much slower than the default C engine.
❌ Reading CSV without specifying dtypes: forces slow type inference on every load.
❌ Assuming Parquet is always faster, even for tiny files.
❌ Neglecting to validate the schema after loading Parquet.
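To make the .apply pitfall concrete, here is a sketch contrasting the per-element cast with the vectorized alternatives; the column name and path are the hypothetical ones used above:
# Slow: calls a Python function once per element
df['value'] = df['value'].apply(lambda x: float(x))
# Fast: vectorized cast, or better, declare the dtype at read time
df['value'] = df['value'].astype('float32')
df = pd.read_csv('data/legacy.csv', dtype={'value': 'float32'})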
When NOT to optimize
- Small files: Under a few megabytes the speed gain is negligible
- Ad‑hoc analysis: One‑off notebooks where readability matters more than performance
- Legacy systems: Downstream tools require CSV input
- Data provenance: Audits need the original human‑readable format
Frequently Asked Questions
Q: Does read_parquet always guarantee a speed boost over read_csv?
A: No. The speed-up appears only when the file is large enough for the I/O savings to matter and the data fits Parquet's columnar model.
Q: Can I speed up CSV reads by using a different engine?
A: Yes. The default C engine (engine='c') is much faster than the pure-Python parser, and specifying dtype and memory_map=True helps further.
Choosing the right file format is a core performance decision in any pandas workflow. While Parquet shines for large, column‑oriented data, CSV still has its place for quick dumps and human inspection. Align the loader with your data volume and downstream requirements to keep pipelines both fast and reliable.
Related Issues
→ Why pandas read_csv parse_dates slows loading
→ Why buffer protocol speeds up pandas DataFrame I/O
→ Why pandas read_csv low_memory warning appears
→ Why pandas concat uses more memory than append