If you’ve spent time in data engineering or analytics, you’ve almost certainly run into Parquet files. They show up everywhere — in data lakes, machine learning pipelines, cloud storage buckets, and more.
But what makes Parquet special, and how do you actually work with it in Python?
In this tutorial, I’ll walk you through reading, writing, filtering, and compressing Parquet files using Python. By the end, you’ll have a solid foundation for using Parquet in your own data pipelines.
You can get the code on GitHub.
Table of Contents
- What Is the Parquet File Format?
- Prerequisites
- Writing Parquet Files in Python
- Reading Parquet Files in Python
- Reading Specific Columns
- Writing Parquet Files with Compression
- Working with Row Groups and Filtering
- A Practical Example: Analyzing Sales Data
- When Should You Use Parquet?
What Is the Parquet File Format?
Parquet is an open-source columnar storage format originally developed by Cloudera and Twitter, and now part of the Apache ecosystem.
Unlike CSV or JSON, which store data row by row, Parquet stores data column by column. That might seem like a small implementation detail, but it has a big impact on performance.
Here’s why Parquet is widely used:
- Efficient compression — because all values in a column share the same data type, they compress far better than mixed-type rows
- Column pruning — you can read only the columns you need, skipping the rest entirely
- Schema preservation — data types are stored in the file, so integers don’t become strings on read (and other subtle bugs)
- Broad ecosystem support — works natively with Spark, Pandas, DuckDB, BigQuery, Athena, and more
Parquet is often compared to ORC (Optimized Row Columnar). Both are columnar formats, but Parquet has wider cross-platform support and tends to be the default choice outside of Hadoop/Hive environments.
Prerequisites
Before we get started, make sure you have:
- Python 3.10 or a later version installed
- Basic familiarity with pandas DataFrames
- Some understanding of file I/O in Python
You’ll need to install two libraries:
```bash
pip install pyarrow pandas
```
We’ll use PyArrow as the engine for reading and writing Parquet files. You can also use [fastparquet](https://fastparquet.readthedocs.io/en/latest/) as an alternative engine, but PyArrow is the more commonly recommended choice.
Writing Parquet Files in Python
Let’s start by creating a Parquet file. We’ll build a small product inventory dataset and write it to disk.
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create a product inventory dataset
data = {
    'product_id': [1001, 1002, 1003, 1004, 1005],
    'product_name': ['Wireless Mouse', 'Mechanical Keyboard', 'USB-C Hub', 'Monitor Stand', 'Webcam HD'],
    'category': ['Peripherals', 'Peripherals', 'Accessories', 'Accessories', 'Peripherals'],
    'price': [29.99, 89.99, 49.99, 34.99, 74.99],
    'stock': [150, 80, 200, 95, 60]
}

df = pd.DataFrame(data)

# Convert to PyArrow Table and write as Parquet
table = pa.Table.from_pandas(df)
pq.write_table(table, 'inventory.parquet')

print("Parquet file created successfully!")
```
Output:
```
Parquet file created successfully!
```
Here’s what’s happening step by step. We start with a standard pandas DataFrame. Then we convert it to a PyArrow Table using pa.Table.from_pandas(). This is PyArrow’s in-memory columnar representation. Finally, pq.write_table() serializes it to disk in Parquet format.
The conversion step is necessary because Parquet is a columnar format and PyArrow handles the translation between pandas’ row-oriented structure and Parquet’s column-oriented storage.
Reading Parquet Files in Python
Now let’s read the file back:
```python
# Read the Parquet file
table = pq.read_table('inventory.parquet')

# Convert to pandas for easy viewing
df_read = table.to_pandas()

print(df_read)
print(f"\nData types:\n{df_read.dtypes}")
```
Output:
```
   product_id         product_name     category  price  stock
0        1001       Wireless Mouse  Peripherals  29.99    150
1        1002  Mechanical Keyboard  Peripherals  89.99     80
2        1003            USB-C Hub  Accessories  49.99    200
3        1004        Monitor Stand  Accessories  34.99     95
4        1005            Webcam HD  Peripherals  74.99     60

Data types:
product_id       int64
product_name    object
category        object
price          float64
stock            int64
dtype: object
```
pq.read_table() loads the Parquet file into a PyArrow Table. We then call .to_pandas() to convert it into a familiar DataFrame.
Notice that data types are preserved correctly — price is float64, not a string. This is one of Parquet’s practical advantages over CSV, where you often have to explicitly cast types after reading.
Reading Specific Columns
One of Parquet’s most powerful features is the ability to read only the columns you actually need. This is called column pruning, and it can substantially reduce memory usage and load time on large files.
```python
# Read only product name and price
table_subset = pq.read_table('inventory.parquet', columns=['product_name', 'price'])

df_subset = table_subset.to_pandas()
print(df_subset)
```
Output:
```
          product_name  price
0       Wireless Mouse  29.99
1  Mechanical Keyboard  89.99
2            USB-C Hub  49.99
3        Monitor Stand  34.99
4            Webcam HD  74.99
```
By passing a columns list to read_table(), PyArrow only reads those two columns from disk — the rest are skipped entirely. If your file has 100 columns but your query only needs 4, you’re doing roughly 4% of the I/O work. That adds up fast at scale.
Writing Parquet Files with Compression
Parquet supports several compression codecs. Let’s compare them on a larger dataset:
```python
import os

# Generate a larger dataset
large_data = {
    'transaction_id': range(50000),
    'store': ['Downtown', 'Uptown', 'Suburban', 'Airport', 'Mall'] * 10000,
    'product': [f'product_{i % 200}' for i in range(50000)],
    'amount': [round(10 + (i % 500) * 0.5, 2) for i in range(50000)]
}

df_large = pd.DataFrame(large_data)
table_large = pa.Table.from_pandas(df_large)

# Write with different compression codecs
for codec in ['NONE', 'SNAPPY', 'GZIP', 'ZSTD']:
    path = f'transactions_{codec.lower()}.parquet'
    pq.write_table(table_large, path, compression=codec)
    size = os.path.getsize(path)
    print(f"{codec:<8}: {size:>10,} bytes")
```
Output:
```
NONE    :    635,565 bytes
SNAPPY  :    316,033 bytes
GZIP    :    177,654 bytes
ZSTD    :    158,082 bytes
```
Each codec makes a different trade-off between file size and speed:
- NONE — no compression, fastest write, largest file
- SNAPPY — fast compression and decompression, moderate size reduction, good for hot data
- GZIP — better compression than Snappy, slower, good for archival
- ZSTD — best compression ratio with competitive speed, generally the recommended default
For most production use cases, ZSTD is a solid default. It gives you the smallest files without a significant speed penalty.
Working with Row Groups and Filtering
Parquet files are internally divided into row groups — chunks of rows stored together. This structure enables a powerful optimization called predicate pushdown, where Parquet skips entire row groups that can’t possibly match your filter condition.
Let’s write a file with explicit row group sizing and then filter it:
```python
# Write with smaller row groups (useful for filtering large files)
pq.write_table(
    table_large,
    'transactions_grouped.parquet',
    compression='ZSTD',
    row_group_size=10000
)

# Use PyArrow's dataset API for efficient filtering
import pyarrow.dataset as ds

dataset = ds.dataset('transactions_grouped.parquet', format='parquet')

# Filter: only transactions from the Downtown store over $150
filtered = dataset.to_table(
    filter=(ds.field('store') == 'Downtown') & (ds.field('amount') > 150)
)

df_filtered = filtered.to_pandas()
print(f"Matching transactions: {len(df_filtered)}")
print(df_filtered.head())
```
Output:
```
Matching transactions: 4300
   transaction_id     store      product  amount
0             285  Downtown   product_85   152.5
1             290  Downtown   product_90   155.0
2             295  Downtown   product_95   157.5
3             300  Downtown  product_100   160.0
4             305  Downtown  product_105   162.5
```
The ds.dataset() API with a filter argument doesn’t load the entire file into memory. Instead, it evaluates the filter against each row group’s metadata and skips groups that can’t contain matching rows. The actual rows are only read when you call .to_table(). This is particularly useful for large files.
A Practical Example: Analyzing Sales Data
Let’s put it all together with a more realistic example. Say you receive daily sales exports and need to store them efficiently and query them by region and date.
```python
from datetime import datetime, timedelta
import random

# Simulate 30 days of sales data
regions = ['North', 'South', 'East', 'West']
products = ['Laptop', 'Tablet', 'Headphones', 'Charger', 'Case']

records = []
start = datetime(2025, 1, 1)

for day in range(30):
    date = start + timedelta(days=day)
    for _ in range(200):
        records.append({
            'date': date.strftime('%Y-%m-%d'),
            'region': random.choice(regions),
            'product': random.choice(products),
            'units_sold': random.randint(1, 20),
            'revenue': round(random.uniform(50, 2000), 2)
        })

df_sales = pd.DataFrame(records)
table_sales = pa.Table.from_pandas(df_sales)

# Save as compressed Parquet
pq.write_table(table_sales, 'sales_jan2025.parquet', compression='ZSTD')
print(f"Saved {len(df_sales):,} records to Parquet")

# Query: total revenue by region for the first two weeks
dataset = ds.dataset('sales_jan2025.parquet', format='parquet')

first_two_weeks = dataset.to_table(
    columns=['region', 'revenue', 'date'],
    filter=ds.field('date') <= '2025-01-14'
).to_pandas()

summary = first_two_weeks.groupby('region')['revenue'].sum().sort_values(ascending=False)
print("\nRevenue by region (Jan 1–14):")
print(summary.round(2))
```
Output:
```
Saved 6,000 records to Parquet

Revenue by region (Jan 1–14):
region
West     752919.48
North    739783.22
South    730002.82
East     692128.58
Name: revenue, dtype: float64
```
This is a pattern you’ll see in real data engineering work. Parquet handles the storage efficiently, column pruning limits what gets loaded into memory, and predicate pushdown ensures only the relevant date range is scanned.
When Should You Use Parquet?
Use Parquet when you:
- Work with analytical or reporting workloads where you query specific columns
- Need efficient, compressed storage for medium-to-large datasets
- Store data in cloud platforms (S3, GCS, Azure Blob) and query it with engines that bill by bytes scanned
- Use tools like Spark, DuckDB, Athena, or BigQuery that have native Parquet support
- Want schema enforcement across your pipeline
Don’t use Parquet when you:
- Need human-readable files; use CSV or JSON for that
- Are doing frequent row-level updates or inserts; Parquet is write-once, read-many
- Work with very small datasets where the overhead isn’t worth it
- Need streaming row-by-row writes; consider Avro instead
Conclusion
I hope you found this tutorial useful. Parquet’s columnar layout, built-in compression, and schema preservation add up to faster queries, smaller files, and fewer type-related bugs.
In this tutorial, you learned how to write and read Parquet files, use column pruning to reduce I/O, apply compression codecs, and filter large files efficiently using predicate pushdown. These are the building blocks you’ll reach for repeatedly in any data-heavy Python project.
Try swapping your next CSV export for Parquet and see how it affects your pipeline. Happy coding!
