How to Work with Parquet Files in Python – A Guide with Examples

If you’ve spent time in data engineering or analytics, you’ve almost certainly run into Parquet files. They show up everywhere — in data lakes, machine learning pipelines, cloud storage buckets, and more.

But what makes Parquet special, and how do you actually work with it in Python?

In this tutorial, I’ll walk you through reading, writing, filtering, and compressing Parquet files using Python. By the end, you’ll have a solid foundation for using Parquet in your own data pipelines.

You can get the code on GitHub.

Table of Contents

  1. What Is the Parquet File Format?
  2. Prerequisites
  3. Writing Parquet Files in Python
  4. Reading Parquet Files in Python
  5. Reading Specific Columns
  6. Writing Parquet Files with Compression
  7. Working with Row Groups and Filtering
  8. A Practical Example: Analyzing Sales Data
  9. When Should You Use Parquet?

What Is the Parquet File Format?

Parquet is an open-source columnar storage format originally developed by Cloudera and Twitter, and now part of the Apache ecosystem.

Unlike CSV or JSON, which store data row by row, Parquet stores data column by column. That might seem like a small implementation detail, but it has a big impact on performance.

Here’s why Parquet is widely used:

  • Efficient compression — because all values in a column share the same data type, they compress far better than mixed-type rows
  • Column pruning — you can read only the columns you need, skipping the rest entirely
  • Schema preservation — data types are stored in the file, so integers don’t become strings on read (and other subtle bugs)
  • Broad ecosystem support — works natively with Spark, Pandas, DuckDB, BigQuery, Athena, and more

Parquet is often compared to ORC (Optimized Row Columnar). Both are columnar formats, but Parquet has wider cross-platform support and tends to be the default choice outside of Hadoop/Hive environments.

Prerequisites

Before we get started, make sure you have:

  • Python 3.10 or later installed
  • Basic familiarity with pandas DataFrames
  • Some understanding of file I/O in Python

You’ll need to install two libraries:

pip install pyarrow pandas

We’ll use PyArrow as the engine for reading and writing Parquet files. You can also use [fastparquet](https://fastparquet.readthedocs.io/en/latest/) as an alternative engine, but PyArrow is the more commonly recommended choice.

Writing Parquet Files in Python

Let’s start by creating a Parquet file. We’ll build a small product inventory dataset and write it to disk.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create a product inventory dataset
data = {
    'product_id': [1001, 1002, 1003, 1004, 1005],
    'product_name': ['Wireless Mouse', 'Mechanical Keyboard', 'USB-C Hub', 'Monitor Stand', 'Webcam HD'],
    'category': ['Peripherals', 'Peripherals', 'Accessories', 'Accessories', 'Peripherals'],
    'price': [29.99, 89.99, 49.99, 34.99, 74.99],
    'stock': [150, 80, 200, 95, 60]
}

df = pd.DataFrame(data)

# Convert to PyArrow Table and write as Parquet
table = pa.Table.from_pandas(df)
pq.write_table(table, 'inventory.parquet')

print("Parquet file created successfully!")

Output:

Parquet file created successfully!

Here’s what’s happening step by step. We start with a standard pandas DataFrame. Then we convert it to a PyArrow Table using pa.Table.from_pandas(). This is PyArrow’s in-memory columnar representation. Finally, pq.write_table() serializes it to disk in Parquet format.

The conversion step is necessary because pq.write_table() operates on PyArrow Tables: PyArrow handles the translation between pandas’ in-memory representation and Parquet’s column-oriented on-disk layout.

Reading Parquet Files in Python

Now let’s read the file back:

# Read the Parquet file
table = pq.read_table('inventory.parquet')

# Convert to pandas for easy viewing
df_read = table.to_pandas()

print(df_read)
print(f"\nData types:\n{df_read.dtypes}")

Output:

   product_id         product_name     category  price  stock
0        1001       Wireless Mouse  Peripherals  29.99    150
1        1002  Mechanical Keyboard  Peripherals  89.99     80
2        1003            USB-C Hub  Accessories  49.99    200
3        1004        Monitor Stand  Accessories  34.99     95
4        1005            Webcam HD  Peripherals  74.99     60

Data types:
product_id        int64
product_name     object
category         object
price           float64
stock             int64
dtype: object

pq.read_table() loads the Parquet file into a PyArrow Table. We then call .to_pandas() to convert it into a familiar DataFrame.

Notice that data types are preserved correctly — price is float64, not a string. This is one of Parquet’s practical advantages over CSV, where you often have to explicitly cast types after reading.

Reading Specific Columns

One of Parquet’s most powerful features is the ability to read only the columns you actually need. This is called column pruning, and it can substantially reduce memory usage and load time on large files.

# Read only product name and price
table_subset = pq.read_table('inventory.parquet', columns=['product_name', 'price'])
df_subset = table_subset.to_pandas()

print(df_subset)

Output:

          product_name  price
0       Wireless Mouse  29.99
1  Mechanical Keyboard  89.99
2            USB-C Hub  49.99
3        Monitor Stand  34.99
4            Webcam HD  74.99

By passing a columns list to read_table(), PyArrow only reads those two columns from disk — the rest are skipped entirely. If your file has 100 columns but your query only needs 4, you’re doing roughly 4% of the I/O work. That adds up fast at scale.

Writing Parquet Files with Compression

Parquet supports several compression codecs. Let’s compare them on a larger dataset:

import os

# Generate a larger dataset
large_data = {
    'transaction_id': range(50000),
    'store': ['Downtown', 'Uptown', 'Suburban', 'Airport', 'Mall'] * 10000,
    'product': [f'product_{i % 200}' for i in range(50000)],
    'amount': [round(10 + (i % 500) * 0.5, 2) for i in range(50000)]
}

df_large = pd.DataFrame(large_data)
table_large = pa.Table.from_pandas(df_large)

# Write with different compression codecs
for codec in ['NONE', 'SNAPPY', 'GZIP', 'ZSTD']:
    path = f'transactions_{codec.lower()}.parquet'
    pq.write_table(table_large, path, compression=codec)
    size = os.path.getsize(path)
    print(f"{codec:<8}: {size:>10,} bytes")

Output:

NONE    :    635,565 bytes
SNAPPY  :    316,033 bytes
GZIP    :    177,654 bytes
ZSTD    :    158,082 bytes

Each codec makes a different trade-off between file size and speed:

  • NONE — no compression, fastest write, largest file
  • SNAPPY — fast compression and decompression, moderate size reduction, good for hot data
  • GZIP — better compression than Snappy, slower, good for archival
  • ZSTD — best compression ratio with competitive speed, generally the recommended default

For most production use cases, ZSTD is a solid default. It gives you the smallest files without a significant speed penalty.

Working with Row Groups and Filtering

Parquet files are internally divided into row groups — chunks of rows stored together. This structure enables a powerful optimization called predicate pushdown, where Parquet skips entire row groups that can’t possibly match your filter condition.

Let’s write a file with explicit row group sizing and then filter it:

# Write with smaller row groups (useful for filtering large files)
pq.write_table(
    table_large,
    'transactions_grouped.parquet',
    compression='ZSTD',
    row_group_size=10000
)

# Use PyArrow's dataset API for efficient filtering
import pyarrow.dataset as ds

dataset = ds.dataset('transactions_grouped.parquet', format='parquet')

# Filter: only transactions from the Downtown store over $150
filtered = dataset.to_table(
    filter=(ds.field('store') == 'Downtown') & (ds.field('amount') > 150)
)

df_filtered = filtered.to_pandas()
print(f"Matching transactions: {len(df_filtered)}")
print(df_filtered.head())

Output:

Matching transactions: 4300
   transaction_id     store      product  amount
0             285  Downtown   product_85   152.5
1             290  Downtown   product_90   155.0
2             295  Downtown   product_95   157.5
3             300  Downtown  product_100   160.0
4             305  Downtown  product_105   162.5

The ds.dataset() API with a filter argument doesn’t load the entire file into memory. Instead, it evaluates the filter against each row group’s metadata and skips groups that can’t contain matching rows. The actual rows are only read when you call .to_table(). This is particularly useful for large files.

A Practical Example: Analyzing Sales Data

Let’s put it all together with a fuller example. Say you receive daily sales exports and need to store them efficiently and query them by region and date.

from datetime import datetime, timedelta
import random

# Simulate 30 days of sales data
regions = ['North', 'South', 'East', 'West']
products = ['Laptop', 'Tablet', 'Headphones', 'Charger', 'Case']

records = []
start = datetime(2025, 1, 1)

for day in range(30):
    date = start + timedelta(days=day)
    for _ in range(200):
        records.append({
            'date': date.strftime('%Y-%m-%d'),
            'region': random.choice(regions),
            'product': random.choice(products),
            'units_sold': random.randint(1, 20),
            'revenue': round(random.uniform(50, 2000), 2)
        })

df_sales = pd.DataFrame(records)
table_sales = pa.Table.from_pandas(df_sales)

# Save as compressed Parquet
pq.write_table(table_sales, 'sales_jan2025.parquet', compression='ZSTD')
print(f"Saved {len(df_sales):,} records to Parquet")

# Query: total revenue by region for the first two weeks
dataset = ds.dataset('sales_jan2025.parquet', format='parquet')

first_two_weeks = dataset.to_table(
    columns=['region', 'revenue', 'date'],
    filter=ds.field('date') <= '2025-01-14'
).to_pandas()

summary = first_two_weeks.groupby('region')['revenue'].sum().sort_values(ascending=False)
print("\nRevenue by region (Jan 1–14):")
print(summary.round(2))

Output:

Saved 6,000 records to Parquet

Revenue by region (Jan 1–14):
region
West     752919.48
North    739783.22
South    730002.82
East     692128.58
Name: revenue, dtype: float64

This is a pattern you’ll see in real data engineering work. Parquet handles the storage efficiently, column pruning limits what gets loaded into memory, and predicate pushdown ensures only the relevant date range is scanned.

When Should You Use Parquet?

Use Parquet when you:

  • Work with analytical or reporting workloads where you query specific columns
  • Need efficient, compressed storage for medium-to-large datasets
  • Store data in cloud platforms (S3, GCS, Azure Blob) that are billed by bytes scanned
  • Use tools like Spark, DuckDB, Athena, or BigQuery that have native Parquet support
  • Want schema enforcement across your pipeline

Don’t use Parquet when you:

  • Need human-readable files; use CSV or JSON for that
  • Are doing frequent row-level updates or inserts; Parquet is write-once, read-many
  • Work with very small datasets where the overhead isn’t worth it
  • Need streaming row-by-row writes; consider Avro instead

Conclusion

I hope you found this tutorial useful. Parquet’s columnar layout, built-in compression, and schema preservation add up to faster queries, smaller files, and fewer type-related bugs.

In this tutorial, you learned how to write and read Parquet files, use column pruning to reduce I/O, apply compression codecs, and filter large files efficiently using predicate pushdown. These are the building blocks you’ll reach for repeatedly in any data-heavy Python project.

Try swapping your next CSV export for Parquet and see how it affects your pipeline. Happy coding!
