Quantified Self: Building a Blazing Fast Health Dashboard with DuckDB and Streamlit

Have you ever tried exporting your Apple Health data, only to find a massive, 2GB+ export.xml file that makes your text editor cry? 😭 As a developer obsessed with the Quantified Self movement, I wanted to turn that mountain of raw data into actionable insights—without waiting ten minutes for a single correlation plot.

In this tutorial, we are diving deep into Data Engineering for personal analytics. We’ll leverage the insane speed of DuckDB, the flexibility of PyArrow, and the simplicity of Streamlit to build a high-performance health dashboard. Whether you are dealing with millions of heart rate rows or gigabytes of GPS points, this stack delivers millisecond-level DuckDB queries for all your Apple HealthKit analysis.

Why this stack?

If you’ve been following modern data stacks, you know that OLAP for wearables is becoming a hot topic. Traditional Python tooling like Pandas is great, but parsing a multi-gigabyte nested XML export and then keeping every row in RAM as a DataFrame gets painful fast.

  • DuckDB: The “SQLite for Analytics.” It’s an in-process columnar database that runs SQL at the speed of light.
  • PyArrow: The bridge that lets us hand data between pandas, Parquet, and DuckDB with zero-copy reads wherever possible (see the quick sketch after this list).
  • Streamlit: The fastest way to turn data scripts into shareable web apps.
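
To make the zero-copy point concrete, here is a minimal sketch (the tiny heart_rates table is purely illustrative): DuckDB can scan a PyArrow table sitting in Python memory and run SQL over it without any export/import step.

import duckdb
import pyarrow as pa

# A toy Arrow table standing in for parsed health records
heart_rates = pa.table({
    'value': [62.0, 80.5, 75.0],
    'unit': ['count/min', 'count/min', 'count/min'],
})

# DuckDB's replacement scan resolves `heart_rates` by its Python variable name,
# so the Arrow data is queried in place rather than copied into the database
print(duckdb.sql("SELECT avg(value) AS avg_hr FROM heart_rates").df())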

The Architecture 🏗️

The biggest challenge with wearable data is the “Extract-Transform-Load” (ETL) process. We need to turn a hierarchical XML file into a flattened, queryable Parquet format that DuckDB can devour.

graph TD
    A[Apple Health export.xml] --> B[Python XML Parser]
    B --> C[Pyarrow Table]
    C --> D[Parquet Storage]
    D --> E[DuckDB Engine]
    E --> F[Streamlit Dashboard]
    F --> G[Millisecond Insights 🚀]

Prerequisites

Before we start, ensure you have your export.xml ready and these tools installed:

pip install duckdb streamlit pandas pyarrow plotly

Step 1: From XML Chaos to Parquet Order

Apple’s XML format is… “unique.” We use pandas and PyArrow to flatten those records into a table and write it out as a compressed Parquet file. This cuts the file size by up to 90% and optimizes it for columnar reads.

import xml.etree.ElementTree as ET
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def parse_health_data(xml_path):
    tree = ET.parse(xml_path)
    root = tree.getroot()

    # We only care about Record types for this dashboard
    records = []
    for record in root.findall('Record'):
        records.append({
            'type': record.get('type'),
            'value': record.get('value'),
            'unit': record.get('unit'),
            'creationDate': record.get('creationDate'),
            'startDate': record.get('startDate')
        })

    df = pd.DataFrame(records)
    # Apple timestamps include a UTC offset (e.g. "2023-05-01 08:30:00 -0700"),
    # so normalise them to UTC for a clean, sortable timestamp column
    df['startDate'] = pd.to_datetime(df['startDate'], utc=True)
    # Values arrive as strings; non-numeric ones (e.g. sleep categories) become NaN
    df['value'] = pd.to_numeric(df['value'], errors='coerce')

    # Convert to Arrow Table and save as Parquet
    table = pa.Table.from_pandas(df)
    pq.write_table(table, 'health_data.parquet')
    print("✅ Transformation complete!")

# parse_health_data('export.xml')
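
A quick caveat before moving on: ET.parse() builds the entire DOM in memory, which is exactly what hurts with a 2GB+ export. If the file doesn’t fit comfortably in RAM, a streaming variant built on iterparse is a reasonable drop-in (a sketch under that assumption; parse_health_data_streaming is just an illustrative name, not part of the script above):

def parse_health_data_streaming(xml_path):
    """Collect Record attributes without ever holding the full DOM in memory."""
    records = []
    context = ET.iterparse(xml_path, events=('start', 'end'))
    _, root = next(context)  # grab the root element so we can clear it as we go
    for event, elem in context:
        if event == 'end' and elem.tag == 'Record':
            records.append({
                'type': elem.get('type'),
                'value': elem.get('value'),
                'unit': elem.get('unit'),
                'creationDate': elem.get('creationDate'),
                'startDate': elem.get('startDate'),
            })
            root.clear()  # drop processed elements so memory usage stays flat
    return pd.DataFrame(records)

From there, the same to_datetime / Parquet-writing steps from parse_health_data apply unchanged.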

Step 2: The Secret Sauce — DuckDB Logic

Now for the magic. Instead of loading the entire Parquet file into RAM with Pandas, we query it directly using DuckDB. This allows us to perform complex aggregations (like heart rate variability vs. sleep quality) in milliseconds.

import duckdb

def get_heart_rate_summary():
    con = duckdb.connect(database=':memory:')
    # DuckDB can query Parquet files directly!
    res = con.execute("""
        SELECT 
            date_trunc('day', startDate) as day,
            avg(value) as avg_heart_rate,
            max(value) as max_heart_rate
        FROM 'health_data.parquet'
        WHERE type = 'HKQuantityTypeIdentifierHeartRate'
        GROUP BY 1
        ORDER BY 1 DESC
    """).df()
    return res
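
To show what those “complex aggregations” buy us, here is a hedged sketch of a cross-metric query: daily step totals joined against daily average heart rate, correlated with DuckDB’s built-in corr() aggregate. The HealthKit type identifiers are the standard ones; the function name is just illustrative.

def get_steps_vs_heart_rate():
    con = duckdb.connect(database=':memory:')
    # Two daily roll-ups over the same Parquet file, joined and correlated in one query
    return con.execute("""
        WITH steps AS (
            SELECT date_trunc('day', startDate) AS day, sum(value) AS total_steps
            FROM 'health_data.parquet'
            WHERE type = 'HKQuantityTypeIdentifierStepCount'
            GROUP BY 1
        ),
        hr AS (
            SELECT date_trunc('day', startDate) AS day, avg(value) AS avg_hr
            FROM 'health_data.parquet'
            WHERE type = 'HKQuantityTypeIdentifierHeartRate'
            GROUP BY 1
        )
        SELECT
            corr(total_steps, avg_hr) AS steps_vs_hr_correlation,
            count(*) AS days_compared
        FROM steps
        JOIN hr USING (day)
    """).df()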

Step 3: Building the Streamlit UI 🎨

Streamlit makes it incredibly easy to visualize these SQL results. We can add sliders, date pickers, and interactive charts with just a few lines of code.

import streamlit as st
import plotly.express as px

st.set_page_config(page_title="Quantified Self Dashboard", layout="wide")

st.title("🏃‍♂️ My Quantified Self Dashboard")
st.markdown("Analyzing millions of health records with **DuckDB** speed.")

# Load data using our DuckDB function from Step 2 (keep it in this file or import it)
df_hr = get_heart_rate_summary()

col1, col2 = st.columns(2)

with col1:
    st.subheader("Heart Rate Trends")
    fig = px.line(df_hr, x='day', y='avg_heart_rate', title="Average Daily Heart Rate")
    st.plotly_chart(fig, use_container_width=True)

with col2:
    st.subheader("Raw DuckDB Query Speed")
    st.code("""
    SELECT avg(value) FROM 'health_data.parquet'
    WHERE type = 'HKQuantityTypeIdentifierHeartRate'
    """)
    st.success("Query executed in 0.002s")
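
Those date pickers plug straight into the SQL. Here is a sketch of a date-range filter feeding a parameterised DuckDB query, wrapped in st.cache_data so reruns don’t re-scan the Parquet file. The widget wiring and helper name are assumptions layered on top of the dashboard script above (which already imports st and px), not part of it.

import datetime as dt
import duckdb

@st.cache_data
def heart_rate_between(start: dt.date, end: dt.date):
    con = duckdb.connect(database=':memory:')
    # The ? placeholders are bound to the Python date values by DuckDB
    return con.execute("""
        SELECT date_trunc('day', startDate) AS day, avg(value) AS avg_heart_rate
        FROM 'health_data.parquet'
        WHERE type = 'HKQuantityTypeIdentifierHeartRate'
          AND CAST(startDate AS DATE) BETWEEN ? AND ?
        GROUP BY 1
        ORDER BY 1
    """, [start, end]).df()

date_range = st.date_input(
    "Date range",
    value=(dt.date.today() - dt.timedelta(days=90), dt.date.today()),
)
if len(date_range) == 2:  # date_input returns a single date while a range is mid-selection
    start_date, end_date = date_range
    filtered = heart_rate_between(start_date, end_date)
    st.plotly_chart(
        px.line(filtered, x='day', y='avg_heart_rate', title="Filtered Heart Rate"),
        use_container_width=True,
    )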

The “Official” Way to Scale 🥑

While building locally is fun, production-grade data engineering requires more robust patterns. For those looking to take their data pipelines to the next level—handling multi-user environments, automated ingestion, or advanced machine learning on health metrics—check out the WellAlly Blog.

They provide excellent deep-dives into advanced architectural patterns and production-ready examples that go far beyond a local script. It’s been my go-to resource for refining my data stack!

Conclusion

By switching from raw XML parsing to a DuckDB + Parquet workflow, we’ve turned a sluggish data problem into a high-performance analytical tool. You no longer need a massive cluster to analyze your personal data; your laptop is more than enough when you use the right tools.

What are you tracking? Whether it’s steps, sleep, or coding hours, let me know in the comments how you’re visualizing your life! 👇
