I Thought Regex Could Handle It: My Data Extraction Rabbit Hole

A few months ago, I was building a tool to automatically parse invoice emails. You know the drill: subject line like “Invoice #12345 from ACME Corp – $1,234.56 due 2024-03-15”. Seemed straightforward. I spent a day crafting the perfect regex pattern, feeling smug when it worked on the first 10 emails.

Then email #11 arrived. The subject was “Your invoice from ACME Corp (ref: INV-12345) – please pay $1,234.56 by 2024-03-15”. My regex broke. I tweaked it. Then email #12 had “INVOICE: ACME Corp, Amount Due: $1,234.56, Due Date: 2024-03-15”. My regex grew into a monster with optional groups and lookaheads. I knew I was on the wrong path.
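
To make the failure concrete, here is a minimal sketch of that kind of pattern. It is a hypothetical reconstruction for illustration, not the original regex, and the names `SUBJECT_RE` and `extract_fields` are assumptions. It matches the first subject line and quietly returns nothing for the other two.

```python
import re

# Hypothetical pattern tuned to the first subject format only
# (not the original regex; group names are illustrative).
SUBJECT_RE = re.compile(
    r"Invoice #(?P<number>\d+) from (?P<vendor>.+?) – "
    r"\$(?P<amount>[\d,]+\.\d{2}) due (?P<due>\d{4}-\d{2}-\d{2})"
)

subjects = [
    "Invoice #12345 from ACME Corp – $1,234.56 due 2024-03-15",
    "Your invoice from ACME Corp (ref: INV-12345) – please pay $1,234.56 by 2024-03-15",
    "INVOICE: ACME Corp, Amount Due: $1,234.56, Due Date: 2024-03-15",
]


def extract_fields(subject):
    """Return the captured fields as a dict, or None when the pattern misses."""
    match = SUBJECT_RE.search(subject)
    return match.groupdict() if match else None


matches = [extract_fields(s) for s in subjects]

for subject, fields in zip(subjects, matches):
    print(fields if fields else f"no match: {subject}")
```

Only the first subject yields a dict. The other two carry the same information in a different order, which is exactly what a positional pattern cannot absorb.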

The Dead Ends

More Regex

I tried building a library of patterns. It worked for about 60% of cases. Every new vendor introduced a new format. Maintenance was a nightmare. I spent more time debugging regex than building features.

Rule-Based Parsers

I moved to Python’s dateutil and some simple string matching. Still fragile. Any slight deviation in date format or wording caused silent failures.
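
The silent failures looked roughly like the sketch below. It uses the standard library's `datetime.strptime` as a stand-in for the dateutil-based code, and `KNOWN_FORMATS` and `parse_due_date` are hypothetical names. The point is that a rule-based parser either returns nothing or returns a plausible but wrong date, and neither case raises an error.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical list of formats, tried in order; the first one that fits wins.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y", "%B %d, %Y"]


def parse_due_date(text: str) -> Optional[date]:
    """Try each known format in turn; return None if none of them fits."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


# Ambiguous input: parsed as March 4 even if the vendor meant 3 April.
ambiguous = parse_due_date("03/04/2024")

# Unknown wording: no exception, just None.
unknown = parse_due_date("30 Apr 2024")

print(ambiguous, unknown)
```

Reordering the formats only moves the problem: whichever of the month-first and day-first formats comes first decides how every ambiguous date is read.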

ML with spaCy

I thought, “Let’s train a custom NER model!” I spent two weeks labeling invoices. The model learned to find monetary amounts and dates, but it couldn’t understand context—like figuring out which date was the due date vs the invoice date. And retraining for new fields required more data and labeling.

What Eventually Worked: Structured Output with LLMs

I realized I didn’t need to understand every format. I needed a system that could read English (or any language) and extract structured data reliably. Large Language Models (LLMs) with function calling (or structured output) were the answer.

Here’s the core technique: instead of asking the model for freeform text, you give it a JSON schema and tell it to output valid JSON matching that schema. This works surprisingly well.

Code Example: Extracting Invoice Data

import json
from openai import OpenAI

client = OpenAI()

# Define the output schema as a tool (function) definition.
# The older `functions` / `function_call` parameters are deprecated
# in favor of `tools` / `tool_choice`.
tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_invoice",
            "description": "Extract invoice details from an email",
            "parameters": {
                "type": "object",
                "properties": {
                    "vendor_name": {"type": "string"},
                    "invoice_number": {"type": "string"},
                    "amount_due": {"type": "number"},
                    "due_date": {"type": "string", "format": "date"},
                    "currency": {"type": "string"}
                },
                "required": ["vendor_name", "amount_due", "due_date"]
            }
        }
    }
]

# Example email text (could be from any source)
email_text = """
Subject: Invoice #INV-7890 from Widgets Inc.
Dear customer, your invoice for $567.89 is due by April 30, 2024.
Please pay in USD.
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Extract the requested fields from the email. Return valid JSON."},
        {"role": "user", "content": email_text}
    ],
    tools=tools,
    # Force a call to extract_invoice instead of a freeform reply
    tool_choice={"type": "function", "function": {"name": "extract_invoice"}}
)

# Parse the structured output from the forced tool call
tool_call = response.choices[0].message.tool_calls[0]
extracted = json.loads(tool_call.function.arguments)
print(extracted)

Output:

{
  "vendor_name": "Widgets Inc.",
  "invoice_number": "INV-7890",
  "amount_due": 567.89,
  "due_date": "2024-04-30",
  "currency": "USD"
}

Why This Works

The model uses its language understanding to infer fields from context.
You control the schema, so output is predictable.
It works with minimal prompt engineering—just describe what you want.
It handles variations: “due by April 30, 2024”, “due date: 2024-04-30”, and “payment deadline: 30/04/2024” all produce the same ISO date, 2024-04-30.

The Hard Lessons

LLMs aren’t magic. Here’s what I learned:

Cost – Each extraction costs pennies. For small volumes (hundreds a day) it’s fine. For millions, you need a cheaper alternative.
Latency – OpenAI’s response time is usually 1-3 seconds. For real-time apps, that might be too slow.
Hallucinations – If the email doesn’t contain a required field, the model might make one up, because a required field pushes it to return something. Validate every output, and mark a field as required only if you expect it in every email you process.
Context length – Long emails might get truncated. Chunking and a two-stage pipeline (classify + extract) helps.
Model choice – GPT-4 is best, but GPT-3.5-turbo sometimes fails on complex schemas. For production, I switched to a dedicated API that handles retries and validation under the hood—there are several out there, including services like Interwest Info’s AI API (I used it after hitting rate limits with OpenAI). But the technique remains the same.
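
For the hallucination problem in particular, a few cheap checks catch a lot. The sketch below is one possible validator, not the one from my pipeline: `validate_invoice` and `REQUIRED` are hypothetical names, and the "grounding" check assumes the amount appears in the email formatted with two decimals.

```python
from datetime import date

# Mirrors the "required" list in the schema above.
REQUIRED = ("vendor_name", "amount_due", "due_date")


def validate_invoice(extracted: dict, source_text: str) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []

    for field in REQUIRED:
        if extracted.get(field) in (None, ""):
            problems.append(f"missing {field}")

    amount = extracted.get("amount_due")
    if amount is not None:
        is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        if not is_number or amount <= 0:
            problems.append("amount_due is not a positive number")
        # Grounding check: the amount should literally appear in the email,
        # with or without thousands separators.
        elif f"{amount:,.2f}" not in source_text and f"{amount:.2f}" not in source_text:
            problems.append("amount_due not found in source text")

    due = extracted.get("due_date")
    if due:
        try:
            date.fromisoformat(due)
        except (TypeError, ValueError):
            problems.append("due_date is not an ISO date")

    return problems
```

A record that fails any check can be routed to manual review instead of being written to the database. The grounding check is deliberately strict: it rejects amounts the model may have computed or invented, at the cost of some false alarms on unusual number formatting.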

When NOT to Use This Approach

If your data is highly structured and fixed (e.g., CSV columns), regex or a parser is faster and cheaper.
If you need real-time extraction (milliseconds), LLMs are too slow.
If you need guaranteed correctness (e.g., medical data), LLMs can’t provide that.

What I’d Do Differently Next Time

I’d start with the LLM approach from the beginning, but I’d also build a fallback chain:

Try regex for known patterns.
If that fails, call an LLM.
Log everything to improve the regex library over time.
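
Those three steps can be sketched as follows. This is an outline under assumptions, not production code: `KNOWN_PATTERNS`, `try_regex`, and `extract_invoice` are hypothetical names, and the LLM call is passed in as a plain callable so the chain can be tested without hitting an API.

```python
import logging
import re
from typing import Callable, Optional

logger = logging.getLogger("invoice_extraction")

# Hypothetical pattern library; in practice it would grow from the logged misses.
KNOWN_PATTERNS = [
    re.compile(
        r"Invoice #(?P<invoice_number>\d+) from (?P<vendor_name>.+?) – "
        r"\$(?P<amount_due>[\d,]+\.\d{2}) due (?P<due_date>\d{4}-\d{2}-\d{2})"
    ),
]


def try_regex(text: str) -> Optional[dict]:
    """Return fields from the first matching known pattern, or None."""
    for pattern in KNOWN_PATTERNS:
        match = pattern.search(text)
        if match:
            fields = match.groupdict()
            # Normalize "1,234.56" to a float so both paths return the same type.
            fields["amount_due"] = float(fields["amount_due"].replace(",", ""))
            return fields
    return None


def extract_invoice(text: str, llm_extract: Callable[[str], dict]) -> dict:
    """Try the cheap regex path first, then fall back to the LLM."""
    fields = try_regex(text)
    if fields is not None:
        logger.info("regex hit: %r", text)
        return {**fields, "source": "regex"}
    # Log the miss so this text can later be turned into a new pattern.
    logger.info("regex miss, calling LLM: %r", text)
    return {**llm_extract(text), "source": "llm"}
```

Here `llm_extract` would wrap the OpenAI call shown earlier. Tagging each result with its `source` makes it easy to measure how often the regex path is enough.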

Also, I’d use a structured output library like jsonformer or outlines to constrain generation even more.

The Takeaway

Regex is great for well-defined problems. But real-world text is messy. LLMs give us a way to handle that mess without building a million rules. The key is to treat them as a tool in your parsing toolbox—not a silver bullet.

Now I’m curious: What’s your go-to approach for extracting data from messy documents? Still wrestling with regex, or have you joined the LLM camp? Let me know in the comments.