Build a self-improving AI agent that turns documents into structured data (with LangGraph)

Project: Unstructured to structured

What this AI agent actually does

This self-improving AI agent takes messy documents (invoices, contracts, medical reports, whatever) and turns them into clean, structured data and CSV tables. But here’s the kicker – it actually gets better at its job over time.

Here’s an example of what this agent can do:

Input:

[Image 1 and Image 2: sample input documents]

What you get as output:

Structured data (CSV files), with high accuracy:

More CSV files:

You also get a JSON file for each document – the structured data in JSON format, following a general schema shared by all documents. The LLM decides which schema fits best depending on the type of document you ingested, so you can manipulate your data in different ways:

{
  "header": {
    "document_title": {
      "value": "Purchase Order",
      "normalized_value": "Purchase Order",
      "reason": "Top-right prominent heading reads 'Purchase Order' (visual title header).",
      "confidence": 0.95
    },
    "purchase_order_number": {
      "value": "PO000495",
      "normalized_value": "PO000495",
      "reason": "Label 'PO No: PO000495' printed near the header on the right; matched to schema synonyms 'PO No' / 'Purchase Order #'.",
      "confidence": 0.95
    },
    "po_date": {
      "value": "04/26/2017",
      "normalized_value": "2017-04-26",
      "reason": "Date '04/26/2017' directly under PO number in header; normalized to ISO-8601.",
      "confidence": 0.95
    }
  },
  "parties": {
    "bill_to_name": {
      "value": "PLANERGY Boston Office",
      "normalized_value": "PLANERGY Boston Office",
      "reason": "Top-left block under the company logo lists 'PLANERGY' and 'Boston Office' — interpreted as the billing/requesting organization.",
      "confidence": 0.88
    },
...
 "items": {
    "items": {
      "value": [
        {
          "item_name": "Nescafe Gold Blend Coffee 7oz",
          "item_description": null,
          "item_code": "QD2-00350",
          "sku": "QD2-00350",
          "quantity": 1.0,
          "unit": null,
          "unit_price": 34.99,
          "discount": 0.0,
          "line_total": 34.99,
          "currency": "USD"
        },
        {
          "item_name": "Tettley Tea Round Tea Bags 440/Pk",
          "item_description": null,
          "item_code": "QD2-TET440",
          "sku": "QD2-TET440",
          "quantity": 1.0,
          "unit": null,
          "unit_price": 20.49,
          "discount": 0.0,
          "line_total": 20.49,
          "currency": "USD"
        },
...

What my AI agent actually does (and why it’s pretty cool)


So I built this thing called “Unstructured to Structured”, and honestly, it’s doing some pretty wild stuff. Let me break down what’s actually happening under the hood.

The problem I was trying to solve

You know how most AI agents are pretty static? They follow some instructions, give you an output, and if an error occurs nothing happens until an engineer steps in and fixes something manually. My agent has superpowers – it actually improves itself. Literally: it makes autonomous fixes.

Like, imagine you have a bunch of invoices: you upload them and the AI processes them into structured data. Most AI tools would just hand you the data, and if there’s an error in field extraction or schema inference, nothing happens. Mine analyzes the documents, infers the schema, extracts the data, and if the LLM drifts – say the field mapping was wrong 😱 – it detects the issue and fixes itself so it never happens again.

Full code open source at: https://github.com/your-username/handit-examples

Let’s dive in!

Table of Contents

  • What my AI agent actually does (and why it’s pretty cool)
  • The problem I was trying to solve
  • 1. Architecture Overview
  • 2. Setting Up Your Environment
    • Backend
    • Frontend
  • 3. The Core: LangGraph Workflow 🧠
  • 4. Node Classes: Specialized Tools for Every Task 🎯
    • Inference Schema Node – the schema detective
    • Invoice Data Capture Node – the data extractor
    • Generate CSV Node – the table builder
  • 5. The self-improvement (Best Part)
    • 1. Let’s set up Handit.ai observability
    • 2. Set up evaluations
    • 3. Set up self-improvement (very interesting part)
  • 6. Results
  • 7. Conclusions

1. Architecture Overview

Let’s understand the architecture of our Unstructured to Structured AI agent:

[Document Upload] → [Schema Inference] → [Data Extraction] → [CSV Generation]
       ↓                    ↓                    ↓                    ↓
   FastAPI Server    LangGraph Node      LangGraph Node      LangGraph Node
   (Rate Limited)    (AI Schema)         (AI Extraction)     (AI Tables)
       ↓                    ↓                    ↓                    ↓
   Handit.ai         Handit.ai           Handit.ai           Handit.ai
   Tracing           Tracing             Tracing             Tracing

This architecture separates concerns into distinct nodes:

  • FastAPI Server: Handles document uploads with rate limiting and CORS protection (see the sketch after this list)
  • Schema Inference Node: Uses AI to analyze all documents and create a unified JSON schema
  • Data Extraction Node: Maps document content to the inferred schema using AI
  • CSV Generation Node: Creates structured tables and CSV files from the extracted data
  • Handit.ai Integration: Every step is traced, evaluated, and can be automatically improved
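
Since the FastAPI layer isn’t shown elsewhere in this post, here’s a minimal sketch of what that upload entry point can look like. Everything here – the route name, the in-memory rate limiter – is an illustrative assumption, not the repo’s actual code:

# Illustrative sketch only – route name and rate limiter are assumptions,
# not the repo's actual implementation
import time
from typing import Dict, List

from fastapi import FastAPI, File, HTTPException, Request, UploadFile
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"])

# Naive in-memory rate limiter: N requests per window (seconds)
_hits: Dict[str, List[float]] = {}

def _rate_limited(ip: str, limit: int = 100, window: int = 3600) -> bool:
    now = time.time()
    _hits[ip] = [t for t in _hits.get(ip, []) if now - t < window]
    _hits[ip].append(now)
    return len(_hits[ip]) > limit

@app.post("/process")  # hypothetical route – check the repo for the real one
async def process(request: Request, files: List[UploadFile] = File(...)):
    if _rate_limited(request.client.host):
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    # Save uploads to disk, then hand the file paths to the LangGraph workflow
    ...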

2. Setting Up Your Environment

Backend

1. Clone the Repository

git clone https://github.com/your-username/handit-examples.git
cd handit-examples/examples/unstructured-to-structured

2. Create Virtual Environment

# Create virtual environment
python -m venv .venv

# Activate virtual environment
# On macOS/Linux:
source .venv/bin/activate

# On Windows:
.venv\Scripts\activate

3. Install Dependencies

# Install dependencies
pip install -r requirements.txt

4. Environment Configuration

# Copy environment example
cp .env.example .env

5. Configure API Keys

Edit the .env file and add your API keys:

# Required API Keys
# Get your API key from: https://platform.openai.com/api-keys
OPENAI_API_KEY=your_openai_api_key_here
# Get your API key from: https://www.handit.ai/
HANDIT_API_KEY=your_handit_api_key_here

# Optional Configuration
OPENAI_MODEL=gpt-4o-mini
RATE_LIMIT_REQUESTS=100
RATE_LIMIT_WINDOW=3600
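
For reference, here’s a minimal sketch of how these variables might be read at startup (assuming python-dotenv, which is a common choice – check requirements.txt for what the repo actually uses):

# Sketch: variable names match the .env above; python-dotenv is an assumption
import os
from dotenv import load_dotenv

load_dotenv()  # read .env into the process environment

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]            # required
HANDIT_API_KEY = os.environ["HANDIT_API_KEY"]            # required
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")  # optional
RATE_LIMIT_REQUESTS = int(os.getenv("RATE_LIMIT_REQUESTS", "100"))
RATE_LIMIT_WINDOW = int(os.getenv("RATE_LIMIT_WINDOW", "3600"))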

6. Run the Application 🚀

Development Mode

# Make sure virtual environment is activated
source .venv/bin/activate  # macOS/Linux
# or
.venv\Scripts\activate     # Windows

# Start the FastAPI server
python main.py

The server will start on http://localhost:8000

Frontend (Optional)

You can test the API directly using the FastAPI docs at http://localhost:8000/docs, or build a simple frontend to upload documents.
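
If you’d rather script the upload than click through the docs page, a few lines of Python do the trick (the /process route is an assumption – check http://localhost:8000/docs for the real path):

# Hedged example: the endpoint path is hypothetical, see /docs for the real route
import requests

files = [
    ("files", open("invoice1.pdf", "rb")),
    ("files", open("invoice2.png", "rb")),
]
response = requests.post("http://localhost:8000/process", files=files)
print(response.status_code, response.json())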

3. The Core: LangGraph Workflow 🧠

Think of it as a smart pipeline that processes documents step by step. Here’s what happens:

  1. You upload documents – like invoices, contracts, medical reports (any format)
  2. The agent analyzes everything – it looks at all your documents and figures out the best structure
  3. It creates a unified schema – one JSON schema that can represent all your documents
  4. Then extracts the data – maps each document to the schema with AI
  5. Finally builds tables – creates CSV files and structured data you can actually use

Here’s the main workflow:

# The LangGraph workflow
from langgraph.graph import StateGraph, END

def create_workflow():
    workflow = StateGraph(GraphState)  # GraphState holds the shared pipeline state

    # Add nodes
    workflow.add_node("inference_schema", inference_schema)
    workflow.add_node("invoice_data_capture", invoice_data_capture)
    workflow.add_node("generate_csv", generate_csv)

    # Define the flow
    workflow.set_entry_point("inference_schema")
    workflow.add_edge("inference_schema", "invoice_data_capture")
    workflow.add_edge("invoice_data_capture", "generate_csv")
    workflow.add_edge("generate_csv", END)  # finish after CSV generation

    return workflow.compile()
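
Once compiled, the graph is invoked with an initial state dict. A minimal usage sketch – the exact GraphState keys are assumptions based on the node snippets below:

# Usage sketch: state keys are illustrative, not the repo's exact GraphState
app = create_workflow()

final_state = app.invoke({
    "unstructured_paths": ["uploads/invoice1.pdf", "uploads/invoice2.png"],
    "execution_id": "run-001",
})

print(final_state.get("inferred_schema"))
print(final_state.get("generated_files"))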

4. Node Classes: Specialized Tools for Every Task 🎯

Inference Schema Node – the schema detective

This is where the magic starts. When you upload documents, this thing:

  1. Analyzes all your documents – it reads images, PDFs, text files
  2. Figures out the structure – it creates a unified JSON schema that fits everything
  3. Handles any document type – invoices, contracts, medical reports, whatever
  4. Creates a smart schema – with synonyms, field types, and reasoning

Here’s the heart of the node:

def inference_schema(state: GraphState) -> Dict[str, Any]:
    # Build multimodal message with all documents
    human_message = _build_multimodal_human_message(unstructured_paths)

    # Ask the LLM to infer the schema
    schema_result = schema_inferencer.invoke({"messages": [human_message]})

    # Track everything with Handit.ai
    tracker.track_node(
        input={"systemPrompt": get_system_prompt(), "userPrompt": user_prompt_summary, "images": image_attachments},
        output=inferred_schema,
        node_name="inference_schema",
        agent_name=agent_name,
        node_type="llm",
        execution_id=execution_id
    )
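
The _build_multimodal_human_message helper packs every uploaded document into a single message. A simplified sketch of how that can work – attaching local images as base64 data URLs, with the MIME type hardcoded for brevity:

# Simplified sketch of the multimodal message builder (MIME type hardcoded)
import base64
from pathlib import Path
from typing import List

from langchain_core.messages import HumanMessage

def _build_multimodal_human_message(paths: List[str]) -> HumanMessage:
    content = [{"type": "text",
                "text": "Infer one unified JSON schema that covers all of these documents."}]
    for path in paths:
        data = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{data}"}})
    return HumanMessage(content=content)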

Invoice Data Capture Node – the data extractor

For each document, it maps the content to your schema:

def invoice_data_capture(state: GraphState) -> Dict[str, Any]:
    # Load the inferred schema
    inferred_schema = state.get("inferred_schema")

    # Process each document
    for invoice_path in invoices_paths:
        # Create multimodal input (text + images)
        messages = [HumanMessage(content=[
            {"type": "text", "text": "Map the document to the provided schema..."},
            {"type": "image_url", "image_url": {"url": data_url}}
        ])]

        # Extract data using AI
        extraction_result = invoice_data_extractor.invoke({
            "messages": messages, 
            "schema_json": schema_json_text
        })

        # Track with Handit.ai
        tracker.track_node(
            input={"systemPrompt": get_system_prompt(), "userPrompt": get_user_prompt(), "images": image_attachments},
            output=result_dict,
            node_name="invoice_data_capture",
            agent_name=agent_name,
            node_type="llm",
            execution_id=execution_id
        )

Generate CSV Node – the table builder

Finally, it creates structured tables from all your data:

def generate_csv(state: GraphState) -> Dict[str, Any]:
    # Load all the extracted JSON data
    all_json_data = []
    for json_path in structured_json_paths:
        with open(json_path, "r", encoding="utf-8") as f:
            json_data = json.load(f)
        all_json_data.append({"filename": filename, "data": json_data})

    # Ask the LLM to create tables
    llm_response = csv_generation_planner.invoke({
        "documents_inventory": all_json_data
    })

    # Generate CSV files
    generated_files = _save_tables_to_csv(tables, output_dir)

    # Track with Handit.ai
    tracker.track_node(
        input={"systemPrompt": get_system_prompt(), "userPrompt": get_user_prompt(), "documents_inventory": all_json_data},
        output={"tables": tables, "plan": plan, "generated_files": generated_files},
        node_name="generate_csv",
        agent_name=agent_name,
        node_type="llm",
        execution_id=execution_id
    )
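
The _save_tables_to_csv helper is plain Python. A hedged sketch, assuming each table arrives as a dict with a name, a column list, and rows:

# Sketch: assumes tables shaped like {"name": ..., "columns": [...], "rows": [[...], ...]}
import csv
from pathlib import Path
from typing import Dict, List

def _save_tables_to_csv(tables: List[Dict], output_dir: str) -> List[str]:
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for table in tables:
        path = out / f"{table['name']}.csv"
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(table["columns"])
            writer.writerows(table["rows"])
        written.append(str(path))
    return written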

Want to dive deep into the nodes and prompts? Check out the full open-source code!

5. The self-improvement (Best Part)

Here’s the really cool thing – this AI agent actually gets better over time. The secret weapon? Handit.ai.

Every action, every response is fully observed and analyzed. The system can see:

  • Which schema inferences worked well
  • Which data extractions failed
  • How long processing takes
  • What document types cause issues
  • When the LLM makes mistakes
  • And more…

And yes – when this powerful tool detects a mistake, it fixes it automatically.

This means the AI agent can actually improve itself. If the LLM extracts the wrong field or generates incorrect schemas, Handit.ai tracks that failure and automatically adjusts the AI agent to prevent the same mistake from happening again. It’s like having an AI engineer who is constantly monitoring, evaluating and improving your AI agent.

To enable self-improvement, we need to complete these steps:

1. Let’s set up Handit.ai observability

This gives us full tracing to see inside our LLMs and tools and understand what they’re doing.

Note that this project comes preconfigured with Handit.ai observability – you’ll only need to get your own API token. Follow these steps:

1. Create an account here: Handit.ai

2. After creating your account, get your token here: Handit.ai token

3. Copy your token and add it to your .env file:

HANDIT_API_KEY=your_handit_token_here

Once you have completed this step, every time you upload documents and process them you will get full observability in the Handit.ai Tracing Dashboard.

2. Set up evaluations

1. Add your AI provider (OpenAI, GoogleAI, etc.) token here: Token for evaluation

2. Assign evaluators to your LLM nodes here: Handit.ai Evaluation

For this project specifically, assign:

  • Correctness Evaluation to invoice_data_capture – this will evaluate the accuracy of data extraction
  • Schema Validation to inference_schema – this will evaluate the quality of schema inference
  • Data Quality to generate_csv – this will evaluate the CSV generation quality

3. Set up self-improvement (very interesting part)

1. Run this on your terminal:

npm install -g @handit.ai/cli

2. Run this command and follow the terminal instructions – it connects your repository to Handit for automatic PR creation:

handit-cli github

✨ What happens next: Every time Handit detects that your AI agent failed, it will automatically send you a PR to your repo with the fixes!

This is like having an AI engineer who never sleeps, constantly monitoring your agent and fixing issues before you even notice them! 🤖👨‍💻

6. Results

To test the project, first upload some documents (jpg, jpeg, png, pdf) using the API endpoint.

First test: Upload 2-3 invoices and let the AI process them

Result: You’ll get:

  • A unified JSON schema that fits all your documents
  • Structured data extracted from each document
  • CSV files with your data organized in tables

Second test: Check Handit.ai dashboard to see the full tracing

Result: You’ll see exactly how the AI processed each document, what prompts it used, and how it made decisions.

Third test: If there are any errors, Handit.ai will detect them and automatically create PRs to fix your agent!

7. Conclusions

Thanks for reading!

I hope this deep dive into building a self-improving AI agent for document processing has been useful for your own projects.

The project is fully open source – feel free to:
🔧 Modify it for your specific document types (receipts, forms, reports, etc.)
🏭 Adapt it to any industry (healthcare, finance, legal, retail, etc.)
🚀 Use it as a foundation for your own AI agents
🤝 Contribute improvements back to the community

Full code open source at: https://github.com/your-username/handit-examples

This project comes with Handit.ai configured. If you want to configure Handit.ai for your own projects, I suggest following the documentation: https://docs.handit.ai/quickstart

What new feature should this project have? Let me know in the comments! 💬

Key Features:

  • 🧠 Smart Schema Inference: Automatically creates unified schemas for any document type
  • 🔍 Multimodal Processing: Handles images, PDFs, and text files
  • 📊 Structured Output: Generates clean JSON and CSV files
  • 🚀 Self-Improving: Automatically fixes issues using Handit.ai
  • 🛡️ Production Ready: Rate limiting, error handling, and comprehensive logging
  • 🔄 LangGraph Workflow: Modern, scalable AI agent architecture

Perfect for:

  • Document processing automation
  • Data extraction from unstructured sources
  • Invoice and receipt processing
  • Contract analysis and data extraction
  • Medical report processing
  • Any document-to-data conversion task
