A story about LangGraph, memory architecture, and why I stopped fighting LLMs and made the system predictable instead
It Started With a Simple Frustration
I wanted to build something like Alexa — but smarter. Not just a voice assistant that forgets you the moment the session ends. Not an AI that stores your entire conversation history in a text file and calls it “memory.”
I wanted a personal AI that actually knows you — your habits, your preferences, your tasks — and gets smarter over time the way a real assistant would.
Sounds simple. It wasn’t.
Step 1: How Does Alexa Even Work?
Before building anything, I went deep on the Alexa cloud architecture. The model is clean: your voice query goes to the cloud, gets processed, hits an LLM, and the response streams back to the device. The device itself is thin — all the intelligence lives on the server.
Okay. So I needed to build the server layer. But when I started thinking about where memory fits in, I hit the first real wall.
Where does memory live? And more importantly — what even IS memory for a personal AI?
Step 2: What Should a Personal AI Actually Remember?
This is the question most AI projects skip. They just store everything — every message, every session — and call it memory. But that’s just a log file. That’s not memory.
I spent time thinking about what actually matters for a personal AI. What does a good human assistant remember about you?
After a lot of thinking, I landed on four categories:
Identity — who you are, your name, role, basic facts
Habits — things you do regularly, routines
Preferences — how you like things done, what you enjoy
Events & Tasks — things on your calendar, things you need to do
Everything else is noise. Most of what you say to an AI doesn’t need to be stored. This felt like a small insight at the time — it turned out to be the most important design decision in the whole project.
Step 3: Where to Store It — SQL vs Vector DB
Now I had to figure out where to actually store these four types of memory.
My first instinct was a SQL database. Clean tables, structured data, easy to query. But I quickly hit a problem: you can’t query a SQL database with natural language directly. You need to know the exact keys, the exact column names. That doesn’t work when a user says “remind me what I told you about my gym schedule.”
For natural language retrieval, you need vector search — you embed the query and the stored memories as vectors and find semantic matches.
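The core of that semantic matching can be sketched with cosine similarity over embedding vectors. This is a toy illustration with made-up 3-dimensional vectors — a real system would embed with a model (Jina, in my case) and let the vector database do the search:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- illustrative numbers, not real model output.
memories = {
    "goes to the gym every morning": [0.9, 0.1, 0.2],
    "prefers tea over coffee":       [0.1, 0.8, 0.3],
}
query_vec = [0.85, 0.15, 0.25]  # pretend embedding of "what's my gym schedule?"

best = max(memories, key=lambda m: cosine_similarity(memories[m], query_vec))
```

The point: "gym schedule" never appears verbatim in the stored memory, yet the gym habit wins on meaning — something no column name could give you.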
So I ended up with a hybrid:
Postgres (SQL) — for structured memory: identity facts, tasks, calendar events. Things with clear keys you can retrieve directly.
Pinecone (Vector DB) — for semantic memory: habits, preferences, anything you’d retrieve by meaning rather than exact key.
Real data in SQL. Context and meaning in the vector store. Both working together.
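The write side of that split is a simple routing decision. Here is a minimal sketch, with in-memory lists standing in for Postgres and Pinecone (all names are illustrative, not the actual Orion code):

```python
# Structured categories go to SQL; semantic categories go to the vector store.
STRUCTURED = {"identity", "event", "task"}   # Postgres: exact-key retrieval
SEMANTIC = {"habit", "preference"}           # Pinecone: retrieval by meaning

sql_rows: list[dict] = []        # stand-in for a Postgres table
vector_records: list[dict] = []  # stand-in for a Pinecone index

def write_memory(category: str, content: str) -> str:
    """Route a classified memory item to the right store; return which one."""
    if category in STRUCTURED:
        sql_rows.append({"category": category, "content": content})
        return "sql"
    if category in SEMANTIC:
        # A real write would embed `content` first, then upsert the vector.
        vector_records.append({"category": category, "content": content})
        return "vector"
    raise ValueError(f"unknown category: {category}")
```

The four memory categories from Step 2 map cleanly onto the two stores — which is exactly why answering "what to store" before "how to store" mattered.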
Step 4: The First Approach — Just Give the LLM Everything
With the storage figured out, I built version one: give the LLM access to both databases as tools and let it figure out when to read and write.
It was clean in theory. In practice, it was a disaster.
LLMs hallucinate. The model would confidently write memory to the wrong category, retrieve irrelevant things, or — worse — make up memories that didn’t exist. When your system’s entire job is to be a reliable memory layer, hallucination is fatal.
I needed the system to be predictable. Even if it made mistakes, I needed to know where it would make mistakes.
Step 5: The Real Architecture — Nodes, Not Magic
This is when the project started actually working.
Instead of one LLM doing everything, I broke the pipeline into dedicated nodes, each with one job:
User Input
↓
[Segmentation Node]
Splits input into: memory_to_write | memory_to_fetch | ignore
↓
[Classification Node]
Labels each piece: identity | habit | preference | event | task
↓
[Router Node]
├──→ [Memory Writer] → Pinecone + Postgres (parallel)
└──→ [Memory Reader] → Fetch relevant context (parallel)
↓
[Final Answer Node]
Aggregates context → single LLM call → response
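To make the flow concrete, here is the pipeline above as plain Python functions passing a state dict — the real system wires these up as LangGraph nodes, and the keyword rules here are deliberately crude stand-ins for the actual deterministic logic:

```python
def segmentation_node(state: dict) -> dict:
    """Decide whether the input is a write, a fetch, or noise."""
    text = state["input"].lower()
    if text.startswith(("remember", "note that")):
        state["action"] = "memory_to_write"
    elif "?" in text or text.startswith(("what", "when", "remind")):
        state["action"] = "memory_to_fetch"
    else:
        state["action"] = "ignore"
    return state

def classification_node(state: dict) -> dict:
    """Label the piece with one of the memory categories (toy rules)."""
    text = state["input"].lower()
    if "every" in text or "usually" in text:
        state["category"] = "habit"
    elif "prefer" in text or "like" in text:
        state["category"] = "preference"
    else:
        state["category"] = "task"
    return state

def router_node(state: dict) -> str:
    """Pick the branch the graph should follow next."""
    return "writer" if state["action"] == "memory_to_write" else "reader"

state = {"input": "Remember that I go to the gym every morning"}
state = segmentation_node(state)
state = classification_node(state)
branch = router_node(state)
```

Each node touches one key of the state and nothing else — which is what makes failures traceable to a single node.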
Two key decisions that made this work:
- Read and write in parallel. Running them sequentially was killing latency. Parallelizing both brought response times down significantly.
- Use LLMs only where you have to. LLMs are expensive in tokens and unpredictable, so every node that could run on regex or deterministic logic does — the segmentation heuristics, the classification rules. The only LLM call that has to exist is the final answer generation.
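The parallel read/write decision is easy to sketch with `asyncio.gather`. The sleeps are stand-ins for database latency; the names are illustrative:

```python
import asyncio
import time

async def write_memory(item: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for Postgres/Pinecone write latency
    return f"wrote: {item}"

async def read_context(query: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for vector retrieval latency
    return f"context for: {query}"

async def handle_turn(item: str, query: str) -> list[str]:
    # Fire the write and the read concurrently instead of one after the other.
    return await asyncio.gather(write_memory(item), read_context(query))

start = time.perf_counter()
written, context = asyncio.run(handle_turn("gym habit", "gym schedule"))
elapsed = time.perf_counter() - start  # ~0.1s with gather, ~0.2s sequentially
```

With two 100 ms operations, the sequential version pays for both; the concurrent version pays for the slower one. At real database latencies, that difference is what the user feels.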
The result: a system that’s predictable end to end. If it gets something wrong, I know which node failed and why. That’s infinitely better than a black box that hallucinates.
Step 6: The Hardware Dream Dies (For Now)
I originally wanted Orion to be a hardware device — a tabletop robot, always listening, always learning. That vision is still there. But 2-3 months in, I made a decision: get the software layer right first.
Hardware is a multiplier. If the memory architecture is broken, a physical device just makes it worse. If the memory architecture is solid, hardware becomes a packaging problem — not a fundamental one.
So Orion is now a software-first memory layer. The hardware will come later, if at all. The memory problem was always the interesting part anyway.
What the Tech Stack Looks Like
LangGraph — orchestration framework, manages the node graph and state
Groq — fast LLM inference for the final answer node
Pinecone — vector storage for semantic memory retrieval
Postgres (Supabase) — structured memory storage
Redis — caching and fast in-session state
Jina — embeddings for vectorizing memory content
LangSmith — tracing and debugging the graph (genuinely essential)
FastAPI — serves the whole thing as a REST API
What I’d Tell Myself 3 Months Ago
The question “what to store” matters more than “how to store.” Most people jump to the tech before answering the design question. Get the design right first.
Latency is a real problem in memory systems. Parallelizing retrieval and writes isn’t optional — it’s necessary.
Changing the scope is not failure. Dropping the hardware and focusing on the software layer wasn’t giving up. It was focusing.
What’s Next
Orion is still in development. The memory layer works. The next step is making the retrieval smarter — better context injection, memory decay for old/irrelevant entries, and eventually a clean SDK that other developers can drop into their own AI projects.
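Memory decay is still on the roadmap, but one plausible sketch is to weight each retrieval score by an exponential decay on the memory’s age — this is an assumption about how I might build it, not shipped code:

```python
def decayed_relevance(similarity: float, age_days: float,
                      half_life_days: float = 30.0) -> float:
    """Discount a retrieval score by how old the memory is.

    A memory loses half its weight every `half_life_days`, so stale entries
    sink in the ranking without ever being hard-deleted.
    """
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay

# A fresh memory outranks a slightly more similar but stale one.
fresh = decayed_relevance(0.80, age_days=1)
stale = decayed_relevance(0.85, age_days=90)
```

The appeal of decay over deletion: old entries can still surface if nothing recent matches, but they stop drowning out current context.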
If you’re building something with LangGraph or agentic memory, I’d genuinely love to talk. The GitHub repo is open: github.com/vivek-1314/orion-py
Pre-final year CSE student. Building things that probably shouldn’t work yet.
