Are We Over-Engineering LLM Stacks Too Early?

I’ve been building with LLMs for a while now, and I keep noticing the same pattern.

A project starts simple.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4.1",
    input="Summarize this document"
)

It works. It feels magical.

A few weeks later, the architecture diagram looks like this:

User
  ↓
Prompt Builder
  ↓
Context Aggregator
  ↓
Vector DB (Embeddings)
  ↓
Retriever
  ↓
Model Router
  ↓
LLM
  ↓
Post-Processor

And this is before product-market fit.

It makes me wonder whether we’re solving real problems or just future-proofing imaginary ones.

The First Real Friction Isn’t Intelligence

The first thing that usually breaks isn’t reasoning quality.

It’s cost and context.

Suddenly you realize your “simple” request is actually sending:

{
  "system": "... 600 tokens ...",
  "chat_history": "... 2,800 tokens ...",
  "retrieved_chunks": "... 4,200 tokens ...",
  "user_input": "Explain this"
}

And you’re wondering why the bill doesn’t match your mental math.

Most early issues aren’t about model capability. They’re about what we’re sending to it.

Before touching architecture, I sometimes sanity-check prompts with simple token estimators. I’ve occasionally used tools like https://aitoolskit.io to review token counts and compare model pricing. Nothing fancy, just clarity on how many tokens I’m actually burning.
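The same sanity check works locally too. A minimal sketch using tiktoken; the strings here are placeholders for whatever your prompt builder actually assembles, and the encoding name is an assumption you should match to your model:

import tiktoken

# o200k_base matches recent OpenAI models; swap in the encoding for your model.
enc = tiktoken.get_encoding("o200k_base")

# Placeholder components -- in practice these come from your prompt builder.
components = {
    "system": "You are a helpful assistant specialized in summarization.",
    "chat_history": "user: hi\nassistant: hello, how can I help?",
    "retrieved_chunks": "chunk one...\nchunk two...",
    "user_input": "Explain this",
}

for name, text in components.items():
    print(f"{name}: {len(enc.encode(text))} tokens")

print("total:", sum(len(enc.encode(t)) for t in components.values()))

Seeing the per-component breakdown is usually enough to show where the budget is actually going.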

Sometimes the insight is embarrassingly simple:

You are a helpful assistant specialized in summarization.
You are a helpful assistant specialized in summarization.
You are a helpful assistant specialized in summarization.

Repeated instructions. Hidden token leakage.
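Catching that kind of leakage doesn’t need tooling either. A rough sketch, assuming you can get at the individual segments before they’re joined into the final prompt (the list below is a made-up example):

from collections import Counter

# Hypothetical prompt segments -- whatever your builder concatenates:
# system text, history turns, retrieved chunks, templates.
segments = [
    "You are a helpful assistant specialized in summarization.",
    "You are a helpful assistant specialized in summarization.",
    "You are a helpful assistant specialized in summarization.",
    "Explain this",
]

for segment, count in Counter(segments).items():
    if count > 1:
        print(f"repeated {count}x: {segment[:60]!r}")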

Token awareness alone has changed more of my architectural decisions than switching models ever did.

RAG: Necessary or Premature?

RAG is powerful. But I’ve also seen it introduced before it was truly needed.

A typical RAG setup looks something like:

# Index time: split the document, embed each chunk, store the vectors.
chunks = chunk_document(document, size=800)
embeddings = embed(chunks)
store(embeddings)

# Query time: embed the question and pull back the nearest chunks.
query_embedding = embed(user_query)
context = retrieve_similar(query_embedding, top_k=5)

# Generation: prepend the retrieved context to the user's question.
response = llm.generate(context + user_query)

Elegant in theory.

But each step adds:

  • Embedding cost (rough math below)
  • Storage cost
  • Chunking decisions
  • Retrieval tuning
  • Evaluation overhead
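On the embedding cost, the back-of-envelope math is quick. A sketch with deliberately made-up numbers; plug in your own corpus size and your provider’s actual per-token price:

# All numbers here are illustrative assumptions, not real pricing.
num_chunks = 10_000            # e.g. a few thousand documents after chunking
tokens_per_chunk = 800         # matches the chunk size above
price_per_million = 0.02       # hypothetical embedding price per 1M tokens

total_tokens = num_chunks * tokens_per_chunk
cost = total_tokens / 1_000_000 * price_per_million

print(f"{total_tokens:,} tokens -> ${cost:.2f} per full re-index")

Cheap once, but it recurs every time you re-chunk or re-embed, and it says nothing about the storage, tuning, and evaluation work that comes with it.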

Sometimes that’s justified.

Sometimes the knowledge base is small enough that static context would work. Or simple caching would solve most of it. Or trimming the prompt would reduce the need for retrieval entirely.
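For comparison, here’s roughly what the static-context route looks like, assuming the whole knowledge base fits comfortably in the context window (the file path and function name are illustrative):

from openai import OpenAI

client = OpenAI()

# Small, stable knowledge base: load it once and send it as-is.
# No chunking, no embeddings, no vector store.
knowledge_base = open("docs/faq.md").read()

def answer(user_query: str) -> str:
    response = client.responses.create(
        model="gpt-4.1",
        instructions=f"Answer using only this reference material:\n\n{knowledge_base}",
        input=user_query,
    )
    return response.output_text

If that covers the use case, the retrieval layer can wait until the corpus actually outgrows the context window.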

Complexity compounds quickly.

The Optimization Reflex

I’ve caught myself optimizing token efficiency for features that didn’t even have users yet.

  • Reducing 4,200 tokens to 3,600 tokens.
  • Switching models to save $0.002 per request.
  • Designing fallback routing logic.

All before validating whether the output itself mattered.

Classic engineer reflex.

Genuinely Curious

  • When did complexity become necessary for you?
  • At what point did token cost become painful enough to justify additional layers?
  • If you rebuilt your stack from scratch, what would you deliberately not add this time?

It feels like we’re collectively figuring this out in real time.

Would love to hear how others are navigating it.
