The null input that broke my production agent and what fixed it

The demo ran flawlessly for three weeks. Every test input parsed clean, every output routed correctly, and I thought we had a reliable system.

Then a supplier sent a confirmation email with an empty subject line.

The agent, which was supposed to extract order references and route them into a queue, got a null where it expected a string. It didn’t crash. That would have been better. Instead it generated a plausible-looking order reference, routed it, and the downstream system processed it as if it were real. Nobody caught it for four hours.

That is the demo problem: demos use inputs that look like what you expect. Production does not.

What the demo hid

I built and run the agent operation at aienterprise.dk, so I control every layer of the stack. When this broke, I could see the full trace. The agent’s system prompt said “extract the order reference from the subject line.” Sensible instruction. Works every time the subject line exists.

When it doesn’t exist, a well-prompted LLM doesn’t say “I cannot find an order reference.” It fills the gap. It invents something that looks right. The hallucination isn’t random noise. It’s plausible, structured noise. That’s what makes it dangerous. A random failure is easy to catch. A confident, well-formatted wrong answer is not.

In the demo, I never sent an email with a null subject. I never thought to. The input felt so basic I didn’t consider it an edge case. It isn’t an edge case in production. It’s Tuesday.

The unglamorous fix

I didn’t retrain anything. I didn’t adjust the prompt. I added a guard before the model call.

Now, before the agent touches any input, a deterministic check runs: is the subject field present and non-empty? If not, the message routes to a hold queue with a flag. A human reviews it. The agent never sees the malformed input.

That guard is twelve lines of code. It’s the least interesting thing I built all year. It’s also what makes the agent reliable.
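
For readers who want something concrete, here is a minimal sketch of a guard along these lines. It is not the production code. It assumes the parsed email arrives as a Python dict with a "subject" key, and the queue names, flag strings, and the function name route_before_model are all hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical queue names, used only for illustration.
HOLD_QUEUE = "hold"
AGENT_QUEUE = "agent"


def route_before_model(message: dict) -> Tuple[str, Optional[str]]:
    """Decide where a message goes before any model call.

    Returns (queue_name, flag). The flag is None when the subject is usable.
    """
    subject = message.get("subject")

    # Missing key, explicit None, or a non-string value: hold, never guess.
    if not isinstance(subject, str):
        return HOLD_QUEUE, "subject_missing"

    # An empty or whitespace-only subject is just as unusable.
    if not subject.strip():
        return HOLD_QUEUE, "subject_empty"

    return AGENT_QUEUE, None
```

The point of the design is that the check is deterministic and runs first: the model is only called when the function returns the agent queue, so it never gets the chance to fill a gap.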

The pattern generalizes. Every place an agent assumes structure in its input is a place production will eventually send you unstructured data. The fix isn’t a smarter model. The fix is a boundary: a check that runs before the model and routes bad input to a human instead of letting the model guess.
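
One way to sketch the general version is a small table of structural assumptions, each with its own deterministic check. The field names below ("subject", "sender", "body") and the helper names are assumptions for illustration, not a description of the real system.

```python
from typing import Callable, Dict, List

# A check returns True when the message satisfies one structural assumption.
Check = Callable[[dict], bool]


def non_empty_str(field: str) -> Check:
    """Build a check that the given field is a non-blank string."""
    def check(message: dict) -> bool:
        value = message.get(field)
        return isinstance(value, str) and bool(value.strip())
    return check


# Hypothetical list of everything the agent assumes about its input.
REQUIRED: Dict[str, Check] = {
    "subject": non_empty_str("subject"),
    "sender": non_empty_str("sender"),
    "body": non_empty_str("body"),
}


def failed_checks(message: dict) -> List[str]:
    """Return the name of every assumption the message violates."""
    return [name for name, check in REQUIRED.items() if not check(message)]
```

Any message with a non-empty result would go to the hold queue, with the list of failed checks attached as the flag for the human reviewer.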

This is what I mean when I say reliability is the only feature. A demo proves an agent can do the task. Production proves it does the task, again, on the bad input, at 3am, when no one is watching. Those are different claims. Only the second one matters to anyone paying for it.

The agent now processes roughly 200 routing operations per day without incident. The hold queue gets used about twice a week. When it does, a human looks at whatever weird thing arrived, handles it, and I learn something new about what production actually looks like.

A note for 2027

If you’re building agents for clients in high-risk categories under the EU AI Act, the compliance deadline is December 2, 2027. That covers employment decisions, biometrics, border control, education systems. Not far off.

A system that routes confidently on bad inputs and produces plausible wrong answers won’t survive an audit. The guard I described isn’t just good engineering. For systems in scope, it’s a compliance minimum. The European Commission published draft Article 6 classification guidelines this month. If you haven’t checked whether your system is in scope, now is the time.

Reliability isn’t a feature you add later. The hold queue proves that. The hallucinated order reference proves it too, in the more expensive way.
