Building an AI Humanizer: why we stopped trying to fix prompts

This post is about a mistake we made early on: assuming that “unnatural” LLM output could be fixed at the prompt level.

It can’t. At least not reliably.

What finally worked for us was treating LLM text as a signal-processing problem at the sentence level, not a generation problem.

The signal we kept measuring 📊

We started from AI detection work, which forced us to look at text statistically instead of stylistically.

Across different LLMs and prompts, flagged samples shared the same low-level traits:

  • Sentence length variance was abnormally low
  • Clause depth was consistently shallow
  • Discourse markers repeated with high frequency
  • Sentence openers followed predictable templates

None of these are errors.

But together, they form a pattern.

When we plotted sentence length distributions, human-written text had long tails.

LLM text clustered tightly around the mean.

That clustering turned out to be a stronger signal than vocabulary choice.
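To make that concrete, here's a minimal sketch of the kind of measurement involved. The sentence splitter, the marker list, and the metric names are illustrative assumptions, not our production code.

```python
import re
import statistics
from collections import Counter

# Illustrative list of stock discourse markers; ours is longer and tuned per domain.
DISCOURSE_MARKERS = {
    "however", "moreover", "furthermore", "additionally",
    "in conclusion", "overall",
}

def sentence_stats(text: str) -> dict:
    # Naive splitter for illustration; a real pipeline would use a proper segmenter.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    openers = [s.split()[0].lower().strip(",") for s in sentences if s.split()]
    lowered = text.lower()
    marker_hits = sum(lowered.count(m) for m in DISCOURSE_MARKERS)

    return {
        "sentence_count": len(sentences),
        "mean_length": statistics.mean(lengths) if lengths else 0.0,
        # Low variance is the "clustered around the mean" signal.
        "length_variance": statistics.pvariance(lengths) if lengths else 0.0,
        # A few dominant openers indicate templated sentence starts.
        "top_openers": Counter(openers).most_common(3),
        # Stock transitions per sentence, a rough transition-density proxy.
        "marker_density": marker_hits / max(len(sentences), 1),
    }
```

Run it over a few human samples and a few LLM samples and the variance gap shows up immediately.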

Why prompts failed at fixing this 😐

Prompt instructions like "Vary sentence length" or "Write more naturally" operate at generation time, but they don't constrain local structure.

In practice, prompts affected:

  • word choice
  • tone
  • politeness

They barely affected:

  • sentence rhythm
  • transition placement
  • redundancy density

Worse, prompt changes introduced instability. Small edits caused large global shifts, which made debugging impossible.

From an engineering standpoint, that was a dead end.

Reframing the problem 🔁

We stopped treating LLM output as “final text”.

Instead, we treated it as raw material.

That led to a two-stage pipeline:

  1. Generation — optimize for clarity and correctness
  2. Sentence-level rewriting — optimize for distribution and flow

The second stage is what later became the AI Humanizer.
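In code, the split looks roughly like this. The function names (`generate_draft`, `rewrite_sentences`, `produce`) are hypothetical: stage 1 is whatever model call you already make, and stage 2 is a deterministic pass over its output.

```python
def generate_draft(prompt: str) -> str:
    """Stage 1: whatever model call you already have (placeholder)."""
    raise NotImplementedError("plug in your model client here")

def rewrite_sentences(draft: str) -> str:
    """Stage 2: sentence-level rewriting; the heuristics are sketched below."""
    return draft  # identity here; the next section shows what this pass does

def produce(prompt: str) -> str:
    draft = generate_draft(prompt)    # optimized for clarity and correctness
    return rewrite_sentences(draft)   # optimized for distribution and flow
```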

What sentence-level rewriting actually does 🧩

This does not mean paraphrasing everything.

We only touch sentences that trip specific heuristics:

  • length similarity above a threshold
  • repeated syntactic openers
  • excessive connective phrases
  • over-explained subordinate clauses

Rewrites are local:

  • split a sentence
  • compress another
  • delete a transition
  • reorder clauses

Semantics stay fixed.

Distribution changes.

That distinction matters.
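Here's a sketch of what a couple of those local rewrites could look like. The thresholds, the transition list, and the split rule are illustrative; the real pipeline has more rules and tuned thresholds.

```python
import re

# Stock transitions we consider deleting when they pile up (illustrative list).
LEADING_TRANSITIONS = re.compile(
    r"^(However|Moreover|Furthermore|Additionally|In conclusion),\s+",
    re.IGNORECASE,
)

def _opener(sentence: str) -> str | None:
    words = sentence.split()
    return words[0].lower().strip(",") if words else None

def rewrite_sentence(sentence: str, prev_opener: str | None) -> str:
    # Heuristic: a stock transition repeating the previous opener -> delete it.
    if LEADING_TRANSITIONS.match(sentence) and _opener(sentence) == prev_opener:
        sentence = LEADING_TRANSITIONS.sub("", sentence)
        sentence = sentence[:1].upper() + sentence[1:]

    # Heuristic: an overlong conjoined sentence -> split it in two.
    if len(sentence.split()) > 30 and ", and " in sentence:
        left, right = sentence.split(", and ", 1)
        sentence = f"{left}. {right[:1].upper()}{right[1:]}"

    return sentence

def rewrite_flagged(sentences: list[str]) -> list[str]:
    out: list[str] = []
    prev = None
    for s in sentences:
        rewritten = rewrite_sentence(s, prev)
        out.append(rewritten)
        prev = _opener(rewritten)
    return out
```

Because every rule is a pure function over a single sentence (plus its neighbor's opener), each change can be logged, diffed, and reverted in isolation.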

Why this works better technically ⚙️

Because it’s measurable.

After rewriting, we can observe:

  • increased sentence length variance
  • reduced opener repetition
  • lower transition density
  • more human-like rhythm curves

This makes the system debuggable.

Prompts are opaque.

Post-processing isn’t.
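Concretely, "debuggable" means you can write checks against the numbers. A sketch, assuming the `sentence_stats()` metrics from the earlier snippet:

```python
def assert_improved(before: dict, after: dict) -> None:
    # Rewriting should widen the sentence-length distribution...
    assert after["length_variance"] > before["length_variance"], "variance did not increase"
    # ...and thin out stock transitions, not add them.
    assert after["marker_density"] <= before["marker_density"], "transition density went up"
```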

Where the AI Humanizer fits 🧠

This approach eventually became the AI Humanizer inside Dechecker — not as a detector workaround, but as a controllable post-processing layer.

It has clear limits:

  • it won’t fix weak arguments
  • it can over-flatten voice if pushed too hard
  • different domains need different thresholds

But unlike prompt tuning, we can see exactly what changed and why.

Why this matters beyond detection 👀

Even if detectors didn’t exist, this problem would.

Uniform structure is tiring to read. Humans subconsciously expect irregularity. Sentence-level rewriting restores that irregularity without changing meaning.

From a systems perspective, it’s simply the right abstraction level.

Final takeaway ✅

If LLM-generated text feels unnatural, the issue is rarely what the model says.

It’s how evenly it says it.

Prompts don’t fix distributions.

Rewriting does.
