This post is about a mistake we made early on: assuming that “unnatural” LLM output could be fixed at the prompt level.
It can’t. At least not reliably.
What finally worked for us was treating LLM text as a signal-processing problem at the sentence level, not a generation problem.
The signal we kept measuring 📊
We started from AI detection work, which forced us to look at text statistically instead of stylistically.
Across different LLMs and prompts, flagged samples shared the same low-level traits:
(A rough sketch of how we measure a couple of these follows the list.)
- Sentence length variance was abnormally low
- Clause depth was consistently shallow
- Discourse markers repeated with high frequency
- Sentence openers followed predictable templates
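For concreteness, here is roughly how two of those traits can be measured. This is a minimal sketch with a naive regex sentence splitter and a hand-picked marker list; the function names and the marker set are illustrative, not our production code, and clause depth is left out because it needs a parser.

```python
import re
from collections import Counter

# Hand-picked for illustration; a real marker list is longer and tuned.
DISCOURSE_MARKERS = {"however", "moreover", "furthermore", "additionally", "overall"}

def split_sentences(text: str) -> list[str]:
    # Naive splitter, good enough for a sketch; use a proper tokenizer in practice.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def opener_repetition(sentences: list[str]) -> float:
    # Fraction of sentences whose first two words duplicate another sentence's opener.
    openers = [" ".join(s.lower().split()[:2]) for s in sentences if s.split()]
    if not openers:
        return 0.0
    counts = Counter(openers)
    return sum(c for c in counts.values() if c > 1) / len(openers)

def marker_density(sentences: list[str]) -> float:
    # Average number of discourse markers per sentence.
    hits = sum(
        1
        for s in sentences
        for w in re.findall(r"[a-z']+", s.lower())
        if w in DISCOURSE_MARKERS
    )
    return hits / max(len(sentences), 1)
```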
None of these are errors.
But together, they form a pattern.
When we plotted sentence length distributions, human-written text had long tails.
LLM text clustered tightly around the mean.
That clustering turned out to be a stronger signal than vocabulary choice.
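A crude way to see the same thing in numbers: summarize each sample's sentence-length distribution. The 2x-mean cutoff for "long" sentences is arbitrary, chosen only to illustrate the tail effect.

```python
import statistics

def length_profile(sentences: list[str]) -> dict[str, float]:
    lengths = [len(s.split()) for s in sentences]
    if not lengths:
        return {"mean": 0.0, "stdev": 0.0, "tail_ratio": 0.0}
    mean = statistics.mean(lengths)
    return {
        "mean": mean,
        # Low variance is the tight clustering we kept seeing in LLM output.
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        # Share of sentences more than twice the mean length: a rough tail measure.
        "tail_ratio": sum(1 for n in lengths if n > 2 * mean) / len(lengths),
    }
```

On the samples described above, human text tends to show a higher stdev and a non-zero tail_ratio, while LLM output pushes both toward zero.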
Why prompts failed to fix this 😐
Prompt instructions like "Vary sentence length" or "Write more naturally" operate at generation time, but they don't constrain local structure.
In practice, prompts affected:
- word choice
- tone
- politeness
They barely affected:
- sentence rhythm
- transition placement
- redundancy density
Worse, prompt changes introduced instability. Small edits caused large global shifts, which made debugging impossible.
From an engineering standpoint, that was a dead end.
Reframing the problem 🔁
We stopped treating LLM output as “final text”.
Instead, we treated it as raw material.
That led to a two-stage pipeline:
- Generation — optimize for clarity and correctness
- Sentence-level rewriting — optimize for distribution and flow
The second stage is what later became the AI Humanizer.
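Structurally, the split is as simple as it sounds. The names and signature below are placeholders; the point is that stage two is a pure function of the draft text, independent of the model call.

```python
from typing import Callable

def two_stage_pipeline(
    prompt: str,
    generate: Callable[[str], str],  # stage 1: any LLM client, judged on clarity and correctness
    rewrite: Callable[[str], str],   # stage 2: sentence-level rewriting, judged on distribution and flow
) -> str:
    draft = generate(prompt)
    # Stage 2 never sees the prompt, only the draft, which keeps it deterministic and testable.
    return rewrite(draft)
```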
What sentence-level rewriting actually does 🧩
This is not paraphrasing everything.
We only touch sentences that trip specific heuristics:
(A rough version of that filter is sketched after the list.)
- length similarity above a threshold
- repeated syntactic openers
- excessive connective phrases
- over-explained subordinate clauses
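Here is roughly what such a filter looks like. The thresholds are placeholders, the comparison is only against the previous sentence, and the subordinate-clause check is omitted because it needs a parser; a real version would look at a window of sentences rather than just adjacent pairs.

```python
def should_rewrite(
    sentence: str,
    prev: str | None,
    transition_words: set[str],
    length_tolerance: int = 3,   # placeholder threshold; tuned per domain in practice
) -> bool:
    words = sentence.lower().split()
    prev_words = prev.lower().split() if prev else []

    # Heuristic 1: this sentence is nearly the same length as the previous one.
    similar_length = bool(prev_words) and abs(len(words) - len(prev_words)) <= length_tolerance
    # Heuristic 2: it reuses the previous sentence's two-word opener.
    same_opener = len(words) >= 2 and words[:2] == prev_words[:2]
    # Heuristic 3: it leans on multiple connective phrases.
    heavy_transitions = sum(w.strip(",.;") in transition_words for w in words) >= 2

    return similar_length or same_opener or heavy_transitions
```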
Rewrites are local:
- split a sentence
- compress another
- delete a transition
- reorder clauses
Semantics stay fixed.
Distribution changes.
That distinction matters.
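Two of those local operations, sketched with naive string handling. The transition list and the ", and " split point are stand-ins; a real implementation would work on parsed clauses rather than raw strings.

```python
import re

# Illustrative list only.
LEADING_TRANSITIONS = re.compile(
    r"^(However|Moreover|Furthermore|Additionally|In addition),\s+", re.IGNORECASE
)

def drop_leading_transition(sentence: str) -> str:
    # Delete a formulaic opener and re-capitalize; the claim itself is untouched.
    stripped, n = LEADING_TRANSITIONS.subn("", sentence, count=1)
    if n == 0 or not stripped:
        return sentence
    return stripped[0].upper() + stripped[1:]

def split_long_sentence(sentence: str, max_words: int = 25) -> list[str]:
    # Break one overly long sentence at ", and " into two shorter ones.
    if len(sentence.split()) <= max_words or ", and " not in sentence:
        return [sentence]
    left, right = sentence.split(", and ", 1)
    return [left.rstrip(".") + ".", right[0].upper() + right[1:]]
```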
Why this works better technically ⚙️
Because it’s measurable.
After rewriting, we can observe:
- increased sentence length variance
- reduced opener repetition
- lower transition density
- more human-like rhythm curves
This makes the system debuggable.
Prompts are opaque.
Post-processing isn’t.
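That observability is the whole argument. Even something as small as a metric diff per rewrite pass makes regressions visible. The helper and the numbers below are invented purely to show the shape of the output, not real measurements.

```python
def report_shift(before: dict[str, float], after: dict[str, float]) -> None:
    # One line per metric, so every rewrite pass leaves an auditable trace.
    for key in sorted(before):
        print(f"{key:>22}: {before[key]:.2f} -> {after.get(key, float('nan')):.2f}")

# Made-up values, only to show the report format.
report_shift(
    {"length_stdev": 2.1, "opener_repetition": 0.42, "transition_density": 0.31},
    {"length_stdev": 5.8, "opener_repetition": 0.11, "transition_density": 0.14},
)
```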
Where the AI Humanizer fits 🧠
This approach eventually became the AI Humanizer inside Dechecker — not as a detector workaround, but as a controllable post-processing layer.
It has clear limits:
- it won’t fix weak arguments
- it can over-flatten voice if pushed too hard
- different domains need different thresholds (see the configuration sketch below)
But unlike prompt tuning, we can see exactly what changed and why.
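On that last point, "different thresholds" is literally just configuration. A hypothetical shape for it; the field names and values are illustrative, not Dechecker's actual settings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RewriteThresholds:
    length_tolerance: int          # how close neighboring sentence lengths may be
    max_opener_repeats: int        # identical openers tolerated per paragraph
    max_transition_density: float  # transitions per sentence before deletions kick in

# Hypothetical presets: the idea is that dense technical prose tolerates more
# uniformity than, say, marketing copy, so its knobs are looser.
PRESETS = {
    "technical": RewriteThresholds(2, 3, 0.35),
    "marketing": RewriteThresholds(4, 1, 0.15),
}
```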
Why this matters beyond detection 👀
Even if detectors didn’t exist, this problem would.
Uniform structure is tiring to read. Humans subconsciously expect irregularity. Sentence-level rewriting restores that irregularity without changing meaning.
From a systems perspective, it’s simply the right abstraction level.
Final takeaway ✅
If LLM-generated text feels unnatural, the issue is rarely what the model says.
It’s how evenly it says it.
Prompts don’t fix distributions.
Rewriting does.
