Building an AI Humanizer: why we stopped trying to fix prompts

This post is about a mistake we made early on: assuming that “unnatural” LLM output could be fixed at the prompt level.

It can’t. At least not reliably.

What finally worked for us was treating LLM text as a signal-processing problem at the sentence level, not a generation problem.

The signal we kept measuring 📊

We started from AI detection work, which forced us to look at text statistically instead of stylistically.

Across different LLMs and prompts, flagged samples shared the same low-level traits:

  • Sentence length variance was abnormally low
  • Clause depth was consistently shallow
  • Discourse markers repeated with high frequency
  • Sentence openers followed predictable templates

None of these are errors.

But together, they form a pattern.

When we plotted sentence length distributions, human-written text had long tails.

LLM text clustered tightly around the mean.

That clustering turned out to be a stronger signal than vocabulary choice.
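To make that concrete, here's a minimal sketch of the kind of measurement involved. The sentence splitter, the marker list, and the metric names are illustrative assumptions, not our production code.

```python
import re
import statistics
from collections import Counter

# Illustrative list of stock discourse markers; ours is longer and tuned per domain.
DISCOURSE_MARKERS = {
    "however", "moreover", "furthermore", "additionally",
    "in conclusion", "overall",
}

def sentence_stats(text: str) -> dict:
    # Naive splitter for illustration; a real pipeline would use a proper segmenter.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    openers = [s.split()[0].lower().strip(",") for s in sentences if s.split()]
    lowered = text.lower()
    marker_hits = sum(lowered.count(m) for m in DISCOURSE_MARKERS)

    return {
        "sentence_count": len(sentences),
        "mean_length": statistics.mean(lengths) if lengths else 0.0,
        # Low variance is the "clustered around the mean" signal.
        "length_variance": statistics.pvariance(lengths) if lengths else 0.0,
        # A few dominant openers indicate templated sentence starts.
        "top_openers": Counter(openers).most_common(3),
        # Stock transitions per sentence, a rough transition-density proxy.
        "marker_density": marker_hits / max(len(sentences), 1),
    }
```

Run it over a few human samples and a few LLM samples and the variance gap shows up immediately.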

Why prompts failed at fixing this 😐

Prompt instructions like "Vary sentence length" or "Write more naturally" operate at generation time, but they don't constrain local structure.

In practice, prompts affected:

  • word choice
  • tone
  • politeness

They barely affected:

  • sentence rhythm
  • transition placement
  • redundancy density

Worse, prompt changes introduced instability. Small edits caused large global shifts, which made debugging impossible.

From an engineering standpoint, that was a dead end.

Reframing the problem 🔁

We stopped treating LLM output as “final text”.

Instead, we treated it as raw material.

That led to a two-stage pipeline:

  1. Generation — optimize for clarity and correctness
  2. Sentence-level rewriting — optimize for distribution and flow

The second stage is what later became the AI Humanizer.
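In code, the split looks roughly like this. The function names (`generate_draft`, `rewrite_sentences`, `produce`) are hypothetical: stage 1 is whatever model call you already make, and stage 2 is a deterministic pass over its output.

```python
def generate_draft(prompt: str) -> str:
    """Stage 1: whatever model call you already have (placeholder)."""
    raise NotImplementedError("plug in your model client here")

def rewrite_sentences(draft: str) -> str:
    """Stage 2: sentence-level rewriting; the heuristics are sketched below."""
    return draft  # identity here; the next section shows what this pass does

def produce(prompt: str) -> str:
    draft = generate_draft(prompt)    # optimized for clarity and correctness
    return rewrite_sentences(draft)   # optimized for distribution and flow
```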

What sentence-level rewriting actually does 🧩

This does not mean paraphrasing everything.

We only touch sentences that trip specific heuristics:

  • length similarity above a threshold
  • repeated syntactic openers
  • excessive connective phrases
  • over-explained subordinate clauses

Rewrites are local:

  • split a sentence
  • compress another
  • delete a transition
  • reorder clauses

Semantics stay fixed.

Distribution changes.

That distinction matters.
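Here's a sketch of what a couple of those local rewrites could look like. The thresholds, the transition list, and the split rule are illustrative; the real pipeline has more rules and tuned thresholds.

```python
import re

# Stock transitions we consider deleting when they pile up (illustrative list).
LEADING_TRANSITIONS = re.compile(
    r"^(However|Moreover|Furthermore|Additionally|In conclusion),\s+",
    re.IGNORECASE,
)

def _opener(sentence: str) -> str | None:
    words = sentence.split()
    return words[0].lower().strip(",") if words else None

def rewrite_sentence(sentence: str, prev_opener: str | None) -> str:
    # Heuristic: a stock transition repeating the previous opener -> delete it.
    if LEADING_TRANSITIONS.match(sentence) and _opener(sentence) == prev_opener:
        sentence = LEADING_TRANSITIONS.sub("", sentence)
        sentence = sentence[:1].upper() + sentence[1:]

    # Heuristic: an overlong conjoined sentence -> split it in two.
    if len(sentence.split()) > 30 and ", and " in sentence:
        left, right = sentence.split(", and ", 1)
        sentence = f"{left}. {right[:1].upper()}{right[1:]}"

    return sentence

def rewrite_flagged(sentences: list[str]) -> list[str]:
    out: list[str] = []
    prev = None
    for s in sentences:
        rewritten = rewrite_sentence(s, prev)
        out.append(rewritten)
        prev = _opener(rewritten)
    return out
```

Because every rule is a pure function over a single sentence (plus its neighbor's opener), each change can be logged, diffed, and reverted in isolation.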

Why this works better technically ⚙️

Because it’s measurable.

After rewriting, we can observe:

  • increased sentence length variance
  • reduced opener repetition
  • lower transition density
  • more human-like rhythm curves

This makes the system debuggable.

Prompts are opaque.

Post-processing isn’t.
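Concretely, "debuggable" means you can write checks against the numbers. A sketch, assuming the `sentence_stats()` metrics from the earlier snippet:

```python
def assert_improved(before: dict, after: dict) -> None:
    # Rewriting should widen the sentence-length distribution...
    assert after["length_variance"] > before["length_variance"], "variance did not increase"
    # ...and thin out stock transitions, not add them.
    assert after["marker_density"] <= before["marker_density"], "transition density went up"
```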

Where the AI Humanizer fits 🧠

This approach eventually became the AI Humanizer inside Dechecker — not as a detector workaround, but as a controllable post-processing layer.

It has clear limits:

  • it won’t fix weak arguments
  • it can over-flatten voice if pushed too hard
  • different domains need different thresholds

But unlike prompt tuning, we can see exactly what changed and why.

Why this matters beyond detection 👀

Even if detectors didn’t exist, this problem would.

Uniform structure is tiring to read. Humans subconsciously expect irregularity. Sentence-level rewriting restores that irregularity without changing meaning.

From a systems perspective, it’s simply the right abstraction level.

Final takeaway ✅

If LLM-generated text feels unnatural, the issue is rarely what the model says.

It’s how evenly it says it.

Prompts don’t fix distributions.

Rewriting does.
