A class of tools — AI-detection services, similarity scanners, prose analyzers — claims to read a piece of writing and tell you the probability it was written by ChatGPT. Some report a number like "78% AI." Others use friendlier framing like "this writing pattern matches LLM output." All of them are doing the same thing: running statistical detectors over the prose itself — vocabulary choices, sentence-length variance, transition phrases, formality, contraction use, paragraph uniformity — and outputting a verdict.
This page walks through why those tools are unreliable when used without a per-student writing baseline, what specific kinds of writing they misfire on, what the research says, and what we do at Autotend instead.
The bias mechanism
A prose-style detector trained on "AI writing vs. human writing" learns the average pattern of each class. The "AI" class is dominated by ChatGPT-style output — formal register, consistent sentence rhythm, balanced paragraph structure, polished vocabulary, low contraction rate, frequent hedging phrases ("it's worth noting…", "in conclusion…"). The "human" class is dominated by typical English-native undergraduate writing — looser register, more sentence-length variance, more colloquialism, more typos, less polish.
Then the detector encounters a real student's submission. The detector's verdict is not "did a human write this?" — it's "does this writing pattern look more like the average AI sample or the average human sample in my training set?"
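To make that framing concrete, here is a minimal sketch of the kind of surface-feature scoring such detectors reduce to. The feature names, thresholds, and phrase list below are illustrative assumptions, not any vendor's actual model; real detectors train these weights, but the shape of the question is the same.

```python
import re
import statistics

# Stock phrases often treated as an "AI" tell (illustrative list)
TRANSITIONS = ("moreover", "furthermore", "in conclusion", "it is worth noting")
CONTRACTION = re.compile(r"\b\w+'(?:s|t|re|ve|ll|d)\b", re.IGNORECASE)

def style_features(text: str) -> dict:
    """Crude surface features of the kind prose-style detectors score."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    lengths = [len(s.split()) for s in sentences]
    return {
        # Few contractions reads as "formal register" and pushes toward the AI class
        "contractions_per_100_words": 100 * len(CONTRACTION.findall(text)) / max(len(words), 1),
        # Uniform sentence lengths (low variance) also push toward the AI class
        "sentence_length_stdev": statistics.pstdev(lengths) if len(lengths) > 1 else 0.0,
        # Stock transition phrases count as yet another "AI" signal
        "transition_hits": sum(text.lower().count(p) for p in TRANSITIONS),
    }

def verdict(features: dict) -> str:
    """Illustrative thresholds only; the point is the framing, not the numbers."""
    hits = sum([
        features["contractions_per_100_words"] < 0.5,  # formal, low-contraction prose
        features["sentence_length_stdev"] < 6.0,       # even sentence rhythm
        features["transition_hits"] >= 2,              # transition/hedging boilerplate
    ])
    return "flagged as AI-like" if hits >= 2 else "passed as human-like"
```

Nothing in that scoring asks who wrote the text. A human essay written in careful, formal, low-contraction prose trips the same thresholds as model output.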
That framing breaks for any writer whose natural writing pattern happens to resemble the AI side of the distribution:
- Non-native English speakers tend to write with a more formal register, more textbook-derived vocabulary, and a lower contraction rate. These are exactly the signals the detector flags as AI.
- Writers coached on academic prose — students who've taken formal writing classes, students from cultures with more formal academic-register expectations — show the same pattern.
- Students writing in their second or third language often produce sentences with more uniform structure as they reach for grammatical correctness rather than rhythmic variation.
- STEM writers trained on dense, hedge-heavy, citation-rich prose look statistically similar to ChatGPT trained on the same kind of source material.
- Students whose first draft was edited heavily (by Grammarly, by a tutor, by themselves) lose the variance that detectors use as "human" signal.
None of those students used AI. All of them get flagged.
What's been documented
Several independent things have been documented about prose-style AI detectors:
- They false-positive at notably higher rates on writing by non-native English speakers than on writing by native speakers, even though both sets of texts are equally human-authored.
- Some prose-style detectors have published or implied false-positive rates well above what would be acceptable for an academic-integrity tool — particularly on shorter texts and on writing in formal academic register.
- At least one major detector built by an AI lab was withdrawn by its creators citing inadequate accuracy.
- Multiple universities — including some that had licensed institutional AI-detection features when those launched — have published guidance to faculty against using those tools as the sole basis for an academic-integrity finding.
- Public legal complaints exist from students claiming wrongful accusation based on AI-detection output.
We're keeping this section conservative on purpose. If you want to read the specific studies and incidents, search "AI detection ESL bias study" or "Turnitin AI detection faculty guidance" — the literature is large enough to find on your own, and we'd rather you read primary sources than take our cherry-picked numbers on trust. The thing that matters for this page is that the bias mechanism is real and replicable, and is the central problem — not a corner case.
Why this isn't a "make a better classifier" problem
You can't train your way out of this with more data. The bias is baked into the framing. The detector is asking "does this writing pattern match the AI-trained statistical signature?" and any human writer whose natural style overlaps with that signature will be flagged.
The only honest fix is to change what the detector is being asked. Instead of "is this AI?", ask: "is this writing pattern different from how this specific student writes?" That comparison — to the student's own prior work — is statistically defensible because every student is their own baseline. A sudden shift away from how this student normally writes is genuine signal. A student writing in their typical style is no longer false-flagged simply because their typical style happens to look formal.
This requires per-student writing history. You need at least a few of the student's prior submissions, ideally from un-pressured contexts (in-class writing, drafts), to build the baseline. Without that data, you cannot do this. With it, prose-style analysis becomes useful — though still not absolute proof of anything.
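As a sketch of what that per-student comparison looks like, assume you already have style features computed for a handful of the student's prior submissions (the feature names and numbers below are hypothetical, carried over from the earlier sketch). The question becomes "how many standard deviations is this submission from this student's own history," not "how close is it to a class average."

```python
import statistics

def baseline_shift(prior_features: list[dict], new_features: dict) -> dict:
    """Z-score of each style feature against THIS student's own prior submissions."""
    shifts = {}
    for name, value in new_features.items():
        history = [f[name] for f in prior_features]
        mean = statistics.mean(history)
        spread = statistics.pstdev(history) or 1e-9  # guard a perfectly flat history
        shifts[name] = (value - mean) / spread
    return shifts

# Hypothetical numbers: a student who always writes formal, even-rhythm prose.
prior = [
    {"contractions_per_100_words": 0.3, "sentence_length_stdev": 5.1},
    {"contractions_per_100_words": 0.4, "sentence_length_stdev": 4.8},
    {"contractions_per_100_words": 0.2, "sentence_length_stdev": 5.5},
]
new_essay = {"contractions_per_100_words": 0.3, "sentence_length_stdev": 5.0}

# Small z-scores: consistent with this student's own baseline, so no flag,
# even though a class-average detector would call the same style "AI-like."
print(baseline_shift(prior, new_essay))
```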
What we do at Autotend Forensics, by default
Autotend Forensics ships structural-only by default. Every signal we surface in the free PWA and the paid Solo/Class tiers has a file-structural basis, not a prose-style judgment:
- Metadata fields: author, edit time, application, revision count, timestamps
- Edit history: revision marks, hidden text, tracked-changes residue
- File-structural anomalies: ZIP shape, embedded font hashes, export-source fingerprints, character-encoding mismatches
- Paste detection: large unbroken text blocks, missing edit-time-per-word, font-signature breaks across paragraph boundaries
- AI-paste timing signals: edit time totaling 4 minutes for 2,000 words, revision count of 1, etc.
These signals work on the file, not the prose. A non-native English speaker submitting their own work has the same file-structural signature as a native speaker submitting their own work. The bias mechanism described above doesn't apply.
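For illustration, here is a minimal sketch of how a few of those fields can be read straight out of a .docx, which is a ZIP of XML parts. The property names (dc:creator, cp:revision, TotalTime, Words) are standard OOXML fields, though not every exporter writes all of them; this is a sketch of the approach, not Autotend's production pipeline.

```python
import zipfile
import xml.etree.ElementTree as ET

CORE_NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}
APP_NS = {"ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"}

def docx_timing_signals(path: str) -> dict:
    """Read author, revision, and edit-time fields from a .docx's docProps parts.
    Purely structural: nothing here reads or judges the prose."""
    with zipfile.ZipFile(path) as z:
        core = ET.fromstring(z.read("docProps/core.xml"))
        app = ET.fromstring(z.read("docProps/app.xml"))

    def field(root, xpath, ns):
        node = root.find(xpath, ns)
        return node.text if node is not None else None

    words = int(field(app, "ep:Words", APP_NS) or 0)
    edit_minutes = int(field(app, "ep:TotalTime", APP_NS) or 0)  # cumulative edit time, minutes
    return {
        "author": field(core, "dc:creator", CORE_NS),
        "last_modified_by": field(core, "cp:lastModifiedBy", CORE_NS),
        "revision": field(core, "cp:revision", CORE_NS),
        "created": field(core, "dcterms:created", CORE_NS),
        "modified": field(core, "dcterms:modified", CORE_NS),
        "application": field(app, "ep:Application", APP_NS),
        "words": words,
        "edit_minutes": edit_minutes,
        # 2,000 words against 4 minutes of recorded edit time is 500 words per
        # minute, far beyond sustained typing -- consistent with a large paste.
        "words_per_edit_minute": words / edit_minutes if edit_minutes else None,
    }
```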
So what do we do?
We don't run prose-style detection at all. We surface signals from the file itself — metadata, edit history, paste patterns, font and encoding signatures, embedded objects, timeline anomalies. A file's structural fingerprint doesn't change with who wrote the prose or in what language they think, so the bias problem never enters the picture. Every signal we surface has a documented methodology, lives on a page like this one, and is presented as evidence for your judgment — not as a verdict.
If we ever bring linguistic analysis back into the product, it will only be after we have the per-student baselines that make it defensible — that's the Autotend LMS roadmap, and we'll be explicit about the limits when we ship it.
The honest TL;DR
- Prose-style AI detection without baselines is statistically biased against non-native English writers and well-coached students. The bias is well-documented.
- We don't run it. Not in the free tier, not in the paid tier, not as an add-on.
- We do surface structural signals — file metadata, edit history, paste patterns — which are unaffected by the bias mechanism.
- If anyone tells you a number like "this is X% AI" based on reading the prose itself, ask them what their false-positive rate is on ESL writers. The honest answer is they don't know, because no one running prose-style detection without baselines does.
We surface signals. The professor judges. The file structure is the most reliable ground we have. That's the product.
See your own document's structural signals at forensics.autotend.io →
Related methodology
- What docx metadata reveals about a document — the structural signals that are defensible
- See the full list at Learn