Methodology · Paste detection

Paste-detection windows in DOCX — what we look for in revisions.xml

Most docx files carry a word/revisions.xml autosave history — a record of incremental changes Word made on the way to the final saved state. The structure of that history says a lot about whether the document was typed into Word over time or pasted into Word in a small number of large blocks. The paste-detection signal in Autotend Forensics reads exactly this.

What's in revisions.xml

revisions.xml exists when Word has performed autosaves during an editing session. Each autosave produces a <rev> entry with:

  • A timestamp.
  • A diff against the prior autosave: which runs of text were added or removed.
  • The author identity associated with that revision (<rsidR> references that resolve through word/settings.xml).

Walked in time order, the file gives you a roughly minute-by-minute view of how the document grew.
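The per-revision fields above can be read into a simple timeline. The snippet below is a minimal sketch: the element and attribute names (`<rev>`, `ts`, `wordsAdded`, `wordsRemoved`) are assumptions based on the description above, not a documented schema, and the sample XML is invented for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical revisions.xml shaped like the description above.
# Element/attribute names are illustrative assumptions, not a real schema.
SAMPLE = """\
<revisions>
  <rev ts="2024-03-01T14:30:00" wordsAdded="0" wordsRemoved="0"/>
  <rev ts="2024-03-01T14:41:00" wordsAdded="1500" wordsRemoved="0"/>
  <rev ts="2024-03-01T15:02:00" wordsAdded="40" wordsRemoved="12"/>
</revisions>
"""

def parse_revisions(xml_text):
    """Return the autosave history as (timestamp, words_added, words_removed),
    sorted so the timeline can be walked in time order."""
    root = ET.fromstring(xml_text)
    timeline = []
    for rev in root.findall("rev"):
        timeline.append((
            datetime.fromisoformat(rev.get("ts")),
            int(rev.get("wordsAdded", "0")),
            int(rev.get("wordsRemoved", "0")),
        ))
    return sorted(timeline)

timeline = parse_revisions(SAMPLE)
```

Everything downstream (burst sizing, gap analysis, severity grading) works off a timeline in this shape.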

Not every docx has a revisions.xml. It's generated when Word's autosave engine has actually run — which means the document existed long enough for autosave to fire (default ~10 minutes between autosaves, can vary). Documents finished and saved in a single sitting may not have it.

What "paste-shaped" looks like

The paste-detection heuristic in the scoring engine looks for:

  1. Large insertion blocks at single timestamps. A <rev> entry that adds 800+ words in one tick is paste-shaped. A <rev> entry that adds 80 words is typing-shaped.
  2. Gaps in the timeline that don't match active editing. If the document grew from 200 words to 1,800 words between 14:32 and 14:34, the human wasn't typing 800 words per minute. Something arrived in bulk.
  3. Single-burst insertion followed by minor edits. The classic pattern: 14:30, document is empty; 14:31, document is 1,500 words; 14:45 through 16:00, small fixes — a comma here, a sentence rewrite there. The trailing minor edits are what distinguish this pattern from "pasted at submission time and never touched again", which has no follow-up edits at all.

The detector measures these as histograms across the revisions timeline, not as binary "yes-paste / no-paste" decisions.

What "typing-shaped" looks like

For comparison: a document typed in Word over an evening typically produces:

  • Many small <rev> entries (50–300 words each).
  • A roughly continuous timestamp pattern over the writing session.
  • A mix of insertions and small deletions, reflecting normal typing errors and revisions.
  • An ins/del ratio that drifts toward more deletions as the writer revises.

The pattern is messy, granular, and uneven — humans take breaks, get stuck, type in bursts. But the bursts are typing-burst-sized (50–300 words), not paste-burst-sized (800+).
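The burst-size bands above (50–300 words typing-shaped, 800+ paste-shaped) suggest a simple per-revision labeler. A sketch, with the gap between the bands treated as ambiguous rather than forced into either class:

```python
def classify_burst(words_added):
    """Label one autosave's insertion size using the burst bands described
    above. Thresholds are taken from the text; the label names are ours."""
    if words_added >= 800:
        return "paste-shaped"
    if words_added <= 300:
        return "typing-shaped"
    return "ambiguous"
```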

What this signal can and can't tell you

Can

  • Distinguish a paste-from-elsewhere workflow from a typed-in-Word workflow.
  • Detect the "single paste at the start, then minor edits" pattern that's common when text was drafted in another tool (Google Docs, an LLM, a notes app) and brought into Word for final formatting.
  • Provide a granular timeline rather than a single number — instructors can see where in the writing process the paste-shaped block landed.

Can't

  • Identify the source of pasted text. The signal says "this arrived in bulk"; it doesn't say "this is from AI" or "this is from another student". The source is a separate question that requires comparison against the student's other work, or content matching against known sources.
  • Distinguish AI-generated text from other-source text. A paste from a friend's paper looks the same as a paste from Google Docs looks the same as a paste from ChatGPT. The signal is "not typed-in-Word", not "is from AI."
  • Tell you whether the student wrote what they pasted. Many legitimate workflows include drafting in another tool and pasting in. The signal is observation, not verdict.

Common false positives

  • Drafted in Google Docs, pasted in at the end. Most common. Especially common for students on Chromebooks, or anyone who routinely works in Docs.
  • Drafted in a notes app (Notion, Obsidian, Apple Notes) and pasted in. Same shape.
  • Copy-edited in Word after dictating from speech-to-text elsewhere. The dictation tool produced the text; Word received a paste.
  • Re-typed from a paper draft. Less common, but a student who hand-wrote on paper and then typed it up in Word can produce a fast, large-burst insertion that looks paste-shaped because they're reading from a finished text.

These are all observable to the student in their workflow. When a paste signal fires, the most useful first question is "where did you draft this?" — not an accusation.

Common false negatives

  • Pasted in over multiple sessions in smaller chunks. A student who pastes a paragraph at a time over an evening doesn't generate a clean single-burst signal.
  • Documents without revisions.xml at all. Roughly 30% of student submissions have no autosave history because they were finished quickly. The paste-detection signal can't fire on missing data.
  • Documents where Word's autosave was disabled or never triggered. Some institutional Word installs have autosave off by default.

The signal is most reliable when revisions.xml exists and the writing session spans 20+ minutes.

How Autotend Forensics scores this

The paste detector emits two graded observations:

  • Low severity: large single-block insertion observed, but follow-up edits present.
  • Medium severity: large single-block insertion followed by no meaningful incremental edits.

These feed into the per-paper rolling baseline along with the other edit-history signals. The scoring engine never escalates paste-detection alone to a "high-severity" verdict — paste-shaped insertion is too common in legitimate workflows for that.
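The low/medium grading could be sketched like this, again over the `(timestamp, added, removed)` timeline. The 800-word threshold echoes the burst-size band described earlier; it is illustrative, not the engine's actual value.

```python
def grade_paste_signal(timeline, big=800):
    """Return 'low' if a large single-block insertion is followed by further
    incremental edits, 'medium' if it is not, None if no large block exists.
    This mirrors the two graded observations above; thresholds are assumed."""
    big_idx = [i for i, (_, added, _) in enumerate(timeline) if added >= big]
    if not big_idx:
        return None  # nothing paste-shaped; no observation emitted
    after = timeline[big_idx[-1] + 1:]
    edited_after = any(added + removed > 0 for _, added, removed in after)
    return "low" if edited_after else "medium"
```

Note the asymmetry: the presence of follow-up edits lowers severity, matching the document's point that legitimate paste-then-polish workflows are common.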

When this signal is most useful

In combination with other signals:

  • Paste-shape + a Word TotalTime of ~1 minute = a strong "the writing happened elsewhere" pattern.
  • Paste-shape + lastModifiedBy matches the student = the student themselves did the paste (consistent with Google Docs export, less consistent with handing off to a third party).
  • Paste-shape + LMS-side timestamp showing the student opened the submission interface for ~3 minutes = consistent with copy-paste-from-elsewhere workflow.

The full set of edit-history signals is detailed in the edit-history methodology page. The paste-detection signal is one piece of that broader picture.