Two essays come in for the same assignment. Visible content is different — the prose doesn't match, the arguments don't match, the citations don't match. But when you look at the file metadata, certain fields are identical: same creator name, same Application string, same revision-save ID. That's a fingerprint match, and it tells you something specific about how those two files came together.
This page walks through the kinds of metadata fingerprints that can match across submissions, what each kind means, and how to respond.
Why metadata fingerprints exist at all
A .docx file carries dozens of fields beyond the visible content. Most of them get filled in automatically as the user works:
- The OS user account writes its name into `creator`.
- Word writes its version string into `Application`.
- Word generates a random ID called an `rsid` (revision-save ID) on first save and embeds it in every paragraph touched in that session.
- The local style definitions get copied from `Normal.dotm` (the global template).
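Under standard OOXML part paths, all of these fields can be pulled out of the `.docx` zip container with the Python standard library alone. A minimal sketch (the helper names are mine; the part paths assume an ordinary Word-produced file):

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard OOXML namespace URIs for the three parts we read.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
}

def fingerprint_from_xml(core_xml: str, app_xml: str, settings_xml: str) -> dict:
    """Pull the fingerprint fields out of the three relevant XML parts."""
    core = ET.fromstring(core_xml)
    app = ET.fromstring(app_xml)
    settings = ET.fromstring(settings_xml)
    # <dc:creator> in docProps/core.xml; <Application>/<Template> in docProps/app.xml
    creator = core.findtext("dc:creator", default="", namespaces=NS)
    application = app.findtext("ep:Application", default="", namespaces=NS)
    template = app.findtext("ep:Template", default="", namespaces=NS)
    # The document's full rsid list lives in word/settings.xml as <w:rsid w:val="..."/>
    rsids = {el.get(f"{{{NS['w']}}}val") for el in settings.findall(".//w:rsid", NS)}
    return {"creator": creator, "application": application,
            "template": template, "rsids": rsids}

def fingerprint_from_docx(path: str) -> dict:
    """Read the parts straight out of the zip container."""
    with zipfile.ZipFile(path) as z:
        return fingerprint_from_xml(
            z.read("docProps/core.xml").decode("utf-8"),
            z.read("docProps/app.xml").decode("utf-8"),
            z.read("word/settings.xml").decode("utf-8"),
        )
```

Splitting the parsing from the zip reading makes the extraction testable against synthetic XML without a real file on disk.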
Most of these fields are stable across a single student's submissions — same OS account, same Word version, same global template. So one student's papers carry consistent fingerprints across their whole semester.
When two students' papers carry an identical fingerprint, it means their files share an origin point.
Four kinds of shared fingerprints
1. Shared creator field
The easiest to spot: two submissions where `<dc:creator>Robert Smith</dc:creator>` is identical. Plausible explanations:
- Shared workstation. Roommates, family, library computer. Common at certain institutions and impossible at others. Ask the students about their workstation setup.
- One student wrote both files. A friend wrote one essay for another student and didn't change the metadata. The visible content can be tweaked enough to avoid prose-style detection while leaving the metadata unchanged.
- Both files passed through the same tutor. Less suspicious but still worth knowing.
- Template handoff. Both students started from a template the instructor provided. `creator` reflects the instructor in that case.
Action: ask each student where they wrote the file. The truth usually comes out in the answer.
2. Shared rsid (revision-save ID)
This is the smoking-gun fingerprint. Word generates a random ID at the first save of each editing session. Every paragraph touched during that session is tagged with that ID in `word/document.xml`, and the document's full list of IDs is stored in `word/settings.xml`.
When two documents share an rsid, they were edited in the same Word session. There is essentially no innocent way for this to happen across two independently-authored student papers.
The forensics scoring engine reads `settings.xml` and `document.xml` and surfaces rsid overlap as a structural signal. If the report says "this submission shares 12 of 14 rsids with another submission in this class," that's evidence of shared editing: both papers were worked on in the same Word session.
Plausible explanations (very narrow):
- The students were genuinely collaborating with the instructor's permission (group project, peer-review session).
- One student "helped" the other by typing in their paper for them — same file on the same machine, just saved twice with different content.
Anything beyond those is hard to defend.
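Given each submission's rsid set (as read from `word/settings.xml`), the pairwise comparison is a few lines of set arithmetic. A sketch, with illustrative names and a threshold of my own choosing:

```python
from itertools import combinations

def rsid_matches(docs: dict[str, set[str]], min_shared: int = 2) -> list:
    """Return every pair of submissions sharing at least min_shared rsids.

    docs maps a submission name to its rsid set. A single coincidental
    collision on a random ID is astronomically unlikely, but the
    min_shared=2 default filters noise from malformed files anyway.
    """
    hits = []
    for (a, ra), (b, rb) in combinations(docs.items(), 2):
        shared = ra & rb
        if len(shared) >= min_shared:
            hits.append((a, b, sorted(shared)))
    return hits
```

Run against a whole class's submissions, this surfaces exactly the "shares 12 of 14 rsids" observation described above.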
3. Shared font subset prefix (PDF)
Covered in how to scan a PDF for tampering. The one-line version: when two PDFs share an identical embedded font subset prefix, their file resources are not independent; they came from the same PDF generation event or from a shared template.
For .docx, the analogous signal is shared style definitions in `styles.xml`. Word's default styles are identical within a Word version, so two random Word docs will share most style definitions. But custom styles (a paragraph style someone created and named) propagate when files are copied. Two student papers carrying identical custom styles named "MyHeading1" and "MyHeading2" suggest a shared origin template.
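One way to isolate user-created styles is the `w:customStyle` attribute, which Word sets on styles a user defined (built-in styles lack it). A sketch under that assumption:

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def custom_styles(styles_xml: str) -> set[str]:
    """Names of user-defined styles in word/styles.xml.

    Filters on w:customStyle="1"; built-in styles that every Word
    document carries are excluded, so what remains is exactly the
    material that travels with a copied or shared file.
    """
    names = set()
    root = ET.fromstring(styles_xml)
    for style in root.findall(f"{{{W}}}style"):
        if style.get(f"{{{W}}}customStyle") == "1":
            name = style.find(f"{{{W}}}name")
            if name is not None:
                names.add(name.get(f"{{{W}}}val"))
    return names
```

Intersecting two submissions' `custom_styles()` sets then gives the shared-template signal directly.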
4. Shared Template field
In `app.xml`, the `<Template>` element names the template the document was based on. The default is `Normal.dotm` (boring; it tells you nothing). A non-default template name (e.g., `Class-Term-Paper-v3.dotm`) shared across two submissions means both students worked from the same custom template.
Could be: an instructor-provided template (legitimate). Could be: a paper-mill template (less legitimate).
How the rolling baseline reads these
The forensics scoring engine maintains a per-paper rolling baseline — for each assignment, it compares each submission against the others to surface outliers. Fingerprint sharing flows through this:
- Two submissions sharing `creator` + `Application` + `Template` is a low-severity signal (consistent with "everyone used the instructor's template on their personal laptops"; could also be a shared workstation).
- Two submissions sharing rsids is a medium- or high-severity signal (genuinely rare without shared editing).
- Three or more submissions sharing rsids is high-severity (paper-mill / class-collusion pattern).
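The tiering above can be sketched as a small classifier. The cutoffs here are illustrative, not the engine's actual values:

```python
def fingerprint_severity(kind: str, n_submissions: int, shared_rsids: int = 0) -> str:
    """Map a fingerprint match to the severity tiers described above."""
    # Non-rsid fingerprints (creator, Application, Template, custom styles)
    # stay low-severity: they are consistent with innocent shared setups.
    if kind != "rsid":
        return "low"
    # Three or more submissions sharing rsids is the collusion pattern.
    if n_submissions >= 3:
        return "high"
    # For a pair, escalate with the number of shared rsids
    # (the cutoff of 5 is an assumption for illustration).
    return "high" if shared_rsids >= 5 else "medium"
```

The point of keeping it this simple is the next line of the text: the engine classifies, it does not accuse.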
The scoring engine doesn't accuse; it surfaces. The instructor still does the analysis.
What to do when you see a fingerprint match
- Identify which kind of fingerprint matches. Creator-only, Application-only, and rsid overlap are three very different stories.
- Ask each student separately, not together. Independent answers are more informative than a joint conversation where they coordinate.
- If they had a legitimate reason (group project, lab partners, study group sharing a template), they'll mention it unprompted. If they didn't, the story usually doesn't hold together.
- Document the structural evidence. Screenshot the rsid overlap or shared creator field. This is concrete evidence that's much easier to defend in an integrity hearing than "the prose felt off."
What this can't tell you
- Who actually wrote the prose. A shared rsid says "edited in the same session"; it doesn't say "the same person wrote both papers." Maybe Student A typed Student B's outline into the file. Maybe Student B dictated and Student A typed. The fingerprint surfaces the workflow, not the authorship.
- What's missing from the comparison set. A class of 30 students where 2 share rsids is suspicious; a class of 200 where 4 share rsids is suspicious if those 4 are all in the same TA section but less so if they're scattered.
- Whether it was intentional. Students sometimes share files innocently — "I forgot how to format this, can I see yours?" Then they type over it. Fingerprints remain. Worth asking about, not assuming.
What Autotend Forensics surfaces
The structural-signals scoring layer reads:
- `creator` + `lastModifiedBy` overlap across submissions (low-severity).
- `rsid` overlap (medium- or high-severity depending on count and proportion).
- Custom-style and Template-name overlap (low-severity).
- Embedded resource overlap in PDF (medium-severity).
All as observations. The instructor still does the work of asking why.
For the full structural-signal methodology, see structural signals.