Methodology · Metadata

What docx metadata reveals about a document

A .docx file is a ZIP archive. Unzip one and you'll find a handful of XML files — document.xml for the visible text, plus a set of metadata files that Word, Google Docs, LibreOffice, and Pages quietly maintain as you work. Most people never look at them. Document forensics starts here.

This page walks through which fields exist, which ones are useful, which ones are commonly misread, and what they genuinely can't tell you. Read it through before drawing conclusions about a single document.

What's actually in a docx

Inside the archive, the relevant files are:

  • docProps/core.xml — Dublin Core metadata. Title, subject, creator, lastModifiedBy, created, modified, revision, lastPrinted, contentStatus.
  • docProps/app.xml — application-specific stats. Application name and version, total editing time in minutes, words/characters/lines/paragraphs, template, manager, company.
  • docProps/custom.xml — any custom-tagged fields the author or template defined.
  • word/document.xml — the visible content, with revision marks (<w:ins>, <w:del>) and comments if those were tracked.
  • word/comments.xml, word/people.xml — comments and the people who made them, by display name.
  • word/settings.xml — track-changes mode, revision-save settings, document protection state.

Most of those files are produced automatically by the editing application. You don't have to opt in; they get written every time you save.
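
If you want to poke at this yourself, Python's standard library is enough to open the archive and list those parts. A minimal sketch; the filename essay.docx is a placeholder for whatever file you're inspecting:

    import zipfile

    path = "essay.docx"  # placeholder filename

    with zipfile.ZipFile(path) as zf:
        # Every part of the package is just a file inside the ZIP.
        for name in zf.namelist():
            if name.startswith("docProps/") or name.startswith("word/"):
                print(name)
        # The metadata parts are plain XML; read one raw to see it.
        print(zf.read("docProps/core.xml").decode("utf-8"))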

The fields that actually mean something

creator and lastModifiedBy

The display name (and sometimes email, depending on the application) of whoever first created the document and whoever last saved it. If these don't match the person who submitted the file, that's a signal — but not a verdict. Reasons they might legitimately differ: a student worked from a parent's account on a shared computer; the file was started from a template another student shared; the student emailed it from a different machine that was logged in differently; the file was edited inside a Google Drive owned by someone else.
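
Here's a small sketch of reading those two fields straight out of docProps/core.xml with the standard library. The namespace URIs are fixed by the OOXML spec; the filename is a placeholder:

    import zipfile
    import xml.etree.ElementTree as ET

    NS = {
        "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
        "dc": "http://purl.org/dc/elements/1.1/",
    }

    with zipfile.ZipFile("essay.docx") as zf:  # placeholder filename
        core = ET.fromstring(zf.read("docProps/core.xml"))

    creator = core.findtext("dc:creator", default="", namespaces=NS)
    last_saved_by = core.findtext("cp:lastModifiedBy", default="", namespaces=NS)

    print("creator:       ", creator or "(absent)")
    print("lastModifiedBy:", last_saved_by or "(absent)")
    if creator and last_saved_by and creator != last_saved_by:
        print("names differ -- a signal worth asking about, not a verdict")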

created and modified

Timestamps written by the application. Compare these against when the assignment was given. A file created before the assignment was announced is unusual; a file modified only minutes before submission with hours of totalTime is common — that's just someone editing right up to the deadline.
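
A sketch of that comparison, assuming you supply the announcement date yourself (the date below is a placeholder). The dcterms values are normally ISO 8601 timestamps in UTC:

    import zipfile
    import xml.etree.ElementTree as ET
    from datetime import datetime, timezone

    NS = {"dcterms": "http://purl.org/dc/terms/"}

    with zipfile.ZipFile("essay.docx") as zf:  # placeholder filename
        core = ET.fromstring(zf.read("docProps/core.xml"))

    raw = core.findtext("dcterms:created", default="", namespaces=NS)
    if raw:
        created = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        assignment_announced = datetime(2024, 3, 1, tzinfo=timezone.utc)  # placeholder
        if created < assignment_announced:
            print("created before the assignment was announced -- unusual, worth a question")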

totalTime (in app.xml)

The application's count of minutes the file was actively being edited. This is approximate, and the exact behavior varies by application: some stop the counter when the window loses focus, others keep counting for as long as the document sits open. A 2,000-word essay with 6 minutes of total edit time and zero revisions is a strong signal of paste-from-elsewhere. The same essay with 4 hours of edit time and 80 revisions is a strong signal of normal authorship.

The pairing of totalTime, word count, and revision count tells you more than any single field. A high word count with low edit time and few revisions is the classic copy-paste pattern.
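
A sketch of that pairing as a crude ratio. The thresholds here are illustrative only, not the calibrated values a real scorer would use:

    import zipfile
    import xml.etree.ElementTree as ET

    EP = {"ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"}
    CP = {"cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"}

    with zipfile.ZipFile("essay.docx") as zf:  # placeholder filename
        app = ET.fromstring(zf.read("docProps/app.xml"))
        core = ET.fromstring(zf.read("docProps/core.xml"))

    minutes = int(app.findtext("ep:TotalTime", default="0", namespaces=EP) or 0)
    words = int(app.findtext("ep:Words", default="0", namespaces=EP) or 0)
    revisions = int(core.findtext("cp:revision", default="0", namespaces=CP) or 0)

    rate = words / minutes if minutes else float("inf")
    print(f"{words} words, {minutes} min editing, {revisions} revisions ({rate:.0f} words/min)")
    if words > 500 and minutes < 15 and revisions <= 2:  # illustrative thresholds only
        print("high word count, low edit time, few revisions: the copy-paste pattern")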

revision

A counter that the application bumps each time you save. Most editors save every few minutes, so a long essay with a revision count of 1 is unusual.

application and appVersion

Which program wrote the file: Microsoft Office Word, LibreOffice/24.2, Google Docs (exported via Word), Pages-N, WPS Office. Don't read too much into this on its own — students use whatever they have — but a sudden change in application across a student's submissions can be worth noting, and unusual values (WPS Office, Polaris Office, generic XML exporters) sometimes accompany documents that didn't originate where they were claimed to.

template

Normal.dotm is the name of Word's default template, and it's what this field reads for most documents started from a blank Word document. If the field shows something else — particularly a URL or a corporate template name — the file may have come from somewhere other than the student's own Word install.
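
Application, AppVersion, and Template all live in docProps/app.xml, so one small sketch covers this field and the previous one. Fields an exporter never wrote simply come back absent:

    import zipfile
    import xml.etree.ElementTree as ET

    NS = {"ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"}

    with zipfile.ZipFile("essay.docx") as zf:  # placeholder filename
        app = ET.fromstring(zf.read("docProps/app.xml"))

    for field in ("Application", "AppVersion", "Template"):
        value = app.findtext(f"ep:{field}", default="", namespaces=NS)
        print(f"{field}: {value or '(absent)'}")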

Comments (word/comments.xml)

Comments are usually visible in the document, but not always. Some applications hide resolved comments, accepted suggestions, or sidebar discussion that never got cleaned up. The XML keeps them.
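
A sketch of listing whatever word/comments.xml still holds, visible or not. Not every document has that part, so the code checks first:

    import zipfile
    import xml.etree.ElementTree as ET

    W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    with zipfile.ZipFile("essay.docx") as zf:  # placeholder filename
        if "word/comments.xml" not in zf.namelist():
            print("no comments part in this file")
        else:
            root = ET.fromstring(zf.read("word/comments.xml"))
            for c in root.findall(f"{{{W}}}comment"):
                author = c.get(f"{{{W}}}author", "?")
                date = c.get(f"{{{W}}}date", "?")
                text = "".join(t.text or "" for t in c.iter(f"{{{W}}}t"))
                print(f"[{date}] {author}: {text}")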

Track-changes residue (<w:ins>, <w:del>)

If track-changes was ever turned on, the deletion history can persist in the underlying XML even after "accept all" is clicked, depending on how the file was saved. A document with no visible revision marks but with <w:ins> and <w:del> deep in the XML often had its history scrubbed imperfectly.
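
A sketch of checking for that residue by counting w:ins and w:del elements in word/document.xml:

    import zipfile
    import xml.etree.ElementTree as ET

    W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    with zipfile.ZipFile("essay.docx") as zf:  # placeholder filename
        body = ET.fromstring(zf.read("word/document.xml"))

    n_ins = sum(1 for _ in body.iter(f"{{{W}}}ins"))
    n_del = sum(1 for _ in body.iter(f"{{{W}}}del"))

    if n_ins or n_del:
        print(f"tracked-change residue: {n_ins} insertions, {n_del} deletions still in the XML")
    else:
        print("no w:ins / w:del elements in word/document.xml")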

What metadata can NOT tell you

This is the section that matters most.

  • It can't prove authorship. Knowing the creator is "John Smith" tells you whose copy of Word saved the file. It doesn't tell you who wrote the text.
  • It can't prove AI assistance. Metadata can show patterns consistent with paste-from-elsewhere (low edit time, no revisions, missing typing tempo). That elsewhere might be ChatGPT — or it might be a Google Doc the student wrote, exported to docx, and submitted. Both look identical at this layer.
  • It can't prove copying. Two students with similar timestamps means they submitted around the same time, nothing more.
  • It can be edited. A determined student can rewrite metadata before submission. Most don't, but the absence of suspicious metadata isn't proof of innocence either.

Treat every signal as evidence for judgment, not as a verdict. The metadata is a starting point for a conversation, not a court filing.

How to inspect this yourself

You don't need a forensics tool to look at docx metadata — you can unzip the file and read the XML directly. But if you'd rather just drop the file into a browser and see everything laid out:

Try it free at forensics.autotend.io →

Drop any .docx, .pdf, .xlsx, .pptx, .odt, .rtf, or .pages file. The tool parses everything described on this page — plus paste detection, font and encoding anomalies, hidden tracked changes, and a calibrated AI-suspicion score that's transparent about what it looks at and what it can't see. Nothing leaves your browser; nothing is stored.

What Autotend Forensics looks at, specifically

Every detection in our scorer is documented on a methodology page like this one. We do not assert authorship; we surface signals. The full list of detection signals — what each one looks at, what it can detect, what it commonly misreads — is at Learn.

What we deliberately don't do

We don't run prose-style judgments — vocabulary scoring, sentence-complexity heuristics, "this reads like AI" verdicts. That kind of analysis can be done responsibly only when you have a baseline of the student's own prior writing to compare against. Without that baseline, prose-style detectors trip false-positives on non-native English speakers, formal-register writers, and students who've been coached on academic prose. The tools that do this without baselines — and there are several — have been sued and reputationally damaged for it.

We're not in that business. Every signal Autotend Forensics surfaces has a file-structural basis — metadata, edit history, paste patterns, font and encoding signatures, embedded objects, timeline anomalies. The file structure is unaffected by who's writing or what language they think in; the bias mechanism that breaks prose-style detection doesn't apply here. If you want a deeper dive into why we made this call, see Why prose-style AI detection is biased.

If you teach a class and want history, bulk scanning, exportable integrity-report PDFs, or share-link controls for sending evidence to an academic-integrity committee, those are part of the paid forensics tier. The free single-document tool stays free forever, for anyone, with no signup.