
PDF forensics

PDF documents carry fewer authoring signals than DOCX, but remain useful for production-tool and timestamp signals.

PDF forensics: fewer signals, still useful

PDF is the most common submission format after DOCX, but it carries substantially less authorship metadata than the OOXML / ODT formats. PDF was designed for fixed-layout distribution; preserving every editor field a Word document records isn't part of the spec.

That said, PDFs still expose enough to be useful:

  • Document Info dictionary — Title, Author, Subject, Keywords, Creator (the originating app), Producer (the PDF generator), CreationDate, ModDate.
  • XMP metadata — a richer XML-based metadata block when present; can include the full edit history if the PDF was produced from a Word/Pages source with metadata preservation enabled.
  • Body text extraction — pdf.js extracts the page text; Autotend Forensics runs the linguistic signal detectors on that.
  • Embedded streams — fonts, images, and (sometimes) source-document residue.
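
As a rough illustration of the first item, Info dictionary entries can often be read straight from the file bytes. The sketch below is a simplified stand-in, not a description of how Autotend Forensics reads them. `read_info_fields` is a hypothetical helper, and it assumes an unencrypted PDF whose Info dictionary is stored uncompressed with literal (parenthesised) strings. It scans the whole file, so a `/Title` in a bookmark entry can also match. A full PDF parser avoids these limits.

```python
import re

INFO_KEYS = ("Title", "Author", "Subject", "Keywords",
             "Creator", "Producer", "CreationDate", "ModDate")

# Matches "/Key (literal string)". Escaped parentheses are allowed inside the
# string; nested unescaped parentheses and hex strings (<FEFF...>) are not handled.
_LITERAL = re.compile(
    rb"/(" + "|".join(INFO_KEYS).encode("ascii") + rb")\s*\(((?:\\.|[^\\()])*)\)",
    re.DOTALL,
)


def _decode(raw: bytes) -> str:
    # Undo only the escapes for "(", ")" and "\"; octal and \n-style escapes are left as-is.
    unescaped = re.sub(rb"\\([()\\])", rb"\1", raw)
    if unescaped.startswith(b"\xfe\xff"):  # UTF-16BE byte-order mark
        return unescaped[2:].decode("utf-16-be", errors="replace")
    # Assumption: Latin-1 as an approximation of PDFDocEncoding.
    return unescaped.decode("latin-1")


def read_info_fields(pdf_bytes: bytes) -> dict[str, str]:
    """Return the Info-style keys found in the raw bytes (last occurrence wins)."""
    fields: dict[str, str] = {}
    for key, raw in _LITERAL.findall(pdf_bytes):
        fields[key.decode("ascii")] = _decode(raw)
    return fields
```

Letting the last occurrence win is a deliberate choice: incremental saves append updated objects to the end of the file, so later values are usually the current ones.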

Signals available on PDF

  • Creator + Producer combination. Real authoring tools (Word, Pages, Google Docs export, LaTeX, InDesign) have characteristic Creator/Producer pairs. AI-content-export tools (ChatGPT's "download as PDF" path) leave distinctive fingerprints in both fields.
  • Linguistic signals. All the AI-assisted-writing detectors run against the extracted body text.
  • Structural signals. PDF compression patterns, stream-encoding choices, and font-embedding behavior identify the source tool independently of what Creator claims.
  • Timeline anomalies. PDFs produced via "Print to PDF" lose the original creation date, because CreationDate is set to the time of printing; Autotend Forensics flags PDFs whose CreationDate is within minutes of submission.
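
The timeline check in the last item can be sketched as follows. PDF dates use the form `D:YYYYMMDDHHmmSS` followed by `Z` or an offset such as `+01'00'`, with everything after the year optional. The function names, the 10-minute default window, and the choice to treat a missing offset as UTC are assumptions made for illustration; the threshold Autotend Forensics actually uses is not stated here.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS then "Z" or +HH'mm' / -HH'mm'; fields after the year are optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date string into a timezone-aware datetime."""
    m = _PDF_DATE.match(value.strip())
    if m is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    year, month, day, hour, minute, second, _z, sign, off_h, off_m = m.groups()
    offset = timedelta(0)  # assumption: no offset given means UTC
    if sign:
        offset = timedelta(hours=int(off_h), minutes=int(off_m or 0))
        if sign == "-":
            offset = -offset
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=timezone(offset),
    )


def created_near_submission(creation_date: str, submitted_at: datetime,
                            window_minutes: int = 10) -> bool:
    """True when CreationDate falls within the window around the submission time.

    submitted_at must be timezone-aware. The absolute gap is used so that small
    clock skew (a CreationDate slightly after submission) is still caught.
    """
    created = parse_pdf_date(creation_date)
    return abs(submitted_at - created) <= timedelta(minutes=window_minutes)
```

A hit from this check is a prompt for a closer look rather than a finding on its own, since any freshly exported or printed PDF will trigger it.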

Signals not available on PDF

PDF lacks, or carries only weakly:

  • Edit time / total time. Most PDF producers don't record this.
  • Revision count. Likewise, most PDF producers don't record it.
  • Tracked changes. Almost never preserved in PDF output.
  • Comment threads. Sometimes preserved as PDF annotations, but not in the structured form Word uses.
  • Font fallback chains. Fonts are usually subset-embedded; fallback chains are lost.
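
Subset embedding, mentioned in the last item, is visible in the font names themselves: a subset font's PostScript name is prefixed with a tag of six uppercase letters and a plus sign. The sketch below lists such fonts from the raw bytes. `subset_fonts` is a hypothetical helper, and it assumes the font dictionaries are not stored inside compressed object streams, which many newer PDFs use.

```python
import re

# "/BaseFont /ABCDEF+Calibri-Bold": six-letter subset tag, "+", then the font name.
_SUBSET_FONT = re.compile(
    rb"/(?:BaseFont|FontName)\s*/([A-Z]{6})\+([^\s/<>\[\]()]+)"
)


def subset_fonts(pdf_bytes: bytes) -> set[str]:
    """Return the names of fonts that carry a subset tag, with the tag removed."""
    return {name.decode("latin-1") for _tag, name in _SUBSET_FONT.findall(pdf_bytes)}
```

Only the fonts actually used survive in the file, which is why the fallback chain a DOCX would declare cannot be recovered from this list.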

If a student submits a PDF when DOCX would have been an option, that's a relevant procedural observation — you have less visibility into authorship.

Common false-positive paths

  • LaTeX-authored PDFs have a distinctive Producer string (xdvipdfmx, pdfTeX, etc.) that may look unusual but is perfectly legitimate.
  • OS-level "Print to PDF" strips most useful metadata, so a PDF produced this way can resemble an AI-export PDF on some signals.
  • Mobile PDF exports (iOS Files, Android print) have characteristic fingerprints that aren't suspicious.
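
One way to keep these cases from being escalated is to label known-benign Producer strings before scoring. The sketch below is a minimal version of that idea. The substring table is illustrative only, limited to a few widely seen producers, and is not the list Autotend Forensics uses; `label_producer` is a hypothetical helper.

```python
from typing import Optional

# Illustrative substrings of Producer values that are legitimate but may look unusual.
BENIGN_PRODUCER_HINTS = {
    "pdftex": "LaTeX (pdfTeX)",
    "xdvipdfmx": "LaTeX (xdvipdfmx)",
    "quartz pdfcontext": "Apple Quartz (macOS/iOS print or export)",
    "microsoft: print to pdf": "Windows Print to PDF",
}


def label_producer(producer: str) -> Optional[str]:
    """Return a benign-source label for a Producer string, or None if unrecognised."""
    lowered = producer.casefold()
    for hint, label in BENIGN_PRODUCER_HINTS.items():
        if hint in lowered:
            return label
    return None
```

An unrecognised Producer returns `None` rather than a "suspicious" label: absence from a short allowlist says nothing about how the document was written.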

What to expect

A typical authored PDF scan surfaces:

  • 5–10 metadata fields, vs. ~25 on the equivalent DOCX.
  • Creator/Producer pair that maps to a known tool.
  • Linguistic signals scored on the extracted body text.
  • A structural fingerprint that corroborates (or contradicts) the Creator claim.

For high-stakes review, requesting the original DOCX (or authoring-tool source) gives you a substantially richer signal than the PDF alone.
