PDF documents — fewer authoring signals than DOCX, but still useful for production-tool and edit-time signals.
PDF forensics: fewer signals, still useful
PDF is the most common submission format after DOCX, but it carries substantially less authorship metadata than the OOXML / ODT formats. PDF was designed for fixed-layout distribution; preserving every editor field a Word document records isn't part of the spec.
That said, PDFs still expose enough to be useful:
- Document Info dictionary — Title, Author, Subject, Keywords, Creator (the originating app), Producer (the PDF generator), CreationDate, ModDate.
- XMP metadata — a richer XML-based metadata block when present; can include the full edit history if the PDF was produced from a Word/Pages source with metadata preservation enabled.
- Body text extraction — pdf.js extracts the page text; Autotend Forensics runs the linguistic signal detectors on that.
- Embedded streams — fonts, images, and (sometimes) source-document residue.
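As a rough illustration of the XMP item above, the sketch below scans raw PDF bytes for an XMP packet, which is delimited by `<?xpacket begin=...?>` and `<?xpacket end=...?>` markers. It assumes the metadata stream is stored uncompressed and that the field of interest is written as a simple element rather than an attribute. The helper names `find_xmp_packet` and `xmp_field` and the sample bytes are hypothetical, and this is not how Autotend Forensics or pdf.js necessarily reads metadata.

```python
import re
from typing import Optional

# XMP packets are wrapped in <?xpacket begin=...?> ... <?xpacket end=...?> markers.
_XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def find_xmp_packet(pdf_bytes: bytes) -> Optional[str]:
    """Return the body of the first uncompressed XMP packet, or None."""
    match = _XPACKET.search(pdf_bytes)
    if match is None:
        return None
    return match.group(1).decode("utf-8", errors="replace").strip()


def xmp_field(xmp: str, tag: str) -> Optional[str]:
    """Return the text of a simple element such as xmp:CreatorTool, if present."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    m = re.search(pattern, xmp, re.DOTALL)
    return m.group(1).strip() if m else None


# Hypothetical, heavily trimmed sample; a real PDF has objects around the packet.
sample_pdf = (
    b"%PDF-1.7\n"
    b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">\n'
    b"<xmp:CreatorTool>ExampleWriter 1.0</xmp:CreatorTool>\n"
    b"</x:xmpmeta>\n"
    b'<?xpacket end="w"?>\n'
    b"%%EOF\n"
)

packet = find_xmp_packet(sample_pdf)
creator_tool = xmp_field(packet, "xmp:CreatorTool") if packet else None
```

A byte scan like this misses packets inside compressed object streams, so a full PDF parser is the more reliable route when one is available.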
Signals available on PDF
- Creator + Producer combination. Real authoring tools (Word, Pages, Google Docs export, LaTeX, InDesign) have characteristic Creator/Producer pairs. AI-content-export tools (ChatGPT's "download as PDF" path) leave distinctive fingerprints in both fields.
- Linguistic signals. All the AI-assisted-writing detectors run against the extracted body text.
- Structural signals. PDF compression patterns, stream-encoding choices, and font-embedding behavior identify the source tool independently of what Creator claims.
- Timeline anomalies. PDFs produced via "Print to PDF" lose the original creation date, so CreationDate reflects when the file was printed rather than when the document was written; Autotend Forensics flags PDFs whose CreationDate is within minutes of submission.
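The timeline check can be sketched as follows. PDF date strings use the form `D:YYYYMMDDHHmmSSOHH'mm'`, with everything after the year optional. The 10-minute window, the function names, and the choice to treat a missing offset as UTC are assumptions made for illustration; the threshold Autotend Forensics actually uses is not stated here.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYYMMDDHHmmSS followed by Z, or by +HH'mm' / -HH'mm'. Fields after the year are optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz])|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Parse a PDF date string into a timezone-aware datetime, or return None."""
    m = _PDF_DATE.match(value.strip())
    if m is None:
        return None
    # Missing fields fall back to month 1, day 1, and zero for the time parts.
    year, month, day, hour, minute, second = (
        int(g) if g else default
        for g, default in zip(m.groups()[:6], (0, 1, 1, 0, 0, 0))
    )
    if m.group(8):
        sign = 1 if m.group(8) == "+" else -1
        offset = sign * timedelta(hours=int(m.group(9)), minutes=int(m.group(10) or 0))
    else:
        offset = timedelta(0)  # assumption: "Z" or no offset is treated as UTC
    try:
        return datetime(year, month, day, hour, minute, second, tzinfo=timezone(offset))
    except ValueError:
        return None  # out-of-range field, e.g. month 13


def created_close_to_submission(
    creation_date: str,
    submitted_at: datetime,
    window: timedelta = timedelta(minutes=10),  # hypothetical threshold
) -> bool:
    """True if CreationDate falls within `window` before the submission time."""
    created = parse_pdf_date(creation_date)
    if created is None:
        return False
    return timedelta(0) <= submitted_at - created <= window


submitted = datetime(2024, 1, 15, 8, 34, tzinfo=timezone.utc)
flagged = created_close_to_submission("D:20240115103000+02'00'", submitted)
```

A flag from a check like this is a prompt for a closer look, not a conclusion, since an ordinary "Print to PDF" just before a deadline produces the same pattern.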
Signals not available on PDF
PDF omits or only weakly carries:
- Edit time / total time. Most PDF producers don't record this.
- Revision count. Same.
- Tracked changes. Almost never preserved in PDF output.
- Comment threads. Sometimes preserved as PDF annotations, but not in the structured form Word uses.
- Font fallback chains. Fonts are usually subset-embedded; fallback chains are lost.
If a student submits a PDF when DOCX would have been an option, that's a relevant procedural observation — you have less visibility into authorship.
Common false-positive paths
- LaTeX-authored PDFs have a distinctive Producer string (xdvipdfmx, pdfTeX, etc.) that may look unusual but is perfectly legitimate.
- OS-level "Print to PDF" strips most useful metadata; a PDF produced this way looks like an AI-export PDF in some channels.
- Mobile PDF exports (iOS Files, Android print) have characteristic fingerprints that aren't suspicious.
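One way to keep these legitimate paths from being misread is to map the Producer string to a tool family before any scoring. The sketch below is a minimal illustration: `PRODUCER_HINTS` and `classify_producer` are hypothetical names, the substrings are examples rather than a complete or authoritative list, and none of this reflects the rules Autotend Forensics actually applies.

```python
from typing import List, Tuple

# Illustrative substrings only (assumption); real Producer strings vary by version.
PRODUCER_HINTS: List[Tuple[str, str]] = [
    ("pdftex", "latex"),
    ("xdvipdfmx", "latex"),
    ("quartz pdfcontext", "os-print"),   # commonly seen on Apple print paths
    ("skia/pdf", "browser-print"),       # commonly seen on Chromium print paths
]


def classify_producer(producer: str) -> str:
    """Return a coarse tool family for a Producer string, or 'unknown'."""
    lowered = producer.lower()
    for needle, family in PRODUCER_HINTS:
        if needle in lowered:
            return family
    return "unknown"


families = [
    classify_producer(p)
    for p in ("pdfTeX-1.40.25", "xdvipdfmx (20220710)", "SomeOtherTool 2.0")
]
```

An "unknown" result only means the string is not in the table; it says nothing by itself about how the document was written.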
What to expect
A typical authored PDF scan surfaces:
- 5–10 metadata fields, vs. ~25 on the equivalent DOCX.
- Creator/Producer pair that maps to a known tool.
- Linguistic signals scored on the extracted body text.
- A structural fingerprint that corroborates (or contradicts) the Creator claim.
For high-stakes review, requesting the original DOCX (or authoring-tool source) gives you a substantially richer signal than the PDF alone.
Scan a PDF document now.
Free, browser-only, no signup. Autotend Forensics runs entirely in your browser.