PDF forensics — Autotend Forensics Methodology

PDF documents: fewer authoring signals than DOCX, but still useful for production-tool and timestamp signals.

PDF forensics: fewer signals, still useful

PDF is the most common submission format after DOCX, but it carries substantially less authorship metadata than the OOXML / ODT formats. PDF was designed for fixed-layout distribution; preserving every editor field a Word document records isn't part of the spec.

That said, PDFs still expose enough to be useful:

Document Info dictionary — Title, Author, Subject, Keywords, Creator (the originating app), Producer (the PDF generator), CreationDate, ModDate.
XMP metadata — a richer XML-based metadata block when present; can include the full edit history if the PDF was produced from a Word/Pages source with metadata preservation enabled.
Body text extraction — pdf.js extracts the page text; Autotend Forensics runs the linguistic signal detectors on that.
Embedded streams — fonts, images, and (sometimes) source-document residue.
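
The CreationDate and ModDate values in the Document Info dictionary are stored as PDF date strings of the form D:YYYYMMDDHHmmSSOHH'mm', where everything after the year is optional. As a minimal sketch of how such a value could be normalized before any timeline comparison, the Python below uses only the standard library. The name parse_pdf_date is illustrative and is not taken from Autotend Forensics.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# PDF date string: D:YYYYMMDDHHmmSSOHH'mm' (every field after the year is optional).
_PDF_DATE = re.compile(
    r"^(?:D:)?"
    r"(?P<year>\d{4})"
    r"(?P<month>\d{2})?"
    r"(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?"
    r"(?P<minute>\d{2})?"
    r"(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<tz_hour>\d{2})'?(?P<tz_minute>\d{2})?'?)?)?$"
)


def parse_pdf_date(raw: str) -> Optional[datetime]:
    """Return a datetime for a PDF date string, or None if it cannot be parsed.

    The result is timezone-aware when the string carries an offset or 'Z',
    and naive otherwise (the producer did not say which zone it meant).
    """
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        return None
    parts = match.groupdict()
    try:
        sign = parts["sign"]
        if sign is None:
            tz = None
        elif sign == "Z":
            tz = timezone.utc
        else:
            offset = timedelta(
                hours=int(parts["tz_hour"] or 0),
                minutes=int(parts["tz_minute"] or 0),
            )
            tz = timezone(offset if sign == "+" else -offset)
        return datetime(
            int(parts["year"]),
            int(parts["month"] or 1),   # missing month/day default to 01
            int(parts["day"] or 1),
            int(parts["hour"] or 0),    # missing time fields default to 00
            int(parts["minute"] or 0),
            int(parts["second"] or 0),
            tzinfo=tz,
        )
    except ValueError:
        # Out-of-range fields such as month 13 or an offset of 99 hours.
        return None
```

Returning None rather than raising keeps a malformed date from aborting a scan; a caller can treat an unparseable CreationDate as an observation in its own right.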

Signals available on PDF

Creator + Producer combination. Real authoring tools (Word, Pages, Google Docs export, LaTeX, InDesign) have characteristic Creator/Producer pairs. AI-content-export tools (ChatGPT's "download as PDF" path) leave distinctive fingerprints in both fields.
Linguistic signals. All the AI-assisted-writing detectors run against the extracted body text.
Structural signals. PDF compression patterns, stream-encoding choices, and font-embedding behavior identify the source tool independently of what Creator claims.
Timeline anomalies. PDFs produced via "Print to PDF" record the time of printing as CreationDate, so the source document's original creation date is lost; Autotend Forensics flags PDFs whose CreationDate is within minutes of submission.
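
The timeline check described above can be sketched as a simple comparison between CreationDate and the submission timestamp. This is an illustration only: the function name and the 10-minute default window are assumptions, and the source does not state what threshold Autotend Forensics actually uses beyond "within minutes".

```python
from datetime import datetime, timedelta, timezone


def creation_close_to_submission(
    created: datetime,
    submitted: datetime,
    window: timedelta = timedelta(minutes=10),  # assumed threshold, not from the product
) -> bool:
    """Return True when the PDF was created shortly before it was submitted."""
    if created.tzinfo is None or submitted.tzinfo is None:
        # Mixing naive and aware datetimes would raise a TypeError below;
        # fail early with a clearer message instead.
        raise ValueError("both timestamps must be timezone-aware")
    gap = submitted - created
    # A negative gap (created after submission) is a different anomaly
    # and is deliberately not reported by this check.
    return timedelta(0) <= gap <= window


created = datetime(2024, 3, 15, 14, 0, tzinfo=timezone.utc)
print(creation_close_to_submission(created, created + timedelta(minutes=4)))
print(creation_close_to_submission(created, created + timedelta(days=2)))
```

A positive result is only a prompt for a closer look: exporting to PDF immediately before submitting is also ordinary behavior.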

Signals not available on PDF

PDF omits, or only weakly carries:

Edit time / total time. Most PDF producers don't record this.
Revision count. Same.
Tracked changes. Almost never preserved in PDF output.
Comment threads. Sometimes preserved as PDF annotations, but not in the structured form Word uses.
Font fallback chains. Fonts are usually subset-embedded; fallback chains are lost.

If a student submits a PDF when DOCX would have been an option, that's a relevant procedural observation — you have less visibility into authorship.

Common false-positive paths

LaTeX-authored PDFs have a distinctive Producer string (xdvipdfmx, pdfTeX, etc.) that may look unusual but is perfectly legitimate.
OS-level "Print to PDF" strips most useful metadata; a PDF produced this way can look like an AI-export PDF on metadata-based signals.
Mobile PDF exports (iOS Files, Android print) have characteristic fingerprints that aren't suspicious.
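
One way to keep these legitimate cases from being reported as unknowns is to match the Producer string against a list of known tool families before scoring it. The sketch below is hypothetical: the hint table is a small illustrative sample, not the table Autotend Forensics uses, and the labels and function name are invented for the example.

```python
from typing import Optional

# Illustrative (substring, label) pairs, matched case-insensitively.
# pdfTeX and xdvipdfmx are the LaTeX producers named above; the other
# entries are assumed examples of common Producer strings.
PRODUCER_HINTS = [
    ("pdftex", "latex"),
    ("xdvipdfmx", "latex"),
    ("microsoft", "word"),
    ("quartz pdfcontext", "os-print-to-pdf"),
]


def classify_producer(producer: Optional[str]) -> str:
    """Map a Producer string to a coarse tool family label."""
    if not producer:
        return "missing"
    lowered = producer.lower()
    for needle, label in PRODUCER_HINTS:
        if needle in lowered:
            return label
    return "unrecognized"


print(classify_producer("pdfTeX-1.40.25"))
print(classify_producer("xdvipdfmx (20220710)"))
print(classify_producer(None))
```

Keeping "missing" and "unrecognized" as separate outcomes matters here: a stripped Producer field and an unfamiliar tool call for different follow-up.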

What to expect

A typical authored PDF scan surfaces:

5–10 metadata fields, vs. ~25 on the equivalent DOCX.
Creator/Producer pair that maps to a known tool.
Linguistic signals scored on the extracted body text.
A structural fingerprint that corroborates (or contradicts) the Creator claim.

For high-stakes review, requesting the original DOCX (or authoring-tool source) gives you a substantially richer signal than the PDF alone.