How to scan a PDF for tampering — an instructor's guide

A PDF looks like an opaque, finished document: no edit history, no traces of how it was made. That's the surface. Underneath, a PDF carries enough structural signals that an instructor with the right tool can usually answer "where did this come from?" in about 30 seconds. This page walks through what those signals are and how to read them.

The fast path

Drop the file into forensics.autotend.io. The report walks through every field below and flags anomalies. Free, browser-only, no signup.

If you want to understand what the scanner is actually doing — which is what this page is for — keep reading.

What's inside a PDF

A PDF is a sequence of "objects" (dictionaries, streams, arrays) linked by a cross-reference table at the end of the file. The relevant top-level objects for forensics:

  • Info dictionary — the legacy metadata block. Carries /Title, /Author, /Subject, /Keywords, /Creator, /Producer, /CreationDate, /ModDate.
  • XMP metadata — a newer XML-based metadata block, parallel to Info. Sometimes carries different values than Info — which is itself a forensics signal.
  • Document catalog — points to the page tree, the metadata blocks, and any embedded forms / signatures / attachments.
  • Page tree — the pages themselves; each page carries content streams and resource dictionaries (fonts, images).
  • Cross-reference table (xref) — at the end of the file, indexes every object. A PDF that has been edited and saved incrementally carries multiple xref tables, because each save appends a new one. Counting xref tables tells you how many times the file was saved, with one caveat: a tool that rewrites the whole file on save resets the count to one.
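
If you want to inspect these blocks yourself, a general-purpose library is enough. A minimal sketch using the open-source pypdf library (one of several that would work; the filename is a placeholder):

    from pypdf import PdfReader

    reader = PdfReader("submission.pdf")  # placeholder filename

    # Info dictionary: the legacy key/value metadata block
    info = reader.metadata
    if info:
        print("Producer:    ", info.get("/Producer"))
        print("CreationDate:", info.get("/CreationDate"))
        print("ModDate:     ", info.get("/ModDate"))

    # XMP: the XML metadata block, parallel to Info
    xmp = reader.xmp_metadata
    if xmp:
        print("XMP producer:", xmp.pdf_producer)
        print("XMP modified:", xmp.xmp_modify_date)

When the two blocks disagree (Info names one producer, XMP names another), that disagreement is the signal the XMP bullet above describes: typically a second tool touched the file and updated only one of them.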

The five highest-leverage signals

1. The Producer field

Covered in detail in the PDF Producer field guide. One-line summary: this field tells you what program made the PDF. Microsoft Word for Microsoft 365, Skia/PDF m120 Google Docs Renderer, and iText 7.2.0 tell three very different stories.
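
A sketch of that triage in code, assuming pypdf; the substring lists below are illustrative, not exhaustive, and real Producer strings vary by version:

    from pypdf import PdfReader

    # Illustrative substrings only -- real Producer values vary by version.
    WORD_PROCESSORS = ("Microsoft", "Skia/PDF", "LibreOffice", "pdfTeX", "macOS")
    PROGRAMMATIC = ("iText", "ReportLab", "FPDF", "wkhtmltopdf")

    meta = PdfReader("submission.pdf").metadata
    producer = str(meta.get("/Producer", "")) if meta else ""

    if any(s in producer for s in PROGRAMMATIC):
        print("programmatic generator:", repr(producer))
    elif any(s in producer for s in WORD_PROCESSORS):
        print("word-processor export:", repr(producer))
    else:
        print("unfamiliar or missing producer:", repr(producer))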

2. CreationDate vs ModDate

Both fields exist in nearly every PDF. CreationDate is when the file was first written; ModDate is when it was last saved. For a PDF that was created once and never modified, they should be identical (or within a few seconds of each other for the first save).

When ModDate is significantly after CreationDate, the file was opened and re-saved at some point. That can mean:

  • The student opened it in Acrobat to fill in form fields (legitimate).
  • The student or someone else made content edits (worth investigating).
  • The file passed through a converter (e.g., an online "PDF to Word and back to PDF" round-trip).

When CreationDate is missing but ModDate is present, the file likely came through a converter that strips CreationDate but preserves ModDate. Programmatic PDF generators frequently do this.
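
In code, pypdf parses both date fields into datetime objects, which makes the comparison a single subtraction. A sketch; the 60-second threshold is a judgment call, not a standard:

    from pypdf import PdfReader

    meta = PdfReader("submission.pdf").metadata
    created = meta.creation_date if meta else None       # parsed /CreationDate
    modified = meta.modification_date if meta else None  # parsed /ModDate

    if created and modified:
        gap = modified - created
        if gap.total_seconds() > 60:  # heuristic threshold
            print(f"re-saved {gap} after creation -- ask about the workflow")
        else:
            print("created and saved once")
    elif modified and not created:
        print("CreationDate missing, ModDate present: converter fingerprint")
    else:
        print("dates absent or incomplete")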

3. xref table count

If you can read PDF internals (or have a forensics tool that surfaces this), count the number of xref markers in the file. A PDF written once has one xref table. A PDF saved twice has two. A PDF that has been opened, saved, opened, and saved across multiple sessions has a count to match.

For student submissions, 1 xref is the modal case. 2–3 is normal (one save, one annotation pass). 5+ on a student paper is unusual and worth a question — what tools did the file pass through?
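
Counting doesn't need a full parser. Each save, whether it writes a classic table or a cross-reference stream, ends with a startxref keyword, so scanning the raw bytes is a reasonable proxy. A sketch, with two caveats noted in the comments:

    # Approximate the save count from raw bytes. Linearized ("fast web
    # view") files legitimately carry one extra marker, and a tool that
    # rewrites the whole file on save resets the count to one.
    with open("submission.pdf", "rb") as f:
        data = f.read()
    print(data.count(b"startxref"), "startxref marker(s)")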

4. Embedded font subsets vs full fonts

PDFs typically subset their fonts, embedding only the glyphs actually used in the document. The font names appear as ABCDEF+Calibri, where ABCDEF is a tag of six random uppercase letters chosen when the subset is embedded, ensuring the subset name is unique to that file.

Two PDFs that share an identical subset prefix almost certainly share the same embedded font subset: with 26^6 (about 309 million) possible tags, a chance collision between independently generated documents is negligible. If two student submissions show the same ABCDEF+Calibri subset prefix, the files share embedded resources, suggesting they were generated from the same source document.

This is a strong forensics signal that's invisible to non-specialist tools.
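
Checking for it is mechanical. A sketch, again with pypdf: collect every subset-tagged BaseFont name per file and intersect the sets. This ignores resources inherited from the page tree, which is fine for a first pass; paper_a.pdf and paper_b.pdf are placeholders:

    import re
    from pypdf import PdfReader

    SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")  # e.g. ABCDEF+Calibri

    def subset_fonts(path):
        tags = set()
        for page in PdfReader(path).pages:
            try:
                fonts = page["/Resources"]["/Font"]
            except KeyError:
                continue
            for ref in fonts.values():
                # BaseFont is a name object like /ABCDEF+Calibri
                name = str(ref.get_object().get("/BaseFont", "")).lstrip("/")
                if SUBSET_TAG.match(name):
                    tags.add(name)
        return tags

    shared = subset_fonts("paper_a.pdf") & subset_fonts("paper_b.pdf")
    print("shared subset fonts:", shared or "none")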

5. Content-stream extraction shape

The visible text of a PDF lives in content streams — drawing commands that say "place this run of characters at this position with this font." Two patterns are diagnostic:

  • Text extracts as a single contiguous run with consistent positioning. This is consistent with text-from-word-processor: the text was rendered by Word, Google Docs, or LaTeX and the positioning is uniform.
  • Text extracts as many small isolated runs with inconsistent positioning. This is consistent with text-from-image-OCR: an image was OCR'd and the OCR engine produced a per-word or per-character placement. Look for this pattern when you suspect a student took a photo of someone else's paper and OCR'd it.

A content-stream view, if your forensics tool surfaces one, lets you tell these apart at a glance.
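
One way to quantify the difference, assuming pypdf, whose text extractor can report each text-showing operation through a visitor callback: count the runs on a page and their average length. Many short runs point toward OCR-style placement; a few long runs point toward word-processor output. The thresholds you would apply are heuristic:

    from pypdf import PdfReader

    runs = []

    def visit(text, cm, tm, font_dict, font_size):
        # Called once per text-showing operation in the content stream.
        if text.strip():
            runs.append(text)

    page = PdfReader("submission.pdf").pages[0]  # first page as a sample
    page.extract_text(visitor_text=visit)

    avg = sum(map(len, runs)) / max(len(runs), 1)
    print(f"{len(runs)} text runs, {avg:.1f} chars per run on average")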

What to do when a PDF is anomalous

  1. Don't escalate on a single signal. An unusual Producer is not by itself proof. What matters is multiple signals pointing the same way.
  2. Ask the student about their workflow. "What program did you use to make this PDF?" usually resolves it.
  3. Compare against the rest of the class. A class where 90% of PDFs come from Word and one comes from iText is worth a question; a class where everything comes from iText is just the courseware in use.
  4. Document your reasoning. If you do escalate, the structural signals are documentable evidence. Screenshot the relevant fields.

What scanning a PDF cannot tell you

  • Whether the visible content was written by a human. Document forensics is about file provenance, not prose authorship. Even a PDF that's structurally pristine can carry copy-pasted content.
  • The age of the content. A PDF with CreationDate: 2026-05-12 could be a freshly generated PDF of writing the student did six months ago and finalized today.
  • Whether the student submitted the right file. Forensics reports structural shape; it is silent on the question "is this the version I asked for?"

For the full structural-signals methodology, see the structural signals page. For format-specific PDF detail, see the PDF format methodology page.