Font & encoding signals — Autotend Forensics Methodology

What font and encoding signals are

Beneath every visible glyph, a document records:

The font family (e.g. Calibri, Times New Roman) selected for each run.
A font fallback chain — what to render when the chosen font isn't installed.
The character encoding of the underlying bytes (UTF-8, Windows-1252, or another legacy code page).
Hashes of any embedded font files, when the document carries fonts for portability.

A single author usually sticks to one or two font families and one encoding throughout a document. Text pasted from a different environment carries that environment's font and encoding fingerprint.
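As a rough illustration of the fingerprint idea, the sketch below treats each run's (font family, encoding) pair as its fingerprint and reports fingerprints that cover only a small share of the text. The `Run` shape, the `minority_fingerprints` name, and the 10% `min_share` threshold are assumptions made for this example, not part of Autotend Forensics.

```python
from collections import Counter
from typing import NamedTuple


class Run(NamedTuple):
    """One formatting run. Hypothetical shape, for illustration only."""
    text: str
    font_family: str
    encoding: str


def minority_fingerprints(runs, min_share=0.10):
    """Return {(font_family, encoding): share} for fingerprints that
    cover less than min_share of all characters in the document."""
    chars = Counter()
    for run in runs:
        # Weight by text length so a long body outvotes short pasted runs.
        chars[(run.font_family, run.encoding)] += len(run.text)
    total = sum(chars.values())
    if total == 0:
        return {}
    return {fp: n / total for fp, n in chars.items() if n / total < min_share}


runs = [
    Run("x" * 950, "Calibri", "utf-8"),
    Run("y" * 50, "Helvetica Neue", "windows-1252"),
]
for fingerprint, share in minority_fingerprints(runs).items():
    print(fingerprint, f"{share:.0%}")
```

A low share is only a prompt to look closer; a document that legitimately uses a second font for headings would show up here too.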

What the signals tell you

Font-fallback patterns. If a document declares "Helvetica Neue → Helvetica → Arial → sans-serif" in one run and just "Calibri" everywhere else, the diverging run was probably authored elsewhere. macOS / iWork tends to write longer fallback chains than Windows / Word.
Character-set mismatches. A document whose primary text is UTF-8 but which contains a paragraph encoded in Windows-1252 has probably been assembled from two sources. The usual tell is quotation marks: smart quotes in the pasted paragraph whose bytes or style do not match the quotes in the surrounding text.
Embedded font hashes. Some authoring tools always embed Calibri Light; others never do. Mismatched embedded-font inventories are weakly correlated with multi-source documents.
Right-to-left direction tags. Pasted Arabic / Hebrew runs carry bidirectional (bidi) direction markers that survive copy-paste. A document that occasionally toggles direction for no visible reason was probably assembled from more than one source.
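Two of the checks above can be sketched with the Python standard library alone. The function names are hypothetical, and the sketch assumes you already have the raw bytes of a text span and that direction markers appear as Unicode bidi control characters in the extracted text; formats that store direction as run properties would need a different check.

```python
import unicodedata

# Explicit bidi controls: marks, embeddings, overrides, and isolates.
BIDI_CONTROLS = {
    "\u200e", "\u200f", "\u061c",
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def classify_bytes(raw: bytes) -> str:
    """Guess the encoding of a byte span by trying strict decoders in order."""
    for label, codec in (("ascii", "ascii"), ("utf-8", "utf-8"),
                         ("windows-1252", "cp1252")):
        try:
            raw.decode(codec)
            return label
        except UnicodeDecodeError:
            continue
    return "other"


def bidi_control_offsets(text: str):
    """Return (offset, Unicode name) for every bidi control character."""
    return [(i, unicodedata.name(ch))
            for i, ch in enumerate(text) if ch in BIDI_CONTROLS]


# Windows-1252 smart quotes are the single bytes 0x93 / 0x94,
# which are not valid as standalone bytes in UTF-8.
print(classify_bytes(b"\x93quoted\x94"))
print(classify_bytes("\u201cquoted\u201d".encode("utf-8")))
print(bidi_control_offsets("abc\u202bdef\u202c"))
```

The Windows-1252 label is a fallback guess: nearly any byte sequence decodes under that code page, so it means "not valid UTF-8" more than it proves the origin.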

What font / encoding signals cannot tell you

These signals are weak on their own. Many legitimate workflows trip them:

Students who write in Notion / Bear / Obsidian and paste into Word will have one fallback chain for the body and Word's default for headings.
Citations pasted from BibTeX / Zotero often carry their own formatting and font.
Documents by international students writing in English commonly carry language tags that change from one paragraph to the next.

Use this channel as corroborating evidence only after a stronger signal (metadata edit-time, paste detection, AI-self-disclosure) has already raised a concern.

What we surface

Autotend Forensics inventories:

Every distinct font family appearing in the body, with paragraph counts.
Distinct fallback chains and how many runs use each.
Encoding distribution (UTF-8 / Windows-1252 / other).
Embedded-font list with bytes per font (a quick way to spot bloated font tables from automated tooling).
Per-paragraph language and direction tags.
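The inventory above amounts to a handful of counters. The sketch below shows one way it could be assembled; the input shape (paragraph dicts with `runs`, `lang`, and `dir` keys, plus a name-to-bytes mapping of embedded fonts) and the `build_inventory` name are assumptions for illustration, not the product's actual data model.

```python
from collections import Counter


def build_inventory(paragraphs, embedded_fonts):
    """Summarise font, fallback, encoding, and tag data for one document."""
    family_paragraphs = Counter()  # font family -> paragraphs it appears in
    chain_runs = Counter()         # fallback chain -> runs using it
    encoding_runs = Counter()      # encoding label -> runs using it
    for para in paragraphs:
        families = set()
        for run in para["runs"]:
            families.add(run["font"])
            chain_runs[tuple(run["fallback"])] += 1
            encoding_runs[run["encoding"]] += 1
        # Count each paragraph once per family, however many runs use it.
        family_paragraphs.update(families)
    return {
        "families": dict(family_paragraphs),
        "fallback_chains": dict(chain_runs),
        "encodings": dict(encoding_runs),
        "embedded_font_bytes": {n: len(d) for n, d in embedded_fonts.items()},
        "paragraph_tags": [(p["lang"], p["dir"]) for p in paragraphs],
    }


calibri = {"font": "Calibri", "fallback": ["Calibri"], "encoding": "utf-8"}
pasted = {
    "font": "Helvetica Neue",
    "fallback": ["Helvetica Neue", "Helvetica", "Arial", "sans-serif"],
    "encoding": "windows-1252",
}
paragraphs = [
    {"lang": "en-US", "dir": "ltr", "runs": [calibri]},
    {"lang": "en-US", "dir": "ltr", "runs": [calibri, pasted]},
]
inventory = build_inventory(paragraphs, {"Calibri Light": b"\x00" * 2048})
for key, value in inventory.items():
    print(key, value)
```

Families are counted per paragraph while fallback chains and encodings are counted per run, matching the units named in the list above.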

Inspector view: every flagged run links back to the exact character offset so you can see the discontinuity in context.