ZIP-shape oddities, embedded objects, file-system path leaks, export-source fingerprints.
What structural signals are
DOCX, XLSX, PPTX, ODT — every modern office document is a ZIP archive with a defined internal directory layout. The ZIP shape itself carries a fingerprint:
- The order in which the editor wrote the files into the archive.
- Which optional parts are present (custom XML, settings, themes, embedded fonts).
- The compression levels the editor used per file.
- File-system path leaks in embedded objects.
- Export-source fingerprints — markers that identify the tool that produced the file, separately from the application name in metadata.
The same logical document produced by Microsoft Word vs. LibreOffice vs. Google Docs export vs. a Pandoc conversion has a different ZIP shape, even when the visible content is identical.
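As a rough illustration of what "ZIP shape" means in practice, the sketch below lists an archive's entries in the order they are recorded, with the compression method, sizes, and CRC for each. It uses only Python's standard `zipfile` module. The function name `zip_entry_sequence` is a hypothetical helper, not part of any product API, and the sketch assumes the central-directory order reflects the order in which the editor wrote the parts, which is usually but not necessarily the case.

```python
import io
import zipfile


def zip_entry_sequence(data: bytes) -> list[dict]:
    """Return the archive's entries in central-directory order."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        rows = []
        for index, info in enumerate(zf.infolist()):
            if info.compress_type == zipfile.ZIP_DEFLATED:
                method = "deflated"
            elif info.compress_type == zipfile.ZIP_STORED:
                method = "stored"
            else:
                # Rare in office documents; report the raw method number.
                method = str(info.compress_type)
            rows.append(
                {
                    "index": index,
                    "name": info.filename,
                    "method": method,
                    "compressed_size": info.compress_size,
                    "uncompressed_size": info.file_size,
                    "crc": f"{info.CRC:08x}",
                }
            )
        return rows
```

Calling `zip_entry_sequence(open("report.docx", "rb").read())` on two visually identical documents from different editors will typically return different sequences, which is the fingerprint described above.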
What the signals tell you
- Export-source mismatch. A document whose metadata claims Microsoft Word but whose ZIP shape matches the canonical Google Docs export pattern has most likely passed through a Docs export workflow that the metadata field doesn't reveal.
- Re-export markers. Many conversion tools (Pandoc, CloudConvert, AbiWord) leave distinctive trace files in the ZIP. Their presence tells you a third-party tool touched the document.
- Embedded-object path leaks. Embedded images and equations sometimes carry the file-system path they were inserted from (C:\Users\X\Downloads\IMG_2349.jpg). This isn't directly suspicious but reveals the authoring environment.
- Custom-XML residue. Some assignment-submission portals inject a custom XML part with a submission timestamp. Its presence or absence can corroborate the submission flow.
- Compression anomalies. Editors compress text parts and binary parts differently. A text part compressed the way an editor would normally treat a binary part (for example, an XML part stored uncompressed) suggests the file was processed by an unusual tool.
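The path-leak signal in the list above can be approximated with a plain text scan of the archive's XML and relationship parts. This is a minimal sketch, not a description of how any particular tool does it: the regular expression covers only Windows drive paths, `file:///` URIs, and macOS or Linux home directories, and both `PATH_PATTERN` and `find_path_leaks` are hypothetical names chosen for illustration.

```python
import io
import re
import zipfile

# Illustrative, not exhaustive: file URIs, Windows drive paths,
# and macOS / Linux home directories.
PATH_PATTERN = re.compile(
    r'file:///[^"<>\s]+'
    r'|[A-Za-z]:\\[^"<>|?*\r\n]+'
    r'|/(?:Users|home)/[^"<>\s]+'
)


def find_path_leaks(data: bytes) -> list[tuple[str, str]]:
    """Return (part name, leaked path) pairs found in XML and .rels parts."""
    leaks = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if not info.filename.endswith((".xml", ".rels")):
                continue  # skip binary parts such as images and fonts
            text = zf.read(info).decode("utf-8", errors="replace")
            for match in PATH_PATTERN.finditer(text):
                leaks.append((info.filename, match.group(0)))
    return leaks
```

Reporting the part name alongside each hit keeps the context visible, so a reviewer can tell a path in an image description from one in a relationship target.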
What structural signals cannot tell you
Structural signals are about provenance, not authorship. They tell you what tools touched the file; they don't tell you whether a human or an AI wrote the content. The right uses:
- Corroborating metadata about which application authored the file.
- Spotting copy-via-conversion workflows that strip useful metadata.
- Identifying batch-generated files (a stack of submissions with identical ZIP shape suggests they were assembled by the same tool).
The wrong use is treating structural signals as a verdict on their own. A LibreOffice user, for example, will always trip the "non-Microsoft export" flag — that's not academic dishonesty.
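To make the batch-generation check concrete, here is one possible sketch: reduce each archive to a signature built from its ordered entry names and compression methods, then group files that share a signature. The names `shape_signature` and `group_by_shape` are hypothetical, and the choice of what goes into the signature is an assumption. Files saved by the same editor with the same features will often collide, so a shared signature is a prompt for a closer look, not a finding.

```python
import hashlib
import io
import zipfile
from collections import defaultdict


def shape_signature(data: bytes) -> str:
    """Hash the ordered (entry name, compression method) pairs of an archive."""
    digest = hashlib.sha256()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            # Content and sizes are deliberately ignored: only the shape counts.
            digest.update(f"{info.filename}\x00{info.compress_type}\n".encode("utf-8"))
    return digest.hexdigest()


def group_by_shape(files: dict[str, bytes]) -> list[list[str]]:
    """Return groups of two or more file labels that share one signature."""
    groups = defaultdict(list)
    for label, data in files.items():
        groups[shape_signature(data)].append(label)
    return [sorted(labels) for labels in groups.values() if len(labels) > 1]
```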
What we surface
For every supported format, Autotend Forensics extracts:
- The exact ZIP entry sequence (per-file order, name, compressed size, uncompressed size, CRC).
- The detected export source (Word, Google Docs export, LibreOffice, Pandoc, etc.) with the specific fingerprint that drove the detection.
- File-system paths leaked in embedded objects, with their context.
- Whether the document carries known submission-portal residue (Turnitin, Canvas, Blackboard injection markers).
Reviewers can compare the export-source detection against the metadata-claimed application — a mismatch is the most useful signal in this channel.
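The comparison can be sketched for OOXML files (DOCX, XLSX, PPTX), where the claimed application normally lives in the `Application` element of `docProps/app.xml`. The sketch below reads that field and compares it with a detected source label supplied by the caller. The helper names and the substring comparison are assumptions made for illustration; ODF files keep the equivalent value elsewhere and are not handled here.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Optional

EXT_PROPS_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def claimed_application(data: bytes) -> Optional[str]:
    """Read the Application field from docProps/app.xml, if the part exists."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        if "docProps/app.xml" not in zf.namelist():
            return None
        root = ET.fromstring(zf.read("docProps/app.xml"))
    node = root.find(f"{{{EXT_PROPS_NS}}}Application")
    return node.text if node is not None else None


def source_mismatch(claimed: Optional[str], detected: str) -> bool:
    """True when metadata names an application that does not contain the detected label.

    The case-insensitive substring test is a crude, assumed heuristic.
    """
    if not claimed:
        return False  # nothing claimed, so nothing to contradict
    return detected.lower() not in claimed.lower()
```

For example, a claimed value of "Microsoft Office Word" with a detected label of "Google Docs" returns `True`, while a detected label of "Word" returns `False`.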
Scan a document for structural signals now.
Free, browser-only, no signup. Autotend Forensics runs entirely in your browser.
Open Autotend Forensics →