PDF Parsing: A Battle Against the Spec

2025-08-04

Parsing a PDF looks straightforward on paper: read the version header, locate the cross-reference table, use its object offsets to find each object, and finally load the catalog dictionary. Reality is far messier. Real-world files routinely violate the PDF specification, with incorrect `startxref` pointers, garbage data before the header, and malformed cross-reference tables. Drawing on an analysis of a large number of real PDF files, the author catalogs these problems and points out that existing PDF viewers work because they tolerate non-compliant input. The article explains these parsing challenges in accessible terms and offers practical lessons for developers.
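To make the failure modes concrete, here is a minimal Python sketch of a tolerant lookup for a classic cross-reference table. It is not the author's code and not any viewer's actual algorithm. The function names (`find_header_offset`, `find_startxref`, `locate_xref`), the 1024-byte and 2048-byte search windows, and the fallback order are all assumptions chosen for illustration. It handles only tables introduced by the `xref` keyword, not cross-reference streams.

```python
import re
from typing import Optional


def find_header_offset(data: bytes, search_limit: int = 1024) -> int:
    """Return the offset of '%PDF-' within the first search_limit bytes, or -1.

    A non-zero result means garbage precedes the header.
    The 1024-byte window is an assumption of this sketch.
    """
    return data[:search_limit].find(b"%PDF-")


def find_startxref(data: bytes, tail_size: int = 2048) -> Optional[int]:
    """Return the number after the last 'startxref' near the end of the file."""
    tail = data[-tail_size:]
    matches = list(re.finditer(rb"startxref\s+(\d+)", tail))
    if not matches:
        return None
    return int(matches[-1].group(1))


def locate_xref(data: bytes) -> Optional[int]:
    """Best-effort search for the offset of a classic 'xref' table."""
    header = find_header_offset(data)
    if header < 0:
        return None  # no recognizable header: treat as not a PDF

    declared = find_startxref(data)
    if declared is not None:
        # Try the pointer as written, then shifted by the leading garbage.
        for candidate in (declared, declared + header):
            if data[candidate:candidate + 4] == b"xref":
                return candidate

    # Fallback: take the last bare 'xref' keyword, skipping 'startxref'.
    matches = list(re.finditer(rb"(?<!start)xref", data))
    if matches:
        return matches[-1].start()
    return None


# Usage: a tiny synthetic file with 8 bytes of junk before the header,
# so the declared startxref offset is off by 8.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
tail = b"xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\nstartxref\n"
pdf = body + tail + str(len(body)).encode() + b"\n%%EOF\n"
damaged = b"GARBAGE\n" + pdf
print(locate_xref(pdf), locate_xref(damaged))
```

The design choice here is to trust the declared pointer first and fall back to scanning only when it fails. This keeps well-formed files fast while still recovering from the kinds of damage the paragraph lists.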
