by mkl 1425 days ago
> As time has gone on, and we have encountered more and more PDF files with ever more unexpected deviations from the specification

Does anyone know of a collection of malformed PDF files? It would be useful for testing PDF processing programs.

5 comments

Technically not all of these are malformed (sometimes the document is well-formed ISO PDF but the software won't accept it), but this corpus is a dump of all the PDFs that were reported as problematic in a range of software, including Ghostscript, PDF.js (Mozilla) and PDFium (Chromium): https://www.pdfa.org/a-new-stressful-pdf-corpus/

(Note that the majority of them only trigger relatively harmless rendering issues, but some PDFs here have caused crashes, and certain malicious ones have even led to RCE and process takeover.)

There are some here, as test files in the qpdf library: https://github.com/qpdf/qpdf/tree/main/qpdf/qtest/qpdf

(But still, note: A couple of months ago I wrote a low-level PDF parser—just parse the PDF file's bytes into PDF objects, nothing more—and fed it all the PDF files that happened to be present on my laptop, and ran into some files that (some) PDF viewers open, but even qpdf doesn't. I say "even" because qpdf is really good IMO.)
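
A rough sketch of that kind of "feed it every PDF on disk" run, in Python. Everything here is an assumption for illustration: `collect_failures` and `looks_like_pdf` are hypothetical names, and `looks_like_pdf` is only a stand-in check with loose heuristics, not a real parser. Swap in your own parse function that raises on failure.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def looks_like_pdf(data: bytes) -> None:
    """Stand-in for a real parser: raise ValueError if basic markers are missing.

    These are loose heuristics chosen for the sketch, not a conformance check.
    """
    if b"%PDF-" not in data[:1024]:
        raise ValueError("no %PDF- header in the first 1024 bytes")
    if b"%%EOF" not in data[-1024:]:
        raise ValueError("no %%EOF marker in the last 1024 bytes")


def collect_failures(
    root: str,
    parse: Callable[[bytes], None] = looks_like_pdf,
) -> List[Tuple[Path, str]]:
    """Run `parse` on every *.pdf under `root` and return (path, error) pairs."""
    failures = []
    for path in sorted(Path(root).rglob("*.pdf")):
        try:
            parse(path.read_bytes())
        except Exception as exc:  # keep going; we want the full list of failures
            failures.append((path, repr(exc)))
    return failures
```

The files that end up in the returned list are the ones worth comparing against what viewers and qpdf do with them.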

Artifex has a public suite of PDF files here:

http://git.ghostscript.com/?p=tests.git;a=tree;f=pdf;h=2ce4f...

They're not all malformed, and they're mostly used for snapshot testing, but they cover a wide range of corner cases.

One trick is to fuzz PDFs yourself: take any PDF file, open it in vi or vim, write over anything you see, and save it. Crude, but if all you need are some broken PDF files, that will do it.
Fuzzing sounds like a very good idea to employ right from the beginning when writing parsers for complicated file formats.
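
The hand-editing trick above can be made repeatable with a few lines of Python. This is a minimal sketch of dumb byte-level mutation, not a proper coverage-guided fuzzer; `mutate_bytes` and `write_fuzzed_copies` are hypothetical names, and the output file naming is an arbitrary choice.

```python
import random
from pathlib import Path
from typing import Optional


def mutate_bytes(data: bytes, n_mutations: int = 16, seed: Optional[int] = None) -> bytes:
    """Return a copy of `data` with up to `n_mutations` bytes overwritten at random."""
    if not data:
        return data
    rng = random.Random(seed)  # seeded so a crashing input can be regenerated
    buf = bytearray(data)
    for _ in range(n_mutations):
        pos = rng.randrange(len(buf))
        buf[pos] = rng.randrange(256)
    return bytes(buf)


def write_fuzzed_copies(src_path: str, out_dir: str, count: int = 10, n_mutations: int = 16) -> None:
    """Write `count` mutated copies of the file at `src_path` into `out_dir`."""
    src = Path(src_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    data = src.read_bytes()
    for i in range(count):
        # Use the copy index as the seed, so copy i is always the same mutation.
        mutated = mutate_bytes(data, n_mutations, seed=i)
        (out / f"{src.stem}_fuzz{i:03d}.pdf").write_bytes(mutated)
```

Something like `write_fuzzed_copies("sample.pdf", "fuzzed")` would then give you ten broken variants of one file. Because the mutations are seeded, a variant that crashes your parser can be recreated later from the original file and its index.
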
I wasn't able to readily find any collections, and searching for anything plus the keyword "pdf" just returns links to articles published as PDFs.

That said, this GitHub topic may have some pointers: https://github.com/topics/malware-samples