by felipefar
668 days ago
Nice job! I've been wanting to write a PDF parser for learning purposes, but have been put off by the number of files that open-source PDF parsers have in their repos and the different tech that they need (image formats, compression formats, etc.). I'll probably settle for a reasonable ratio between PDFs supported and learning extracted from the project, so it's useful to know that PDFs with JS are not very widely used. Also, I'm the developer of reference management software, and have naturally been thinking about what it'd take to save, in the PDF file itself, metadata fields that are generally useful for advanced readers and academics: original publication dates, ISBNs, DOIs, edition, publisher, etc., instead of just author and title.
|
Once you have parsing and writing of a simple PDF file going (sections 7.2, 7.3, 7.4, 7.5, 7.7), add in support for encryption (section 7.6). At that point you can at least parse and write nearly all PDF files.
Then implement the other things you need gradually. For example:
* Need support for parsing or creating the contents of a page? -> sections 7.8, 8, and 9. Mind you, start out with only supporting the built-in PDF fonts for creating text and later add support for TrueType (easier) and OpenType (harder if you need to implement the font parser yourself).
* Need support for annotations? -> section 12.5
And so on.
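To give an idea of how small the first parsing step from section 7.5 can be, here is a minimal sketch that locates the cross-reference offset at the end of a file. The function name `find_startxref` and the 1024-byte tail size are my own choices for illustration, not something the spec prescribes, and real-world files need more lenient handling than this.

```python
import re


def find_startxref(pdf_bytes, tail_size=1024):
    """Return the byte offset given after the last 'startxref' keyword.

    Hypothetical helper: a PDF file ends with 'startxref', the byte offset
    of the most recent cross-reference section, and the '%%EOF' marker.
    """
    # Only the end of the file matters, so look at the last tail_size bytes.
    tail = pdf_bytes[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref/%%EOF pair found near the end of the file")
    # Incrementally updated files have several; the last one is the current one.
    return int(matches[-1])


sample = b"%PDF-1.7\n% (objects omitted)\nstartxref\n1234\n%%EOF\n"
print(find_startxref(sample))
```

From that offset you would read the cross-reference table or stream and the trailer, which in turn point to the document catalog.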
If you just need to store the metadata in the PDF, you only need support for parsing and writing a PDF, because this usually also means that you can modify the PDF object tree, which is what storing the metadata requires. However, if you need to store that metadata in a way that is usable by other PDF processors, you would need to store it as an XMP packet in a metadata stream, and creating that is yet another deep dive if you don't have an XMP library available. See section 14.3.2 in the PDF spec for this (btw., the latest PDF spec is available at no cost at https://pdfa.org/resource/iso-32000-2/).
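For the XMP route, a rough sketch of building a small packet with only the Python standard library could look like the following. The helper name `build_xmp_packet` is made up for this example, and using the PRISM namespace for DOI and ISBN is an assumption on my part about which schema fits bibliographic fields; check what the processors you care about actually read.

```python
import xml.etree.ElementTree as ET

# Namespace URIs; the PRISM one is an assumption about a suitable schema.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "prism": "http://prismstandard.org/namespaces/basic/2.0/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, name):
    """Build an ElementTree qualified name such as '{uri}title'."""
    return f"{{{NS[prefix]}}}{name}"


def build_xmp_packet(title, publisher=None, doi=None, isbn=None):
    """Hypothetical helper: return a UTF-8 encoded XMP packet as bytes."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # dc:title is a language alternative (rdf:Alt with xml:lang items).
    title_el = ET.SubElement(desc, q("dc", "title"))
    alt = ET.SubElement(title_el, q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    if publisher:
        # dc:publisher is an unordered list (rdf:Bag).
        bag = ET.SubElement(ET.SubElement(desc, q("dc", "publisher")), q("rdf", "Bag"))
        ET.SubElement(bag, q("rdf", "li")).text = publisher
    if doi:
        ET.SubElement(desc, q("prism", "doi")).text = doi
    if isbn:
        ET.SubElement(desc, q("prism", "isbn")).text = isbn

    body = ET.tostring(xmpmeta, encoding="unicode")
    header = '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    trailer = '<?xpacket end="w"?>'
    return "\n".join([header, body, trailer]).encode("utf-8")


packet = build_xmp_packet("Example Title", publisher="Example Press", doi="10.1000/182")
print(packet.decode("utf-8"))
```

The resulting bytes would become the content of a stream object with `/Type /Metadata` and `/Subtype /XML`, referenced from the `/Metadata` entry of the document catalog.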