by jahewson 3264 days ago
PDFBox committer here. If you want even lower-level access to the page content stream, without anything 'clever' at all, check out the PDFGraphicsStreamEngine class, which is a superclass of the text extraction and rendering classes. It gives you access to the raw glyphs. You can override PageRenderer too for visual debugging, e.g. to render glyph bounding boxes. We have an interactive Swing PDFDebugger which does just that.

https://github.com/apache/pdfbox/blob/6f18d7c4bef4d23a22d/ex...

1 comment

Thanks for the guidance, I'll take a look.