Hacker News new | ask | show | jobs
by pierre 2305 days ago
pdf2json font name can be uncorrect sometime as it does only extract them based on a pre-set collection of fonts. I suggest using this fork that fix it :

https://github.com/AXATechLab/pdf2json

Bounding box also can be off with pdf2json. Pdf.js do a better job but have a tendency to no handling some ligature/glyph well, transforming word like finish to "f nish" sometime (eating the i in this case). pdfminer (python) is the best solution yet but a thousand time slower....