by robinhowlett
2293 days ago
Thanks for the links. I agree about the (x, y, text) callout, but other metadata such as font size can be useful too. Regexes have limitations, but I was able to leverage them sufficiently for PDFs from a single source. I parsed over 1 million PDFs with a fairly complex layout using Apache PDFBox and wrote about it here: https://www.robinhowlett.com/blog/2019/11/29/parsing-structu...
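
To make the font-size point concrete, here is a minimal sketch of the general idea. It is not the code from the blog post and does not use PDFBox. It assumes your extractor already gives you chunks of (x, y, font size, text); the `Chunk` type, the `parse_headings` helper, the "Lot N - Price" layout, and the 12pt threshold are all invented for illustration. It also assumes y grows downward from the top of the page.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    """One piece of extracted text plus layout metadata (hypothetical shape)."""
    x: float
    y: float
    font_size: float
    text: str


# Made-up pattern for a fixed-layout heading such as "Lot 3 - Price: $25,000".
HEADING_RE = re.compile(r"^Lot (\d+) - Price: \$([\d,]+)$")


def parse_headings(chunks, min_font_size=12.0):
    """Return (lot, price) pairs from chunks that are set in heading-size type."""
    results = []
    # Read top to bottom, then left to right (assumes y increases down the page).
    for chunk in sorted(chunks, key=lambda c: (c.y, c.x)):
        # Font size filters out body text that merely looks like a heading.
        if chunk.font_size < min_font_size:
            continue
        match = HEADING_RE.match(chunk.text.strip())
        if match:
            lot = int(match.group(1))
            price = int(match.group(2).replace(",", ""))
            results.append((lot, price))
    return results


sample = [
    Chunk(x=72, y=300, font_size=14, text="Lot 2 - Price: $40,000"),
    Chunk(x=72, y=120, font_size=9, text="Lot 9 - Price: $1,000"),  # body text
    Chunk(x=72, y=100, font_size=14, text="Lot 1 - Price: $25,000"),
]
print(parse_headings(sample))
```

The regex alone would also match the 9pt line; the font-size check is what separates real headings from lookalike body text, which is why that metadata is worth keeping alongside (x, y, text).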
[0] https://www.thoroughbreddailynews.com/getting-from-cease-and...