by UglyToad
2340 days ago
I had to check we hadn't worked for the same company! Yeah, text extraction and layout analysis from PDFs are a super interesting challenge and still relatively underdeveloped. I'd rate table detection as about the hardest problem in that field. One of the contributors to the PDF library I'm developing has been implementing some interesting algorithms for layout analysis: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Anal...

In the delicious pics of the results, I can see the bullets are treated as one column, and the paragraphs for each bullet point actually run together as a single chunk of text.
What do you think about tackling bullets and indents?