by willvarfar 2338 days ago
Really really interesting, hadn't seen pdfpig before!

In the delicious pics of the results, I can see the bullets are treated as one column, and the paragraphs for each bullet point actually run together as a single chunk of text?

What do you think about tackling bullets and indents?

1 comment

Thanks! I think there's definitely room for rules-based enhancements to the underlying algorithms.

My area of work on the project has been the core file-reading and file-creation stuff so I haven't had much of a chance to review the layout algorithm performance across documents.

Having worked on a purely rules-based approach in a private repository for a side project, I think the algorithms the contributor has implemented get you a lot closer to the correct result than starting from rules alone, but it definitely feels like adding some context-aware rules would get you all the way there. I'm not sure whether those rules would be in scope for the layout analysis project itself, or whether someone could take the open core and extend it, as I was attempting in my side project.