|
|
|
|
|
by willvarfar
2338 days ago
|
|
Really really interesting, hadn't seen pdfpig before! In the delicious pics of results I can see the bullets treated as one column and the paragraphs for each bullet point actually run together as single chunk of text? What do you think about tackling bullets and indents? |
|
My area of work on the project has been the core file-reading and file-creation stuff so I haven't had much of a chance to review the layout algorithm performance across documents.
Having been working on a purely rules-based approach in a private repository for a side project it seems like the algorithms the contributor has implemented get you a lot closer to the correct result than starting from rules alone but it definitely feels like adding some context-aware rules would get all the way there. I'm not sure whether they'd be in scope for the layout analysis project itself or someone could take the open-core and extend it, as I was attempting in my side project.