by bclavie
1230 days ago
Parsing PDFs (and PowerPoints) and breaking them into "askable" chunks is definitely something we've been looking into and are keen to roll out. If you'd like to talk more about your use case, feel free to drop us an email at the "reach out" address on the page!
|
The hard part would be parsing tables and other layout-dependent semantics. You usually start with text coordinates (like HTML elements with absolute positioning) and have to work backwards from that. I worked for some years on a project for a client that was full of edge cases, because whenever the input PDF (from a government agency) had a slight layout change, the parser would break. It took multiple iterations to make it robust enough.
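
To make the "work backwards from coordinates" step concrete, here is a minimal sketch in Python. It assumes some PDF extractor has already produced words with an x/y position; the `Word` type, the function names, and the tolerance values are all hypothetical and chosen only for illustration, not taken from any particular library or from the project described above.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word (hypothetical units, e.g. points)
    y: float  # vertical position of the word's line


def group_rows(words, y_tol=2.0):
    """Group words into rows when their y positions are within y_tol."""
    rows = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(rows[-1][0].y - w.y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    # Order the words of each row from left to right.
    return [sorted(r, key=lambda w: w.x) for r in rows]


def column_starts(rows, x_tol=5.0):
    """Guess column positions by clustering the left edges of all words."""
    starts = []
    for x in sorted(w.x for r in rows for w in r):
        if not starts or x - starts[-1] > x_tol:
            starts.append(x)
    return starts


def to_table(words, y_tol=2.0, x_tol=5.0):
    """Rebuild a grid of cells from positioned words (assumed heuristic)."""
    rows = group_rows(words, y_tol)
    starts = column_starts(rows, x_tol)
    table = []
    for r in rows:
        cells = [""] * len(starts)
        for w in r:
            # Assign each word to the nearest guessed column start.
            col = min(range(len(starts)), key=lambda i: abs(starts[i] - w.x))
            cells[col] = (cells[col] + " " + w.text).strip()
        table.append(cells)
    return table


words = [
    Word("Name", 10, 100), Word("Qty", 80, 100.5),
    Word("Apple", 10.4, 115), Word("3", 81, 114.6),
    Word("Pear", 10, 130),
]
table = to_table(words)
```

Even this toy version shows why small layout changes hurt: the result depends on hand-picked tolerances, a multi-word cell would be split into extra columns, and a shifted column or a wrapped cell would land in the wrong place.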