


	by bclavie 1230 days ago
	Parsing PDFs (and PowerPoints) and breaking them into "askable" chunks is definitely something we've been looking into and are keen to roll out. If you'd like to talk more about your use case, feel free to drop us an email at the "reach out" address on the page!

2 comments

giovannibonetti 1230 days ago

Since you are working with raw text, it shouldn't need too much effort. There are a bunch of open source tools to extract text from PDFs.

The hard part would be parsing tables and other layout-dependent semantics. You usually start with text coordinates (like HTML elements with absolute positioning) and have to work backwards from that. I worked for some years on a project for a client that was full of edge cases, because whenever the input PDF (from a government agency) had a slight layout change, the parser would break. It took multiple iterations to make it robust enough.
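
To make the "work backwards from coordinates" step concrete, here is a minimal sketch, not the parser described above. It assumes a hypothetical `Fragment` record holding each piece of text with its position, that `y` grows downward (as with absolutely positioned HTML elements), and that fragments on the same table row differ in `y` by at most a fixed tolerance. The names `Fragment`, `group_into_rows`, and `y_tolerance` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical record: one piece of extracted text and its position."""
    x: float  # left edge of the fragment
    y: float  # vertical position, assumed to grow downward
    text: str


def group_into_rows(fragments, y_tolerance=2.0):
    """Rebuild rows of text from positioned fragments.

    A fragment joins the current row when its y is within y_tolerance
    of the previous fragment placed in that row; otherwise it starts
    a new row. Each row is then ordered left to right by x.
    """
    rows = []
    for frag in sorted(fragments, key=lambda f: f.y):
        if rows and abs(frag.y - rows[-1][-1].y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]


# Fragments arrive in arbitrary order, with slightly uneven y values.
fragments = [
    Fragment(200, 10.4, "Amount"),
    Fragment(50, 10.0, "Name"),
    Fragment(50, 30.0, "Alice"),
    Fragment(200, 29.5, "42"),
]
table = group_into_rows(fragments)
```

The fixed tolerance is exactly the kind of assumption that a slight layout change can invalidate: tighter line spacing or a shifted column is enough to merge or split rows, which is one reason such parsers need several iterations to become robust.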

james-revisoai 1230 days ago

Don't want to jump on your brigade, but at AnyQuestions.ai we specialise in quality PDF and transcript processing for AI-answer purposes (supporting AI answers with citations, just like your tool for documentation). This comes from 3 years working on tech to parse lecture slides correctly, identifying semantic areas (e.g. what is a title, how bullet points are connected...). It turns out this is useful for semantic search and other uses of embeddings. You can verify this somewhat by viewing how transcripts get bunched if you upload a YouTube video, or by searching for PDF results (bullet points will be resolved to what they refer to, etc., as appropriate).

Would love to chat with you if you're up for it. You can run a test demo of our tool and contact us through the interface.