Hacker News
by nolok 492 days ago
I have lots of customer files, and I've looked through all these AI tools for something, paid or self-hosted or whatever, where I can point it at a folder of xlsx and pdf files and then ask "What's the end date of M Smith's contract?" or "How much does M Smith still owe?". I've been very disappointed: it's either very complicated, or the tools break down with non-text-based PDFs, or...

It feels to me that if you need to provide a schema, preprocess the data, and so on, then in the end all AI provides is a way to do some SQL in natural language. Yes, that's better, but it doesn't remove the actual pain point if you're a tech user.

Then again, maybe I'm wrong and either didn't find the right tool or didn't understand it.

Is what I'm looking for something that actually exists (and works, not just on simple cases)?

2 comments

I worked on this a bit 1-2 years ago. Back then, LLMs weren't really up to the task, but I found them OK for suggestions that a human double-checks. That brings us to the Ironies of Automation, though: human oversight of automation through a review process doesn't really work. It's a paper worth reading.

We tried several dedicated services for extracting structured data and factoids like that from documents: first Google Document AI, then a dedicated provider focusing solely on our niche. Back then, that gave the best results.

There wasn't enough budget to go deeper into this, so we just reverted to doing it manually. But I think a really cool way to do this would be to build a user-friendly UI where reviewers can see suggestions and the text snippets they were extracted from as they skim through the document, with a simple way to modify and accept them. I think that would scale the process quite a bit, basically by focusing the human's attention on the relevant parts of the document.
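To make that idea a bit more concrete, here is a minimal sketch of the data such a review UI could sit on. Everything in it is an assumption for illustration, not something we built: the `Suggestion` fields, the status names, and the example field name `contract_end_date` are all hypothetical. The point is only that every suggested value carries the snippet it came from, and that accepting or modifying it is a single step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    """One extracted value plus the evidence a reviewer needs to check it."""
    field: str                        # hypothetical field name, e.g. "contract_end_date"
    value: str                        # value proposed by the extractor
    snippet: str                      # text span the value was extracted from
    page: int                         # page on which the snippet appears
    status: str = "pending"           # pending | accepted | modified
    final_value: Optional[str] = None # what ends up in the record after review


def accept(s: Suggestion) -> None:
    """Reviewer confirms the proposed value as-is."""
    s.status = "accepted"
    s.final_value = s.value


def modify(s: Suggestion, corrected: str) -> None:
    """Reviewer overrides the proposed value with a corrected one."""
    s.status = "modified"
    s.final_value = corrected


def review_queue(suggestions: list[Suggestion]) -> list[Suggestion]:
    """Pending suggestions in page order, so the reviewer skims the document once."""
    pending = [s for s in suggestions if s.status == "pending"]
    return sorted(pending, key=lambda s: s.page)
```

A UI would walk `review_queue(...)`, show each `snippet` highlighted in the document next to the proposed `value`, and call `accept` or `modify` on a keypress.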

I haven't worked in this space since then, but I'm pretty bearish on fully automated fact extraction. Getting things in contracts and invoices wrong is typically not acceptable. I think a solid human-in-the-loop approach is probably still the way to go.

I'm not completely up to date, but a few months ago Qwen2-VL (runnable locally) was able to perfectly read text from images. So I'd say you would still need to preprocess that folder into text to get reasonable query speed, but after that, if you feed the data to an LLM with a long enough context, it should just work. If, on the other hand, there's too much data and the LLM has to use tools, then it is indeed still too soon. But it is coming.
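Roughly what I mean by "preprocess, then feed it all in", as a sketch under assumptions: it presumes the folder has already been converted to `.txt` files (by OCR, a vision model, or whatever), and the `max_chars` budget is a made-up stand-in for the model's real context limit. The function names are mine, and the actual model call is left out on purpose since it depends on what you run.

```python
from pathlib import Path


def load_texts(folder: str) -> dict[str, str]:
    """Read every pre-extracted .txt file under folder, keyed by relative path."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): p.read_text(encoding="utf-8", errors="replace")
        for p in sorted(root.rglob("*.txt"))
    }


def build_prompt(texts: dict[str, str], question: str, max_chars: int = 400_000) -> str:
    """Concatenate documents under file-name headers until the character budget is hit."""
    parts = []
    used = 0
    for name, body in texts.items():
        block = f"=== {name} ===\n{body.strip()}\n"
        if used + len(block) > max_chars:
            break  # past this point you need retrieval or tools, not one big prompt
        parts.append(block)
        used += len(block)
    instructions = "Answer using only the documents above and cite the file name."
    return "".join(parts) + f"\nQuestion: {question}\n{instructions}"
```

If `build_prompt` has to stop early because the folder does not fit the budget, that is exactly the "too much data" case where the model would need tools.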