Hacker News
by celestialcheese 1124 days ago
If you’ve never built PDF or archive document parsing systems, you don’t know true pain.

I see it as incredible. Most PDFs that I see are basically just thin wrappers around image scans of documents that don’t exist anywhere else anymore: archives from estates, manuals, etc.

These techniques of using LLMs to clean OCR output are game-changing, because the best in class before was human-in-the-loop systems that required huge amounts of rewriting to get usable output.

Now LLMs are unlocking previously difficult data sources at significantly lower cost.