Hacker News
by celestialcheese 1124 days ago
Claude 100k 1.3 blew me away.

I gave it the task of extracting a specific column of information, using just the table header column text, from a table inside a PDF, with text extracted using tesseract and no extra layers on top. (For those who haven't tried extracting tables with OCR, it's a non-trivial problem, and the output is a mess.)

With more than 40k tokens in context, it extracted the data with 100% accuracy.

Changing the prompt to target a different column from the same table worked perfectly as well. When I changed a character in the table in the OCR context to test whether it was somehow hallucinating, it accurately extracted the new data too.
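
The perturbation check described above can be sketched roughly as follows. This is only an illustration, not the poster's actual harness: `extract` is a hypothetical callable standing in for whatever sends the OCR text plus the prompt to the model and returns the extracted column values, and `perturb` and `reads_from_context` are names made up here.

```python
from typing import Callable


def perturb(text: str, old: str, new: str) -> str:
    """Replace one cell value; fail loudly if it is absent or ambiguous."""
    if text.count(old) != 1:
        raise ValueError(f"expected exactly one occurrence of {old!r}")
    return text.replace(old, new)


def reads_from_context(
    extract: Callable[[str], list[str]],  # hypothetical: wraps the model call
    ocr_text: str,
    old: str,
    new: str,
) -> bool:
    """Return True if an edit to the context shows up in the extracted values."""
    before = extract(ocr_text)
    after = extract(perturb(ocr_text, old, new))
    # The original value must be found first, and the edited value must replace it.
    return old in before and new in after and old not in after
```

If the model were reciting memorized or invented values instead of reading the context, the edited value would not appear in the second extraction and the check would return `False`.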

One of those "Jaw to the floor" moments for me.

Did the same task in GPT-4 (limiting the context window to 8k tokens), and it worked, but at ~4x the cost, and without being able to feed it the whole document.


Using LLMs with 100GB VRAM to convert PDFs to CSVs is truly depressing, but I am sure many companies will love it.

2023 office software already uses 1000x more resources than 1990s software did. I bet we are ready to do that again.

Not just PDFs with tables. It works on any semi-structured document with key-value pairs like invoices, purchase orders, receipts, tickets, forms, error messages, logs, etc.

The "Information Extraction from semistructured and unstructured documents" task is seeing a huge leap. Just 3 years ago it was very tedious to train a model to solve a single use case. Now they all work.

But if you do make the effort to train a specialised model for a single document type, the narrow model surpasses GPT-3.5 and GPT-4.

Consulting companies are paying juniors > $150k per year to do this kind of thing. In some objective sense, it's absurd, but locally, it makes more sense to use an expensive GPU than an MBA class president. And in 10 years, everyone's phone will have that much compute anyway.

It's funny, but React/Node/Electron apps will suddenly look minimalist once everyone and his brother starts adding a neural model that consumes 10GB of V/RAM to their app.

You're missing the developer time. You no longer have to spend hours (or days, perhaps weeks depending on the sources) stringing together random libs, munging and cleaning data, testing, etc.

I agree, computers are cheaper than engineers.

But I wonder how much more productive our economies could be if everyone was taught programming the same way we teach reading & writing, and open standards were ubiquitous.

> wonder how much more productive our economies could be if everyone was taught programming

Prompt engineering is turning coding problems into language problems. It’s conceivable that humans writing code becomes artisanal in a century.

> humans writing code becomes artisanal in a century.

At the pace we're moving now, we're talking a few decades away at most, well within most people's career spans. I feel sorry for any junior coder just entering the industry.

Coding problems have always been language problems.

> Coding problems have always been language problems

Pedantically, sure. The field ChatGPT is most impactfully commoditizing is low-level coding. Instead of someone giving natural language instructions to a team of humans, they're increasingly able to give them to an LLM. It's an open question how far this can scale. But we may be near the zenith of the practicality of large-scale coding expertise.

If you’ve never built PDF or archive document parsing systems, you don’t know true pain.

I see it as incredible. Most PDFs that I see are basically just thin wrappers around image scans of documents that don't exist anywhere anymore. Archives from estates, manuals, etc.

Using LLMs to clean OCR output is game-changing, because the best in class before was human-in-the-loop systems that required huge amounts of rewriting to get usable output.

Now LLMs are unlocking previously difficult data sources relatively cheaply.

On YouTube there are timer and stopwatch videos that have millions of views. People are streaming 1080p video for something that can be implemented locally in 20 lines of code. But does it really matter? It won't make a dent in Google's revenue.
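
For scale, a local countdown timer along those lines might look like the sketch below. It is just one possible version; the function names are made up for illustration.

```python
import time


def format_elapsed(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def countdown(seconds: int, tick: float = 1.0) -> None:
    """Redraw the remaining time once per tick until it reaches zero."""
    for remaining in range(seconds, -1, -1):
        print(f"\r{format_elapsed(remaining)}", end="", flush=True)
        if remaining:
            time.sleep(tick)
    print("\nTime's up")
```

Calling `countdown(300)` would run a five-minute timer in the terminal, with no video stream involved.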

If LLMs are deployed at large enough scale, the convenience really could justify the cost.

We also had more secretaries and people who just retyped things all day in the '90s!

It's worth double for the increase in accuracy. Don't make me go to the poor souls on Amazon Mechanical Turk.

https://en.wikipedia.org/wiki/Amazon_Mechanical_Turk

The better version of this is using this massive LLM to _create a program_ that can then extract the same data from similar PDFs. That way the high cost is incurred only once.
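
As a rough idea of what such a generated program could look like, here is a sketch of a deterministic extractor. It assumes a layout that may not hold for real documents: one fused header line, followed by tab-separated rows made of a row label and then one cell per column. `extract_column` and its parameters are hypothetical names, and the column headers are passed in by hand because OCR tends to run them together.

```python
def extract_column(ocr_text: str, headers: list[str], target: str) -> dict[str, str]:
    """Map each row label to the cell under the target column header."""
    index = headers.index(target)
    values: dict[str, str] = {}
    for line in ocr_text.splitlines()[1:]:  # skip the fused header line
        cells = line.split("\t")
        if len(cells) < len(headers) + 1:  # row label plus one cell per column
            continue  # skip rows the OCR mangled (assumption: safer than guessing)
        label, data = cells[0], cells[1:]
        values[label.strip()] = data[index].strip()
    return values
```

Once written, a parser like this runs for free on every similar document, but it silently drops rows that do not match the assumed layout, so it still needs checking against real OCR output.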

> text extracted using tesseract

You're saying 'the text' without normalizing the rows and columns (basically the tab, space or newline delimited text with sporadic lines per row) was all you needed to send? I still have to normalize my tables even for GPT-4, I guess because I have weird merged rows and columns that attempt to do grouping info on top of the table data itself.

Exactly. Just sent raw tesseract output, no formatting or "fix the OCR text" step. So the data looked like:

```
col1col2col3\nrow label\tdatapoint1\tdatapoint2...
```

Very messy.

I don't think this generalizes with the same 100% accuracy to any OCR output (it can be _really_ bad). I'm still planning on doing a first pass with a better table OCR system like Textract, DocumentAI, PaddlePaddle Table, etc., which should improve accuracy.

That’s still super cool!

Yeah, my use cases are in the really bad category. I've been building parsers for a while, and I've basically given up and fallen back to manually stating rows of interest with if-present logic. Camelot got so close, but I ended up building my own control layer on top of pdfminer.six to accommodate my tables (I'd recommend Camelot if you're still exploring). It absolutely sucks needing to be so specific out of the gate, but at least the context rarely changes.

What is the source of these nasty docs? I am also working on a layer above pdfminer.six to parse tables. It seems like this task is never done. LLMs have had mixed results for me too. I am focused on documents containing invoices, income statements, etc from the real estate industry.

My email is in my profile if you want to reach out and compare notes!

Better: you can do it by copy-pasting from a PDF to GPT on your phone! https://twitter.com/swyx/status/1610247438958481408

Definitely tried that way too; it didn't work. My tables are pretty dang dumb: merged cells, confidence intervals, weird characters in the cell field that change based on the row values and mess up a simple regex test. It really needs a billion-dollar-company solution, but I'm about to punt it to the moon because it's never fully done.

What was the dollar cost to do this work? Iterating over a 40k context must be expensive.

~$0.45
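
The two figures quoted in the thread (about $0.45 for a context of roughly 40k tokens) imply an approximate per-token rate. The arithmetic below is only a back-of-the-envelope sketch: it treats the whole cost as input tokens and uses exactly 40,000 tokens, which are simplifying assumptions and not numbers from any price list.

```python
def implied_rate_per_million(total_cost_usd: float, tokens: int) -> float:
    """Back out a per-million-token rate from a total cost and a token count."""
    return total_cost_usd / tokens * 1_000_000


# Figures from the thread; 40_000 is a lower bound ("more than 40k tokens"),
# so the true rate would be at or below this value.
rate = implied_rate_per_million(0.45, 40_000)
print(f"about ${rate:.2f} per million tokens")  # about $11.25
```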