HN Mirror

by llm_trw 497 days ago

>So let me get this straight- we are going to train AI models to perform screen recognition of some kind (so it can ascertain layout and detect the "important" ui elements), and additionally ask that AI to OCR all text on the screen so it has some hope of being able to follow some natural language instructions (OCR being a task which, as a HN thread a day or two ago pointed out, AI is exceedingly bad at), and then we're going to be able to tell this non-deterministic prediction engine what we want to do with our software, and it's just going to do it?

AI is amazing at OCR. We've had Tesseract OCR for 40 years, and if you read the fine manual it has essentially a 0% error rate per character.

OCR on VLMs is terrible.

For some reason, consistent x-heights between 10 and 30 pixels with a guaranteed mono-column layout is not something venture capitalists get excited about, and as a result I'm not the founder of a unicorn.
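To make the x-height constraint concrete, here is a minimal sketch of the preprocessing arithmetic it implies: given a measured x-height, pick a resize factor that lands the text inside the range. The 10 to 30 pixel bounds are taken from the comment above rather than from any manual, and `scale_for_x_height` is a hypothetical helper, not part of Tesseract.

```python
def scale_for_x_height(x_height_px: float,
                       low: float = 10.0,
                       high: float = 30.0) -> float:
    """Return a resize factor that brings the x-height into [low, high].

    The default bounds are the range quoted in the comment (an assumption,
    not a documented Tesseract setting). Text already in range is left
    alone; anything else is scaled toward the midpoint of the range.
    """
    if x_height_px <= 0:
        raise ValueError("x-height must be positive")
    if low <= x_height_px <= high:
        return 1.0
    target = (low + high) / 2.0
    return target / x_height_px


# Tiny 5 px text would be enlarged 4x; 80 px text would be shrunk to a quarter.
print(scale_for_x_height(5))   # 4.0
print(scale_for_x_height(80))  # 0.25
```

The factor would then be fed to whatever image library does the resize before the page is handed to the OCR engine.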

caspper69 497 days ago

Ok, I will need to work on my reading comprehension skills.

That being said, I thought the purpose of OCR was to take text from a non-digital source and make it digital.

Why should we have to OCR something that already exists in a perfectly interchangeable digital format?

mdaniel 497 days ago

> Why should we have to OCR something that already exists in a perfectly interchangeable digital format?

I'm with you in spirit, but in this specific context I think it's because the alternative would require the ~~LLM~~ Agent to be an HTML parser, or be bright enough to write itself a Scrapy crawler. I suspect folks decided it's cheaper (by some metric) to just use the normal browser machinery to render 45MB worth of HTML, JS, CSS, Cloudflare Spooge, etc. into a PNG and then rip the actual content out of that.

I was also going to offer as a counterexample: PDF
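For comparison, the "be an HTML parser" alternative can be sketched with nothing but the Python standard library. This is an illustration under stated assumptions, not what any agent framework actually does: `VisibleTextExtractor` and `extract_text` are hypothetical names, and the sketch only drops `<script>` and `<style>` contents.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the text content of an HTML string, joined by single spaces."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
)
print(extract_text(page))  # Hello world
```

The catch is that this never executes JavaScript, so anything a page builds client-side is invisible to it, which is part of why rendering in a real browser first can look attractive.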

caspper69 497 days ago

Not everyone does their work in a web browser.

And even still, you don’t have to parse raw markup to grab properties from DOM elements. That could be handled by a browser plugin coupled with some user-guided training.

PDF is another beast entirely. I think there’s already a whole thread about that going on now. I’m going to zip my lips. I’m still waiting on Adobe to return my call from two years ago inquiring about the licensing costs of their parsing library for a small shop. Good thing I wasn’t relying on them to get that project done, and thank goodness for oss.
