by caspper69
497 days ago
Ok, I will need to work on my reading comprehension skills. That being said, I thought the purpose of OCR was to take text from a non-digital source and make it digital. Why should we have to OCR something that already exists in a perfectly interchangeable digital format?
I'm with you in spirit, but in this specific context I think it's because the alternative would require the ~~LLM~~ Agent to be an HTML parser, or be bright enough to write itself a Scrapy crawler. I suspect folks decided it's cheaper (by some metric) to just use the normal browser machinery to render 45MB worth of HTML, JS, CSS, Cloudflare Spooge, etc. into a PNG and then rip the actual content out of that.
I was also going to offer PDF as a counterexample.
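For what it's worth, here is a rough sketch of what the "be an HTML parser" route might look like with nothing but Python's standard library. The names `VisibleText` and `visible_text` are made up for illustration, and it only sees text that is already in the markup: it does not run JS, which is presumably part of why rendering in a real browser looks attractive in the first place.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring anything inside script/style/noscript."""

    SKIP = {"script", "style", "noscript"}  # tags whose contents are not page text

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while we are inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text content of an HTML string, joined by single spaces."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    )
    print(visible_text(page))  # prints: Title Hello world
```

This works on static markup, but anything injected client-side never shows up, so it is not a drop-in replacement for screenshot-and-OCR on JS-heavy pages.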