Hacker News
by allanrbo 731 days ago
Been thinking... Why is AI even needed for a thing like this? Aren't there assistive tech APIs for screen readers for people who are blind, that could be used to scrape all text from all windows, without resorting to OCR? Or maybe Electron apps break all of that...
5 comments

Depending on how you mean "AI", I can see 2 uses:

* First, yes you can often use a11y APIs, but OCR is likely to be useful/needed for apps that don't play well with those for whatever reason.

* Once you have the text, I could potentially see value in AI/LLM fuzzy searching; it's easier to say "find something I saw about a new programming language yesterday" than "search for... oh what was the title of that page? Oh, I'll just search for 'language' and hope it was mentioned..."
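To make the second point concrete, here is a minimal sketch of searching captured screen text by a vague description plus a day. It is not how any real product does it: bag-of-words cosine similarity stands in for the LLM/embedding search described above, and the names `tokens`, `cosine`, `search` and the `(day, text)` snapshot shape are all made up for illustration.

```python
import math
import re
from collections import Counter
from datetime import date


def tokens(text):
    """Lowercased word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(snapshots, query, on_day=None, top_k=3):
    """Rank (day, text) snapshots against a free-form query.

    snapshots: list of (datetime.date, str) pairs (hypothetical capture format).
    on_day: optional date filter, e.g. "yesterday".
    Returns up to top_k (score, day, text) tuples, best match first.
    """
    q = tokens(query)
    scored = [
        (cosine(q, tokens(text)), day, text)
        for day, text in snapshots
        if on_day is None or day == on_day
    ]
    scored = [s for s in scored if s[0] > 0]  # drop snapshots with no overlap
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Made-up captures, only to show the call shape.
    captured = [
        (date(2024, 6, 1), "Gleam 1.0: a new programming language for the BEAM"),
        (date(2024, 6, 1), "Quarterly budget spreadsheet"),
    ]
    for score, day, text in search(
        captured, "something about a new programming language", on_day=date(2024, 6, 1)
    ):
        print(f"{score:.2f} {day} {text}")
```

Word overlap only matches when the query shares words with the page, which is exactly the limitation the comment complains about. A real embedding model would slot in where `tokens` and `cosine` are.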

There's probably no body of work more impressive to me than karlicoss's work to capture data from the various devices & systems they use. The map of infrastructure shows off the architecture here, what it takes to read your own systems: https://beepb00p.xyz/myinfra.html . The overall blog post/digital gardening plot on the topic is probably https://beepb00p.xyz/sad-infra.html .

There's something remarkably smart about OCR as a failsafe that gets around all technical problems. karlicoss's work shows the extensiveness of reading out data, and even that will have various limitations with what the datastores choose to encode. Simply building super-agency atop the actual agency directly afforded us real humans, by dealing with the screen as interface, has a certain elegance to it (in a world gone mad with difficult unyielding technologies).

we actually patented that :)

https://patents.google.com/patent/US8214367B2/en

"a context recorder that uses accessibility mechanisms to record context information derived independently of screen-images"

then again, we arguably patented much of what makes up recall.

because RAG, at least in theory, makes it really easy to parse out meaning from the data; that's the AI part

Assistive APIs require active participation from all software, and require that software to use them properly, so they create as many obstacles as there are programs (or actually, UI elements) on a computer.

AI-based OCR does not; there is only a single obstacle, and that is the AI OCR itself.

With the former you need every developer who ever contributed/will contribute to everything that ends up on your screen to participate properly. With the latter you only need the developers of AI OCR to participate, and then it can work with anything exposed on screen regardless of whether the program uses (or even can use) assistive APIs.

While the latter is obviously much harder, once (if) it works you get consistent results for everything, and even if the results are consistently mediocre, they're better than consistently absent.

IMO, even aside from things like recall etc., assistive software will be taking advantage of AI and OCR in the future so it can work regardless of what the individual programs do.
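A rough sketch of what "work regardless of what the individual programs do" could look like: try the accessibility tree first and fall back to OCR when the app exposes nothing useful. The two reader callables are hypothetical placeholders, not real APIs; in practice they would wrap something like a platform accessibility API and an OCR engine.

```python
from typing import Callable, Optional, Tuple

# Hypothetical reader type: takes a window identifier, returns its text or None.
TextReader = Callable[[str], Optional[str]]


def capture_text(
    window_id: str, read_a11y: TextReader, read_ocr: TextReader
) -> Tuple[str, str]:
    """Return (source, text), preferring accessibility data over OCR.

    Falls back to OCR when the accessibility reader raises, returns None,
    or returns only whitespace (e.g. an app that exposes an empty tree).
    """
    try:
        text = read_a11y(window_id)
    except Exception:
        # Assumption: any failure in the a11y path just means "not available".
        text = None
    if text and text.strip():
        return ("a11y", text)
    return ("ocr", read_ocr(window_id) or "")


if __name__ == "__main__":
    # Stub readers standing in for real integrations.
    def well_behaved_app(window_id: str) -> Optional[str]:
        return "File  Edit  View"

    def opaque_app(window_id: str) -> Optional[str]:
        return None  # exposes nothing through accessibility APIs

    def fake_ocr(window_id: str) -> Optional[str]:
        return "text recognized from pixels"

    print(capture_text("w1", well_behaved_app, fake_ocr))
    print(capture_text("w2", opaque_app, fake_ocr))
```

The point of returning the source alongside the text is that OCR output is likely noisier, so downstream code may want to treat the two differently.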