HN Mirror

by caspper69 498 days ago

Continuing on with my "old man yells at cloud" meme of late, here's my hot take:

So let me get this straight- we are going to train AI models to perform screen recognition of some kind (so it can ascertain layout and detect the "important" ui elements), and additionally ask that AI to OCR all text on the screen so it has some hope of being able to follow some natural language instructions (OCR being a task which, as a HN thread a day or two ago pointed out, AI is exceedingly bad at), and then we're going to be able to tell this non-deterministic prediction engine what we want to do with our software, and it's just going to do it?

Like Homer Simpson's button pressing birdie toy? :smackshead:

Why do I have reservations about letting a non-deterministic AI agent run my software?

Why not expose hooks in some common format for our software to perform common tasks? We could call it an "application programming interface". We might even insist on some kind of common data interchange format. I hear all the cool people are into EBCDIC nowadays.

Then we could build a robust and deterministic tool to automate our workflows. It could even pass structured data between unrelated applications in a secure manner. Then we could be sure that the AI Agent will hit the "save the world" button instead of the "kill all humans" button 100% of the time.
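A minimal sketch of what such a deterministic hook could look like, assuming JSON as the interchange format. The registry, the `action` decorator, the `dispatch` function, and the `invoice.total` action name are all hypothetical, made up for illustration; they are not from any real product.

```python
import json
from typing import Callable, Dict

# Hypothetical registry: maps a fixed action name to a plain Python handler.
ACTIONS: Dict[str, Callable[[dict], dict]] = {}


def action(name: str):
    """Register a handler under a fixed, documented action name."""
    def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        ACTIONS[name] = fn
        return fn
    return register


@action("invoice.total")
def invoice_total(payload: dict) -> dict:
    # Amounts are integer cents so the result does not depend on float rounding.
    total = sum(item["qty"] * item["unit_cents"] for item in payload["items"])
    return {"total_cents": total}


def dispatch(request_json: str) -> str:
    """Run exactly the action named in the request, or fail loudly."""
    request = json.loads(request_json)
    name = request["action"]
    if name not in ACTIONS:
        # Unknown actions are rejected rather than guessed at.
        raise KeyError(f"unknown action: {name}")
    result = ACTIONS[name](request.get("payload", {}))
    return json.dumps(result, sort_keys=True)


if __name__ == "__main__":
    req = '{"action": "invoice.total", "payload": {"items": [{"qty": 2, "unit_cents": 150}]}}'
    print(dispatch(req))
```

The same request always produces the same response, and a request for an action that was never registered raises an error instead of picking the nearest-looking button.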

On a serious note, we should study various macro recording implementations, to at least have a baseline of what people have been successfully doing for 40-odd years to automate their workflows, and then come up with an idea that doesn't involve investing in a new computer and GPU, and slowly boiling the oceans.

This reeks of a solution in search of a problem. And the solution has the added benefit of being inefficient and unreliable. But, people don't get billion dollar valuations for macro recorders.

Is this what they meant by "worse is better"?

Edit: and for the love of FSM, please do not expose any new automation APIs to the network.

4 comments

rglover 498 days ago

Thank you. My thoughts exactly. Specifically the "you want me to trust mission-critical business logic to a Frankenstein mess of non-deterministic 'agents'?!"

The scariest part is, as this advances, the level of disasters we're likely to see will at best be bankrupt corporations, and at worst, people being hurt/killed (depending on how carelessly these tools are integrated into mission critical systems).

svilen_dobrev 498 days ago

Check https://news.ycombinator.com/item?id=42974429 from a few days ago. The OP was re-advertising OAuth, but another idea might be that a new kind of interface is needed: application agentic interfaces, standing in the middle between the API (Programming), which is too detailed, and the AHI (Human) screens/forms, which are too human-targeted. IMO.

caspper69 498 days ago

I propose the Open Agent Interface.

We can call it OpenAI.

I'll see myself out.

svieira 498 days ago

We could also call it the Open Agent Permissions Interface or OpenAPI for short.

heroprotagonist 497 days ago

*Open as in, many people can pay to use it. Not ALL, that would be ridiculous. And certainly not forever. Once it starts to work correctly, no more buzz will be needed to drive investment and you'll then be slowly cut off or price starved out of most functionality.

Terr_ 498 days ago

> Like Homer Simpson's button pressing birdie toy? :smackshead:

This comparison is especially apt, given that one of the main use-cases for LLMs is the same kind of... well, fraud: To give the illusion that you did the work of understanding or reviewing something, but actually just (smart-)phoning it in.

In one Apple iPhone advertisement, a famous actor is asked by their agent what they think of a script. They didn't read it, so they ask the LLM assistant to sum it up in a couple of sentences... and then they tell their agent it sounds good.

caspper69 498 days ago

I think my quip about the toy flew over a lot of heads, so I appreciate that someone got it.

The reality is that most applications and websites don’t expose enough context about the what of what you’re actually doing for AIs to be able to meaningfully infer from natural language the steps required to complete a given task.

We humans are very good at filling in the blanks based on if we’re working in Photoshop or VS Code or Excel. We infer a lot of context from the specific files we’re working on or the particular client or even the files’ organization within the file system, or even what month or day it is.

I am skeptical that models will be able to replicate a complex workflow when there’s very little in the way of labels and UI controls even visible.

I know a weekly spreadsheet from a monthly and quarterly, etc. I know the minutiae about which options to use to generate the specific source reports, etc.

Workflows can be quite complex, no matter your role.

I mean I can just see it now: gift receipts being sent to the recipient before their birthday, internal draft proposals prematurely sent to clients, mixing up clients or commingling their data, overwriting or losing data; this whole thing just screams disaster. And I’m not even thinking about people involved with safety, or finance, or legal/regulatory, or medical. Law enforcement?

This kind of thing can be done properly with well defined interfaces, common standards, and reasonable and prudent guardrails.

But it won’t be. It’ll be YOLOed on a paper thin training budget and it’ll be like your own little personal chaos monkey on ketamine.

Terr_ 498 days ago

> I am skeptical that models will be able to replicate a complex workflow when there’s very little in the way of labels and UI controls even visible.

Also, at least from the perspective of internal business software, a significant part of it is trying to get people to know what they're doing. There's a domain-model that's being taught at the same time, and it's institutionally-important that they are cognizant and aware of what they're agreeing to. Together this tends to lead to an arrangement of multiple screens, confirmation boxes, etc.

Many individuals instinctively dislike this, and it'll be one of their first choices for "let my LLM assistant do it."

> I mean I can just see it now

Before these LLMs, I felt like Idiocracy had become politically prescient, but now it feels like I actually see a technology that could enable it.

caspper69 498 days ago

Brawndo is coming- it's got electrolytes and IT'S WHAT PLANTS CRAVE!

Life imitates art indeed.

I am sympathetic to wanting to automate complex workflows. Hell, I'm sympathetic to wanting to automate simple workflows. In fact, I bitch about the stupidity of the things I do at least once a week (no, you see, I take the numbers that show on this monitor, and I type them into a box on that monitor; why no cut & paste? faster to re-type the numbers; sigh).

But people provide context. Sure, an AI might tell you utility costs were up last quarter, but it won't know it was because of a water leak that went unnoticed and tripled the bill. Or it will tell you that wages were up, but not that it was because Bill from Operations had hernia surgery and we had to bring on a temp for 2 months. And it certainly won't tell you that Jim's back on the sauce, so we should probably begin putting out feelers for a new salesman.

So much of what business does is tracking metrics, yes, but the numbers never tell the whole story. There's always a backstory. Things that just can't be captured in raw data and hence can't be summarized by an AI. And AIs can't keep the ship sailing. Every small business has the guy/girl that does all the little things for everyone that absolutely holds the whole damn thing together. I'm not a BigCorp guy, but I imagine most departments are similar.

How about customer feedback? How can a model distill valuable (actionable) meaning from disparate communication mediums other than superficial high-level conclusions?

Expectations are just not realistic right now. There's going to be a lot of disappointment.

llm_trw 498 days ago

>So let me get this straight- we are going to train AI models to perform screen recognition of some kind (so it can ascertain layout and detect the "important" ui elements), and additionally ask that AI to OCR all text on the screen so it has some hope of being able to follow some natural language instructions (OCR being a task which, as a HN thread a day or two ago pointed out, AI is exceedingly bad at), and then we're going to be able to tell this non-deterministic prediction engine what we want to do with our software, and it's just going to do it?

AI is amazing at OCR. We've had Tesseract OCR for 40 years, and if you read the fine manual it has essentially a 0% error rate per character.

OCR on VLMs is terrible.

For some reason consistent x-heights between 10 and 30 pixels with guaranteed mono-column layout is not something venture capitalists get excited about, and as a result I'm not the founder of a unicorn.

caspper69 498 days ago

Ok, I will need to work on my reading comprehension skills.

That being said, I thought the purpose of OCR was to take text from a non-digital source and make it digital.

Why should we have to OCR something that already exists in a perfectly interchangeable digital format?

mdaniel 498 days ago

> Why should we have to OCR something that already exists in a perfectly interchangeable digital format?

I'm with you in spirit, but in this specific context I think it's because the alternative would require the ~~LLM~~ Agent to be an HTML parser, or be bright enough to write themselves a Scrapy crawler. I suspect folks decided it's cheaper (by some metric) to just use the normal browser machinery to render 45MB worth of HTML, JS, CSS, Cloudflare Spooge, etc into a PNG and then rip the actual content out of that
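For a sense of scale, here is a rough sketch of the "be an HTML parser" route using only Python's standard-library `html.parser`. The `ControlExtractor` class and `extract_controls` function are hypothetical names for illustration, and the sketch only sees static markup: anything a page builds with JavaScript after load would be missed, which may be part of why rendering to an image looks attractive.

```python
from html.parser import HTMLParser


class ControlExtractor(HTMLParser):
    """Collect inputs and buttons along with their identifying attributes."""

    def __init__(self):
        super().__init__()
        self.controls = []
        self._open_button = None  # button currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            self.controls.append({
                "tag": "input",
                "name": attrs.get("name"),
                "type": attrs.get("type", "text"),
            })
        elif tag == "button":
            self._open_button = {"tag": "button", "id": attrs.get("id"), "text": ""}

    def handle_data(self, data):
        # Text between <button> and </button> becomes the button's label.
        if self._open_button is not None:
            self._open_button["text"] += data

    def handle_endtag(self, tag):
        if tag == "button" and self._open_button is not None:
            self._open_button["text"] = self._open_button["text"].strip()
            self.controls.append(self._open_button)
            self._open_button = None


def extract_controls(html: str) -> list:
    """Return the form controls found in a static HTML string, in document order."""
    parser = ControlExtractor()
    parser.feed(html)
    parser.close()
    return parser.controls


if __name__ == "__main__":
    page = '<form><input name="email" type="email"><button id="send"> Send </button></form>'
    for control in extract_controls(page):
        print(control)
```

No pixels or OCR are involved: the names, types, and labels come straight from the markup.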

I was also going to offer as a counterexample: PDF

caspper69 498 days ago

Not everyone does their work in a web browser.

And even still, you don’t have to parse raw markup to grab properties from DOM elements. That could be handled by a browser plugin coupled with some user-guided training.

PDF is another beast entirely. I think there’s already a whole thread about that going on now. I’m going to zip my lips. I’m still waiting on Adobe to return my call from two years ago inquiring about the licensing costs of their parsing library for a small shop. Good thing I wasn’t relying on them to get that project done, and thank goodness for oss.
