| HN Mirror



	by mattfrommars 315 days ago
	Looks great for automating workloads in Windows desktop applications. I'd love to understand more deeply how your application works. So the set of commands your backend sends is click, scroll, and screenshot. Does it also send a command to, say, type characters into an input field? How is it able to pinpoint a text field from a screenshot? Is an LLM reliable at pinpointing the x and y coordinates to click on a field? Also, at large scale, does it become prohibitively expensive to run daily across thousands of custom workflows? I assume this runs in the cloud.

1 comments

sgtwompwomp 315 days ago

Thanks! And yes, so our pathfinder agents utilize Sonnet 4's precise coordinate generation capabilities. You give it a screenshot, give it a task, and it can output exact coordinates of where to click on an input field, for example.

And yes we've found the computer use models are quite reliable.

Great questions on scale: we designed our engine so that in the happy path, we actually make very few LLM calls. The agent runs deterministically, only checking at various critical spots whether an anomaly occurred (if one did, we fall back to computer use to take it home). If not, our system can complete an entire task end to end for less than $0.0001.

So it's a hybrid system at the end of the day. This results in really low costs at scale, as well as speed and reliability improvements (since in the happy path, we run exactly what has worked before).
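For readers who want to picture the hybrid control flow described above, here is a minimal sketch. It is not Cyberdesk's actual code: `Step`, `run_hybrid`, and the three callbacks (`execute`, `looks_anomalous`, `agent_fallback`) are hypothetical names, and the choice to hand the remaining steps to the fallback agent is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One recorded UI action (hypothetical structure, for illustration only)."""
    action: str                       # e.g. "click", "type", "scroll"
    args: dict = field(default_factory=dict)
    checkpoint: bool = False          # run an anomaly check after this step?


def run_hybrid(
    steps: List[Step],
    execute: Callable[[Step], None],
    looks_anomalous: Callable[[], bool],
    agent_fallback: Callable[[List[Step]], None],
) -> str:
    """Replay steps deterministically; on the first anomaly, hand over to the agent.

    Returns "replayed" if every step ran on the happy path, or "fallback"
    if the (assumed) computer-use agent had to finish the task.
    """
    for i, step in enumerate(steps):
        execute(step)  # deterministic: no LLM involved here
        # Only the marked "critical spots" pay for a check.
        if step.checkpoint and looks_anomalous():
            # Assumption: the agent receives the not-yet-run steps as a hint.
            agent_fallback(steps[i + 1:])
            return "fallback"
    return "replayed"
```

In this sketch the only LLM cost on the happy path comes from `looks_anomalous`, which runs only at steps flagged with `checkpoint=True`.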


mattfrommars 311 days ago

Thanks for the reply. I lost you in this part: "Great questions on scale: the whole way we designed our engine is that in the happy path, we actually use very little LLMs. The agent runs deterministically, only checking at various critical spots if anomalies occurred (if it does, we fallback to computer use to take it home)"

I assume you send a screenshot to Claude to get the next action to take. How are you able to cut down on this exact step by working deterministically? What is the deterministic part, and how do you figure it out?


sgtwompwomp 310 days ago

So what I meant is this: When you run our Cyberdesk agent the first time, it runs with the computer use agent. But then once that’s complete, we cache every exact step it took to successfully complete that task (every click, type, scroll) and then simply replay that the next time.

But during that replay, we do bring in smaller LLMs just to check whether anything unexpected happened (like a popup). If so, we fall back to computer use to take it home.

Does that make sense? At the end of the day, our agent compiles down to PyAutoGUI, with smart fallback to the agent if needed.
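To make the "cache every step, then replay" idea concrete, here is a rough sketch of what such a cache could look like. The step format, the `record` and `replay` helpers, and the dict-based cache are all assumptions for illustration, not the real implementation. The `gui` argument can be any object exposing `click(x=..., y=...)`, `write(text)`, and `scroll(clicks)`, which is the shape of the real `pyautogui` module's functions of those names.

```python
import json


def record(cache: dict, task: str, steps: list) -> None:
    """Store the steps of a successful agent run under the task name."""
    cache[task] = json.dumps(steps)


def replay(cache: dict, task: str, gui) -> bool:
    """Replay cached steps through a PyAutoGUI-style backend.

    Returns False when nothing is cached for the task, in which case the
    caller would run the full computer-use agent instead.
    """
    raw = cache.get(task)
    if raw is None:
        return False
    for step in json.loads(raw):
        kind = step["action"]
        if kind == "click":
            gui.click(x=step["x"], y=step["y"])
        elif kind == "type":
            gui.write(step["text"])
        elif kind == "scroll":
            gui.scroll(step["clicks"])
        else:
            raise ValueError(f"unknown action: {kind!r}")
    return True
```

On a machine with a display, passing the imported `pyautogui` module as `gui` would drive the real mouse and keyboard; a stub object with the same three methods is enough for a dry run.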


mattfrommars 310 days ago

Hi, yes. It makes sense now: cache the steps and replay them. A much more efficient strategy than running each step through an LLM every time.
