HN Mirror

by mertunsall 362 days ago

In browser-use, we combine vision + browser extraction and we find that this gives the most reliable agent: https://github.com/browser-use/browser-use :)

We recently gave the model access to a file system so that it never forgets what it's supposed to do - we already have a ton of users who are very happy with the recent reliability updates!

We also have a beta workflow-use, which is basically what's mentioned in the comments here to "cache" a workflow: https://github.com/browser-use/workflow-use
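As a rough illustration of what "caching" a workflow could mean, here is a minimal sketch. It is not workflow-use's actual API: `run_with_cache`, the step format, and the `plan` callable (a stand-in for the expensive agent run) are all hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical step format, e.g. {"action": "click", "target": "#submit"}.
Step = Dict[str, str]


def run_with_cache(
    task: str,
    cache: Dict[str, List[Step]],
    plan: Callable[[str], List[Step]],
) -> Tuple[List[Step], bool]:
    """Return (steps, from_cache).

    `plan` stands in for the slow, model-driven run that figures out the
    steps. It is only called the first time a task is seen.
    """
    if task in cache:
        return cache[task], True
    steps = plan(task)
    cache[task] = steps
    return steps, False


# Usage: the second call replays the recorded steps without planning again.
calls = []


def fake_plan(task: str) -> List[Step]:
    # Assumption: a real planner would drive the browser with a model here.
    calls.append(task)
    return [{"action": "click", "target": "#login"}]


cache: Dict[str, List[Step]] = {}
first_steps, first_cached = run_with_cache("log in", cache, fake_plan)
second_steps, second_cached = run_with_cache("log in", cache, fake_plan)
```

A real system would also need to detect when a cached workflow no longer matches the page and fall back to the agent, which this sketch leaves out.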

Let us know what you guys think - we are shipping hard and fast!

anerli 362 days ago

So there’s a very big difference between the sort of vision approach that browser-use takes and what we do.

browser-use is still strongly coupled to the DOM for interaction because of the set-of-marks approach it uses (for context - those little rainbow boxes you see around the elements). This means it’s very difficult to get it to reliably do interactions beyond straightforward click/type, such as drag and drop or interacting with a canvas.

Since we interact based purely on what we see on the screen using pixel coordinates, those sorts of interactions are a lot more natural to us and perform much more reliably. If you don't believe me, I encourage you to try to get both Magnitude and browser-use to drag and drop cards on a Kanban board :)
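To make the pixel-coordinate idea concrete, here is a minimal sketch of a drag expressed purely as mouse events, with no reference to DOM elements. It is an illustration only, not Magnitude's implementation: `MouseEvent` and `drag_events` are hypothetical names, and the linear interpolation is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MouseEvent:
    kind: str  # "down", "move", or "up"
    x: int
    y: int


def drag_events(
    start: Tuple[int, int], end: Tuple[int, int], steps: int = 10
) -> List[MouseEvent]:
    """Build a press, `steps` interpolated moves, and a release.

    Intermediate moves matter because many drag-and-drop widgets only
    react to a sequence of move events, not a single jump.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    (x0, y0), (x1, y1) = start, end
    events = [MouseEvent("down", x0, y0)]
    for i in range(1, steps + 1):
        t = i / steps  # fraction of the path covered so far
        events.append(
            MouseEvent("move", round(x0 + (x1 - x0) * t), round(y0 + (y1 - y0) * t))
        )
    events.append(MouseEvent("up", x1, y1))
    return events


# Usage: drag a card 300 px to the right in three moves.
events = drag_events((100, 200), (400, 200), steps=3)
```

With a driver such as Playwright, these events would map onto `page.mouse.move`, `page.mouse.down`, and `page.mouse.up`; the same sequence works on a canvas because it never needs an element handle.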

Regardless, best of luck!

nikisweeting 361 days ago

In our experience, DOM-based interaction is more repeatable and performant than vision / x,y-based interaction, but they each have their tradeoffs. As you said, click-and-drag is harder when the source and target aren't classic DOM elements (e.g. canvas). We'll likely add x,y-based interaction as a fallback method at some point.
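A DOM-first strategy with an x,y fallback could look roughly like the sketch below. This is not browser-use's code: `click_with_fallback` and the two callables it takes are hypothetical, and the rule "fall back when the DOM click reports failure" is an assumption.

```python
from typing import Callable, Optional, Tuple


def click_with_fallback(
    click_dom: Callable[[str], bool],
    click_xy: Callable[[int, int], None],
    selector: Optional[str],
    xy: Optional[Tuple[int, int]],
) -> str:
    """Try the DOM path first, then coordinates. Returns the path used.

    `click_dom` should return True if it found and clicked an element.
    """
    if selector is not None and click_dom(selector):
        return "dom"
    if xy is not None:
        click_xy(*xy)
        return "xy"
    raise LookupError("no DOM element matched and no coordinates were given")


# Usage with stand-in drivers: only "#save" exists in this fake DOM.
clicked_at = []


def fake_click_dom(selector: str) -> bool:
    return selector == "#save"


def fake_click_xy(x: int, y: int) -> None:
    clicked_at.append((x, y))


used_for_button = click_with_fallback(fake_click_dom, fake_click_xy, "#save", (10, 20))
used_for_canvas = click_with_fallback(fake_click_dom, fake_click_xy, None, (320, 240))
```

Trying the DOM path first keeps the repeatability mentioned above for ordinary elements, while the coordinate path covers targets such as canvas content that have no selector.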