Hacker News
I made a zero-cost browser-use tool – let AI click and type on webpages for you (github.com)
1 points by pdufour 5 hours ago

Link: https://github.com/pdufour/browser-use-wasm

So one of the big constraints of browser-use models is that they require a server running your vision language model (VLM) to handle the images and convert them into actions.

That means if, for instance, you are a site owner and want to include an AI widget that lets users control the page they are on via AI (e.g. ask the page to fill out a form), you would need a complicated server setup running a VLM.

I decided to build something different. We have had WebGPU and client-side models for a while, so I decided to build a library that does the following:

[Live page (iframe)] ──► [SnapDOM screenshot] ──► [ShowUI VLA WASM worker] ──► [DOM action at [x, y]]

Essentially this creates a browser-use model that runs entirely in your browser (no servers). There are a couple of libraries that make this possible:

- wllama, for instance, allows you to run any GGUF model, which means easy access to VLA models on HF (I found ShowUI-2B to be the best, but I want to try Nvidia LocateAnything)

- snapdom - as mentioned, this renders your webpage to an SVG, which is then passed to the VLA
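
A rough sketch of the last arrow in the pipeline above, turning the model's text output into a point on the page. This is not the library's code: it assumes the model emits a point normalized to [0, 1] relative to the screenshot (check your model's actual output format), and `Viewport`, `Point`, `parsePoint`, and `toPagePoint` are made-up names for illustration.

```typescript
// Hypothetical helpers for illustration, not the library's API.
interface Viewport {
  width: number;  // CSS pixels
  height: number; // CSS pixels
}

interface Point {
  x: number;
  y: number;
}

// Pull the first "[x, y]" pair out of raw model text, e.g. "click [0.42, 0.17]".
// Returns null when no pair is found.
function parsePoint(modelText: string): Point | null {
  const match = modelText.match(/\[\s*(-?\d*\.?\d+)\s*,\s*(-?\d*\.?\d+)\s*\]/);
  if (!match) return null;
  return { x: Number(match[1]), y: Number(match[2]) };
}

// Scale a normalized point (assumed to be in [0, 1]) to CSS pixels,
// clamping so the result always lands inside the viewport.
function toPagePoint(p: Point, viewport: Viewport): Point {
  const clamp = (v: number): number => Math.min(1, Math.max(0, v));
  return {
    x: Math.round(clamp(p.x) * (viewport.width - 1)),
    y: Math.round(clamp(p.y) * (viewport.height - 1)),
  };
}
```

In the browser, the resulting point could then be handed to something like `document.elementFromPoint(x, y)` to find the element to act on. Clamping matters because small models sometimes emit coordinates slightly outside the image.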

After creating the workflow with those libraries, the rest is cake (not).

Some difficulties I had and my solutions for them:

- Snapdom had 1px rendering differences caused by inconsistencies in how HTML using a system font is rendered inside a foreignObject tag in an SVG - the fix is to use fonts from a CDN, which provide font metrics for leading values

- Image resizing - you have to resize the screenshot to fit everything into limited space - this involved trying many different resizing approaches

- Accuracy - figuring out what improved accuracy was quite hard at first, until I found some evals such as MiniWoB++ (a web interaction test suite)

- Multi-step planning - my half-baked solution is to let the LLM generate all the steps up front, but for it to be comprehensive I would need to capture the page, generate, capture the page, generate, etc. in a loop. I haven't done that yet.
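
The capture, generate, act loop from the last bullet could look roughly like this. It is a sketch of the idea only, not something the library does today: `Action`, `Agent`, and `runLoop` are invented names, and the three callbacks are stand-ins for the page capture, the model call, and the DOM dispatch.

```typescript
// Hypothetical shapes, for illustration only.
type Action =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "done" };

interface Agent<Shot> {
  // Stand-in for taking a screenshot of the page.
  capture: () => Promise<Shot>;
  // Stand-in for asking the model for the next single action.
  nextAction: (goal: string, shot: Shot, history: Action[]) => Promise<Action>;
  // Stand-in for dispatching the action against the DOM.
  perform: (action: Action) => Promise<void>;
}

// Re-capture the page before every model call so each step sees the result
// of the previous one. maxSteps guards against a model that never says "done".
async function runLoop<Shot>(
  goal: string,
  agent: Agent<Shot>,
  maxSteps = 10,
): Promise<Action[]> {
  const history: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const shot = await agent.capture();
    const action = await agent.nextAction(goal, shot, history);
    if (action.kind === "done") break;
    await agent.perform(action);
    history.push(action);
  }
  return history;
}
```

Passing the callbacks in keeps the loop testable without a browser or a model. The tradeoff versus planning everything up front is one model call per step, which is slow for a client-side model.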

I am very interested in the client-side LLM space, so let me know if you have any thoughts!