Hacker News
Show HN: An open-source Operator that can use computers (github.com)
9 points by theonlyt3 434 days ago
Hi HN, I'm Terrell, and we built an open-source app that lets developers create their own Operator with a Next.js/React front-end and a Flask back-end. The purpose is to simplify spinning up virtual desktops (Xfce, VNC) and automate desktop-based interactions using computer use models like OpenAI’s

There are already various cool tools out there that let you build your own operator-like experience, but they usually only automate web browser actions, or they either aren’t open source or cost a lot to get started. Spongecake lets you automate desktop-based interactions and is fully open source, which will help:

- Developers who want to build their own computer use / operator experience
- Developers who want to automate workflows in desktop applications with poor / no APIs (super common in industries like supply chain and healthcare)
- Developers who want to automate workflows for enterprises with on-prem environments with constraints like VPNs, firewalls, etc (common in healthcare, finance)

Technical details: This is technically a web browser pointed at a backend server that (1) manages starting and running pre-configured docker containers, and (2) manages all communication with the computer use agent. (1) is handled by spinning up docker containers with appropriate ports to open up a VNC viewer (so you can view the desktop), an API server (to execute agent commands on the container), a marionette port (to help with scraping web pages), and socat (to help with port forwarding). (2) is handled by sending screenshots from the VM to the computer use agent, and then sending the appropriate actions (e.g., scroll, click) from the agent to the VM using the API server.
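For anyone who wants to picture the screenshot-and-action loop described above, here is a minimal sketch. It is not Spongecake's actual code: `Action`, `run_agent_loop`, and the three callables (`take_screenshot`, `choose_action`, `execute_action`) are hypothetical names standing in for the container's API server and the computer use model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Action:
    """Hypothetical action shape: a kind such as "click", "scroll", or "done"."""
    kind: str
    x: int = 0
    y: int = 0


def run_agent_loop(
    take_screenshot: Callable[[], Any],
    choose_action: Callable[[Any, List[Action]], Action],
    execute_action: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Screenshot -> model -> action, repeated until "done" or max_steps.

    All three callables are assumptions: in a real setup they would call
    the desktop's API server and the computer use model.
    """
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()               # capture the desktop
        action = choose_action(screenshot, history)  # ask the model what to do
        if action.kind == "done":                    # model says the task is finished
            break
        execute_action(action)                       # apply it on the desktop
        history.append(action)
    return history
```

The `max_steps` cap is a design choice in this sketch, so a model that never reports "done" cannot loop forever.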

Some interesting technical challenges we ran into:

- Concurrency - We wanted it to be possible to spin up N agents at once to complete tasks in parallel (especially given how slow computer use agents are today). This introduced a ton of complexity with managing ports, since the likelihood went up significantly that a port would be taken.
- Scrolling issues - The model is really bad at knowing when to scroll, and will scroll a ton on very long pages. To address this, we spun up a Marionette server and exposed a tool to the agent that extracts a website’s DOM. This way, instead of scrolling all the way to the bottom of a page, the agent can extract the website’s DOM and use that information to find the correct answer.
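On the port-collision point, one common approach (not necessarily what Spongecake does) is to bind to port 0 and let the OS hand out free ports. A minimal sketch, where `allocate_ports` and the service names are hypothetical:

```python
import socket
from typing import Dict, Iterable


def allocate_ports(names: Iterable[str], host: str = "127.0.0.1") -> Dict[str, int]:
    """Ask the OS for one free TCP port per service name.

    Binding to port 0 lets the OS choose an unused port. All sockets stay
    open until every port is picked, so the ports in one call are distinct.
    """
    sockets = []
    ports: Dict[str, int] = {}
    try:
        for name in names:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind((host, 0))                # port 0 = "any free port"
            sockets.append(s)
            ports[name] = s.getsockname()[1]
    finally:
        for s in sockets:                    # release so the container can bind them
            s.close()
    return ports


if __name__ == "__main__":
    # Hypothetical service names matching the ports mentioned in the post.
    print(allocate_ports(["vnc", "api", "marionette", "socat"]))
```

There is still a small race window between closing the sockets and the container binding the ports, so a retry when the container fails to start is a sensible complement.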

What’s next? We're working on adding support in the UI to run this locally on your own machine, and to spin up other desktop environments like Windows and macOS. We’ve also started working on integrating Anthropic’s computer use model. There are a ton of other features we could build, but we wanted to put this out there first and see what others would want.

Would really appreciate your thoughts and feedback. It's been a blast working on this so far, and I hope others think it’s as neat as I do :)

Here’s the link to clone: https://github.com/aditya-nadkarni/spongecake

4 comments

Very cool. I was thinking of writing a script for automating some immigration-related forms, might give this a go. Anything to bear in mind for form-filling?
Great question!

We actually included a basic form-filling example (data_entry_example.py) in our GitHub repo—definitely give it a spin and see how it goes.

One tip: filling out forms is currently a bit slow since each step runs sequentially. We're actively looking into concurrency improvements (for example, calculating multiple field interactions at once) to speed things up.

Excited to hear how it works for you—feel free to share any issues or feedback you run into!

Have you guys done any benchmarking to see which LLMs perform best?
No formal benchmarks yet—but just from our own tests, OpenAI's computer use model has generally done a better job than Anthropic's, especially at locating the right click targets and coordinates. We're definitely planning a more thorough comparison soon, though! Curious if anyone else has noticed differences in these computer use models? Would love to swap notes! :)
What's the most interesting use case you've seen so far?
Good question!

Probably the most surprising/interesting one I've seen is automating job applications: essentially, spinning up multiple concurrent agents to mass-apply across various job sites, automating things like clicks and form-fills.

I thought it was interesting, but would love to hear if you've thought of other quirky or unexpected use cases! :)

Umm, interesting! Not much to add except will check this out later when I get home and share my thoughts :))
Nice, excited to hear what you think! All feedback welcome :)