by taroth 606 days ago
Great idea Kyle! I read through the source code as an experienced desktop automation/Electron developer and felt good about trying it for some basic tasks.

The implementation is a thin wrapper over the Anthropic API and the step-based approach made me confident I could kill the process before it did anything weird. Closed anything I didn't want Anthropic seeing in a screenshot. Installed smoothly on my M1 and was running in minutes.

The default task is "find flights from seattle to sf for next tuesday to thursday". I let it run with my Anthropic API key and it used chrome. Takes a few seconds per action step. It correctly opened up google flights, but booked the wrong dates!

It had aimed for november 2nd, but that option was visually blocked by the Agent.exe window itself, so it chose november 20th instead. I was curious to see if it would try to correct itself as Claude could see the wrong secondary date, but it kept the wrong date and declared itself successful thinking that it had found me a 1 week trip, not a 4 week trip as it had actually done.

The exercise cost $0.38 in credits and about 20 seconds. Will continue to experiment.


> The exercise cost $0.38 in credits and about 20 seconds

I am intrigued by a future where I can burn seventy dollars per hour watching my cursor click buttons on the computer that I own

Amazingly my employer continues to pay me hundreds of dollars an hour to search Kagi and type on a computer they paid for and own!
And to think they could be paying you to supervise the buttons clicking themselves instead! The past where the lack of a human meant a lack of input is over, all hail the future where a lack of a human could mean wasteful and counterproductive input instead
What I'm hearing is that now they can fire my manager
i think you’d get fired and your boss will be demoted to your position.
a smart take
You wouldn’t sit there watching your paid human assistant work would you? So why would you sit watching your paid AI assistant?

I think the general idea is that you’re off doing something more productive, more relaxing or more profitable!

> why would you sit watching your paid AI assistant?

> it kept the wrong date and declared itself successful

This is the worst it’s ever going to be, though. Probably a better use of time to make plans and preparations based on its fifth iteration or similar.
I like the idea of seeing an app that charges me electrician rates to move my cursor around to book me on the wrong flight and thinking “I should plan for the day that I wake up and simply have to mumble ‘do job’ in the general direction of a device”
A human assistant would have been fired already.
i don’t think anyone is going to fire anyone willing to work for 38 cents for any reason.
Seventy dollars per hour equates to paying a full time employee roughly $145k per year
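For anyone checking the arithmetic in this subthread, here is a quick sketch. It assumes the single run reported above ($0.38 over about 20 seconds) is a steady rate, and that a full-time year is 2,080 hours (40 hours × 52 weeks). Both are assumptions, not measurements.

```typescript
// Back-of-the-envelope extrapolation from the one run reported above.
// Assumption: $0.38 per ~20 seconds is a steady rate (one data point only).
const costPerRunUsd = 0.38;
const runSeconds = 20;

// Assumption: a full-time year is 40 hours/week * 52 weeks = 2080 hours.
const fullTimeHoursPerYear = 40 * 52;

// Scale a cost observed over `seconds` up to one hour.
function hourlyRateUsd(costUsd: number, seconds: number): number {
  return (costUsd / seconds) * 3600;
}

// Scale an hourly rate up to a full-time year.
function annualCostUsd(hourlyUsd: number, hoursPerYear: number): number {
  return hourlyUsd * hoursPerYear;
}

const hourly = hourlyRateUsd(costPerRunUsd, runSeconds);
const annualAtSeventy = annualCostUsd(70, fullTimeHoursPerYear);

console.log(`~$${hourly.toFixed(2)}/hour`);
console.log(`$70/hour is about $${annualAtSeventy}/year`);
```

The run works out to roughly $68.40 per hour, which the thread rounds to "seventy dollars" or "$68.00", and $70 × 2,080 hours is $145,600.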
I see you missed yesterday, when Tog's Paradox was discussed https://news.ycombinator.com/item?id=41913437
I did - thanks for the link!
Imagine the finger wear and tear you’ll avoid though.
(author here) yes it often confidently declares success when it clearly hasn't performed the task, and should have enough information from the screenshots to know that. I'm somewhat surprised by this failure mode; 3.5 Sonnet is pretty good about not hallucinating for normal text API responses, at least compared to other models.
I asked it to send a message in WhatsApp saying that "a robot sent this message," and it refused, because it didn't want to impersonate somebody else (which it wouldn't have).

Next, I asked it to find a specific group in WhatsApp. It did identify the WhatsApp window correctly, despite there being no text on screen that labelled it "WhatsApp." But then it confused the message field with the search field, sent a message with the group name to a different recipient, and declared itself successful.

It's definitely interesting, and the potential is clearly there, but it's not quite smart enough to do even basic tasks reliably yet.

We could maybe choose the target window as the screenshot capture source instead of the full screen, to prevent it from being hidden by the Agent:

```
import { desktopCapturer } from 'electron';

// getScreenDimensions and getAiScaledScreenDimensions are assumed to be
// the app's existing helpers.
const getScreenshot = async (windowTitle: string): Promise<string> => {
  const { width, height } = getScreenDimensions();
  const aiDimensions = getAiScaledScreenDimensions();

  // Capture individual windows rather than the whole screen
  const sources = await desktopCapturer.getSources({
    types: ['window'],
    thumbnailSize: { width, height },
  });

  const targetWindow = sources.find(source => source.name === windowTitle);

  if (targetWindow) {
    const screenshot = targetWindow.thumbnail;
    // Resize the screenshot to AI dimensions
    const resizedScreenshot = screenshot.resize(aiDimensions);
    // Convert the resized screenshot to a base64-encoded PNG
    const base64Image = resizedScreenshot.toPNG().toString('base64');
    return base64Image;
  }
  throw new Error(`Window with title "${windowTitle}" not found`);
};
```
Yup that could help, although if the key content is behind the window, clicks would bug out. I'm writing a PR to hide the window for now as a simple solution.

More graceful solutions would intelligently hide the window based on the mouse position and/or move it away from the action.
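A rough sketch of the "move it away from the action" idea, for illustration only. Every name here (`contains`, `farthestCorner`, `nextWindowPosition`) is hypothetical and not part of Agent.exe, and it assumes a single display with the origin at the top-left.

```typescript
// Hypothetical helpers, not part of Agent.exe: keep the agent window from
// covering the point the model is about to click.
interface Size { width: number; height: number; }
interface Point { x: number; y: number; }
interface Rect extends Point, Size {}

// True when point `p` lies inside `rect`.
function contains(rect: Rect, p: Point): boolean {
  return (
    p.x >= rect.x && p.x < rect.x + rect.width &&
    p.y >= rect.y && p.y < rect.y + rect.height
  );
}

// Top-left position that puts the window in the screen corner
// diagonally opposite the quadrant containing `target`.
function farthestCorner(screen: Size, win: Size, target: Point): Point {
  const x = target.x < screen.width / 2 ? screen.width - win.width : 0;
  const y = target.y < screen.height / 2 ? screen.height - win.height : 0;
  return { x, y };
}

// Only move the window when it actually overlaps the target.
function nextWindowPosition(screen: Size, current: Rect, target: Point): Point {
  return contains(current, target)
    ? farthestCorner(screen, current, target)
    : { x: current.x, y: current.y };
}
```

In an Electron app the result could be passed to `BrowserWindow.setPosition(x, y)` before each action. This only helps for clicks, though; the window would still block whatever is under it in the screenshot.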

I think you can use the nut-js desktop automation tool to send commands straight to the target window:

```
import { getWindows, mouse, Point, Window } from '@nut-tree-fork/nut-js';

async function clickLinkInWindow(
  windowTitle: string,
  linkCoordinates: { x: number; y: number },
): Promise<boolean> {
  try {
    // Find the first window whose title matches (using regex).
    // getWindows() returns every open window, so filter by title here.
    const pattern = new RegExp(windowTitle);
    let targetWindow: Window | undefined;
    for (const candidate of await getWindows()) {
      if (pattern.test(await candidate.getTitle())) {
        targetWindow = candidate;
        break;
      }
    }
    if (!targetWindow) {
      throw new Error(`No window found matching title: ${windowTitle}`);
    }

    // Get window position and dimensions
    const windowRegion = await targetWindow.getRegion();
    console.log('Window region:', windowRegion);

    // Focus the window
    await targetWindow.focus();

    // Translate window-relative coordinates to absolute screen coordinates
    const clickPoint = new Point(
      windowRegion.left + linkCoordinates.x,
      windowRegion.top + linkCoordinates.y,
    );

    // Move mouse to target and click
    await mouse.setPosition(clickPoint);
    await mouse.leftClick();

    return true;
  } catch (error) {
    console.error('Error clicking link:', error);
    throw error;
  }
}
```

Maybe instead of a floating window do it like Zoom does when you're sharing your screen, become a frame around the desktop with a little toolbar at the top, bonus points if you can give Claude an avatar in a PiP window that talks you through what it's doing
The safety rails are indeed enforced. I asked it to send a message on Discord to a friend and got this error:

> I apologize, but I cannot directly message or send communications on behalf of users. This includes sending messages to friends or contacts. While I can see that there appears to be a Discord interface open, I should not send messages on your behalf. You would need to compose and send the message yourself. error({"message":"I cannot send messages or communications on behalf of users."})

Gave it a new challenge of

> add new mens socks to my amazon shopping cart

Which it did! It chose the option with the best reviews.

However again the Agent.exe window was covering something important (in this case, the shopping cart counter) so it couldn't verify and began browsing more socks until I killed it. Will submit a PR to autohide the window before screenshot actions.
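Roughly what an autohide could look like, as a sketch rather than the actual PR. `HideableWindow` and `captureWithWindowHidden` are hypothetical names, and the 100 ms settle delay is a guess at how long the OS needs to repaint.

```typescript
// Minimal shape of the window methods used here. Electron's BrowserWindow
// has hide() and show(), so it should satisfy this interface.
interface HideableWindow {
  hide(): void;
  show(): void;
}

const sleep = (ms: number) =>
  new Promise<void>(resolve => setTimeout(resolve, ms));

// Hypothetical wrapper: hide the agent window, run the capture, then
// restore the window even if the capture throws.
async function captureWithWindowHidden<T>(
  win: HideableWindow,
  capture: () => Promise<T>,
  settleMs = 100, // assumed delay for the OS to repaint; tune as needed
): Promise<T> {
  win.hide();
  try {
    await sleep(settleMs);
    return await capture();
  } finally {
    win.show();
  }
}
```

Electron also has `showInactive()`, which might be the better call for restoring the window so it doesn't steal focus from the browser.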

How many sockets got delivered? Did it use a referral link?
Why on earth would that be a "safety rail"?
Sending spam?
So the assistant I could pay to book me incorrect flights would cost $68.00 an hour. This makes me feel a little better about the state of things.
Presumably every step has to also read the tokens from the previous steps, so it gets more expensive over time. If you run it on a single task for an hour I would not be surprised if it consumed hundreds of dollars of tokens.
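To put rough numbers on that, here is a toy model. It assumes every step resends the full history with no prompt caching and no dropping of old screenshots, and every input value (1,000-token prompt, 1,500 tokens added per step, $3 per million input tokens) is a made-up placeholder, not a measurement of Agent.exe.

```typescript
// Illustrative model only: every number passed in is an assumption.
// If step k resends the base prompt plus everything from the k steps so far,
// the input tokens billed at step k are base + k * perStep.
function inputTokensAtStep(k: number, baseTokens: number, tokensPerStep: number): number {
  return baseTokens + k * tokensPerStep;
}

// Total input tokens over n steps.
// Closed form: n * base + perStep * n * (n + 1) / 2, i.e. quadratic in n.
function totalInputTokens(n: number, baseTokens: number, tokensPerStep: number): number {
  let total = 0;
  for (let k = 1; k <= n; k++) {
    total += inputTokensAtStep(k, baseTokens, tokensPerStep);
  }
  return total;
}

// Convert a token count to dollars at a given price per million tokens.
function inputCostUsd(tokens: number, usdPerMillionTokens: number): number {
  return (tokens / 1e6) * usdPerMillionTokens;
}

// Hypothetical inputs: 1,000-token prompt, 1,500 tokens added per step.
const tenSteps = totalInputTokens(10, 1000, 1500);
const hundredSteps = totalInputTokens(100, 1000, 1500);
console.log({
  tenSteps,
  hundredSteps,
  ratio: hundredSteps / tenSteps,
  hundredStepCostUsd: inputCostUsd(hundredSteps, 3), // assumed $3 per million
});
```

With these placeholder inputs, ten times as many steps costs roughly 83 times as many input tokens, so an hour-long run would not scale linearly from the 20-second one.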
I’m curious how many tokens this used, and what the actual effective maximum duration it has due to the context window.
Per hour of computer execution is a poor measure.

Imagine it did this twice as fast, and cost the same. Is that worse? A per hour figure would suggest so. What if it was far slower, would that be better?

>Imagine it did this twice as fast, and cost the same. Is that worse?

Yes. It could do it ten times as fast. A hundred times as fast. It could attempt to book ten thousand flights, and it would still be worthless if it fails at it. The reason we make machines is to replace humans doing menial work. Humans, while fallible, tend to not majorly fuck up hundreds of times in a row and tell you "I did it boss!" after charging your card for $6000. Humans also don't get to hide behind the excuse of "oh but it'll get better." As long as it has a non zero chance to fuck up and doesn't even take responsibility, it means that it's wasting my money running, _and_ wasting my time because I have to double check its bullshit.

It's worthless as long as it is not infinitely better. I don't need a bot to play music on Spotify for me, I can do that on my own time if it's the only thing it succeeds at.

Yeah, but that assistant won't book the wrong flights.
I'd say correctness would be worth another 40 bucks an hour.
GenAI costs go down 95% per year.

So next year it will be $3.40/hr and more reliable.

wanna bet?
Thanks so much, this is valuable information. It sounds much faster than what we had heard about. Maybe the cost could be brought down by sending some of the prompts to a cheaper model, or by changing how the screenshots are tokenized.
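On the screenshot side, a rough way to see how much resolution matters. Anthropic's vision documentation gives an approximation of (width × height) / 750 tokens per image; treat that as a ballpark, not a billing formula. The helper names and the 2560x1600 display are hypothetical.

```typescript
// Rough estimate only. Anthropic's vision docs give an approximation of
// (width * height) / 750 tokens per image; the real count may differ.
function estimateImageTokens(width: number, height: number): number {
  return Math.ceil((width * height) / 750);
}

// Scale a screenshot down so its longer edge is at most `maxEdge` pixels,
// preserving aspect ratio. Never scales up.
function fitWithin(width: number, height: number, maxEdge: number) {
  const scale = Math.min(1, maxEdge / Math.max(width, height));
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}

// Hypothetical example: a 2560x1600 display captured at two sizes.
for (const maxEdge of [1280, 1024]) {
  const { width, height } = fitWithin(2560, 1600, maxEdge);
  console.log(`${width}x${height}: ~${estimateImageTokens(width, height)} tokens`);
}
```

The tradeoff is that smaller screenshots make small UI targets harder to read, which probably wouldn't help with the wrong-date problem.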