Hacker News
by kcorbitt 606 days ago
(author here) Yes, it often confidently declares success when it clearly hasn't performed the task, even though the screenshots should give it enough information to know that. I'm somewhat surprised by this failure mode; 3.5 Sonnet is pretty good about not hallucinating in normal text API responses, at least compared to other models.
1 comment

I asked it to send a message in WhatsApp saying that "a robot sent this message," and it refused because it didn't want to impersonate somebody else (which sending that message wouldn't have done).

Next, I asked it to find a specific group in WhatsApp. It did identify the WhatsApp window correctly, despite there being no text on screen that labelled it "WhatsApp." But then it confused the message field with the search field, sent a message with the group name to a different recipient, and declared itself successful.

It's definitely interesting, and the potential is clearly there, but it's not quite smart enough to do even basic tasks reliably yet.