|
|
|
|
|
by kcorbitt
606 days ago
|
|
(author here) yes it often confidently declares success when it clearly hasn't performed the task, and should have enough information from the screenshots to know that. I'm somewhat surprised by this failure mode; 3.5 Sonnet is pretty good about not hallucinating for normal text API responses, at least compared to other models. |
|
Next, I asked it to find a specific group in WhatsApp. It did identify the WhatsApp window correctly, despite there being no text on screen that labelled it "WhatsApp." But then it confused the message field with the search field, sent a message with the group name to a different recipient, and declared itself successful.
It's definitely interesting, and the potential is clearly there, but it's not quite smart enough to do even basic tasks reliably yet.