Hacker News new | ask | show | jobs
by meander_water 5 hours ago
> So we ran it head-to-head against Claude Opus 4.8: same one-shot prompt, build a 3D platformer in raw WebGL from scratch

Running a single one-shot prompt is not a benchmark, not is it representative of any sort of real-world usage.

Most agent usage is collaborative so you need to test things like reliability (when I delegate a task, does it complete it without making up test results for e.g.) and steerability (does it obey my instructions or does it just do what it thinks is best).

4 comments

Hi, I am the author, I completely agree! I set out to run a vibe test on this one, not a benchmark, the real benchmarks are listed. My test shows what the models can do when both tasked with a long-running, technically difficult, one-shot task.

I think your test you describe (collaborative, task delegation, task completion, TTD, steerability) is a great format for a future test that I will definitely try out.

Tbf, most of the "real benchmarks" have issues that are just as bad. Assessing LLM performance is just hard
Thanks, I didn't mean to be brusque, but I have seen a lot of these vibe tests lately that come to grand conclusions like "X model is better than Y" from the result of a single prompt.

Appreciate you sharing the results of your tests though!

I appreciate the feedback!
On the other hand, I did just leave my pi agent running GPT 5.5 overnight on a clearly defined, long running task. It's been running about 10 hours now and it's mostly done. So this kind of use case is also valid.

Thinking about it, I would say that the majority of agentic work I do, by a long shot, is subagents which are launched from the main session, using a prompt of its choosing. Those could be considered short versions of these fully autonomous tasks.

Care to share more about your pi setup? I've recently started using it (after long-time Claude Code work) and was wondering how you'd achieve these long-running tasks. Do you allow it to spawn sub-agents? Thank you!
Yes, part of the reason I chose the one-shot test was really to test long-running tasks. A lot of people seem to be experimenting with this format, for example in the now trending loop-writing workflows. And really I am interested in diving into the murky waters of these novel workflows.
sure that's why we look at a mix of formal benchmarks, one longer analysis of a side-by-side, and various other people who we trust to form an opinion, all covered in the article - not intended to be a formal benchmark, there are enough of those.
Then maybe you should add that caveat emptor to the article?

You make a very strong claim at the end that the hype is mostly real, and making it clear to what extent your claim holds should help the reader.

Totally agree, a single one-shot prompt can't prove anything.