HN Mirror

by meander_water 5 hours ago

> So we ran it head-to-head against Claude Opus 4.8: same one-shot prompt, build a 3D platformer in raw WebGL from scratch

Running a single one-shot prompt is not a benchmark, nor is it representative of any sort of real-world usage.

Most agent usage is collaborative, so you need to test things like reliability (when I delegate a task, does it complete it without, for example, making up test results?) and steerability (does it obey my instructions, or does it just do what it thinks is best?).
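
A minimal sketch of what measuring reliability over repeated runs could look like, assuming each delegated task can be reduced to an independently verified pass/fail result. `run_task` and `fake_agent_run` are hypothetical placeholders (not part of any real agent API), and the 70% success rate is made up purely for illustration.

```python
import math
import random
from typing import Callable


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def pass_rate(run_task: Callable[[], bool], trials: int) -> tuple[float, tuple[float, float]]:
    """Run the same task `trials` times; return the pass rate and its interval."""
    successes = sum(1 for _ in range(trials) if run_task())
    return successes / trials, wilson_interval(successes, trials)


def fake_agent_run(rng: random.Random, true_rate: float = 0.7) -> bool:
    """Hypothetical stand-in for: delegate the task, then verify the result
    yourself (e.g. re-run the tests rather than trusting the agent's report)."""
    return rng.random() < true_rate


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the simulation is repeatable
    for n in (1, 10, 100):
        rate, (lo, hi) = pass_rate(lambda: fake_agent_run(rng), n)
        print(f"trials={n:3d}  pass rate={rate:.2f}  95% interval=({lo:.2f}, {hi:.2f})")
```

Under these assumptions, a single successful run gives an interval of roughly 0.21 to 1.0, which is one way to see why one passing one-shot prompt says little about how often a model completes the task.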

4 comments

jameswhitford 4 hours ago

Hi, I am the author. I completely agree! I set out to run a vibe test on this one, not a benchmark; the real benchmarks are listed. My test shows what the models can do when both are given a long-running, technically difficult, one-shot task.

I think the test you describe (collaborative, task delegation, task completion, TTD, steerability) is a great format for a future test that I will definitely try out.

wongarsu 4 hours ago

Tbf, most of the "real benchmarks" have issues that are just as bad. Assessing LLM performance is just hard.

meander_water 4 hours ago

Thanks, I didn't mean to be brusque, but I have seen a lot of these vibe tests lately that come to grand conclusions like "X model is better than Y" from the result of a single prompt.

Appreciate you sharing the results of your tests though!

jameswhitford 3 hours ago

I appreciate the feedback!

esperent 4 hours ago

On the other hand, I did just leave my pi agent running GPT 5.5 overnight on a clearly defined, long-running task. It's been running for about 10 hours now and it's mostly done. So this kind of use case is also valid.

Thinking about it, I would say that the majority of agentic work I do, by a long shot, is subagents launched from the main session, each with a prompt of the main session's choosing. Those could be considered short versions of these fully autonomous tasks.

thunspa 30 minutes ago

Care to share more about your pi setup? I've recently started using it (after long-time Claude Code work) and was wondering how you'd achieve these long-running tasks. Do you allow it to spawn sub-agents? Thank you!

jameswhitford 3 hours ago

Yes, part of the reason I chose the one-shot test was really to test long-running tasks. A lot of people seem to be experimenting with this format, for example in the now-trending loop-writing workflows. And really, I am interested in diving into the murky waters of these novel workflows.

ritzaco 5 hours ago

Sure, that's why we look at a mix of formal benchmarks, one longer side-by-side analysis, and various other people we trust in order to form an opinion, all covered in the article. It's not intended to be a formal benchmark; there are enough of those.

patates 4 hours ago

Then maybe you should add that caveat to the article?

You make a very strong claim at the end that the hype is mostly real, and making it clear to what extent your claim holds should help the reader.

unliftedq 4 hours ago

Totally agree, a single one-shot prompt can't prove anything.
