Hacker News
by teiferer 5 days ago
> It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.

And that's the thing. These comparisons are all gut feelings. I'm missing objective unbiased measurements to actually have real comparisons between different models, their different generations, or even just the convention that everybody adds "you are an expert software engineer" and "don't make mistakes" to their prompts because they think it improves anything. Nobody knows if it actually does.

18 comments

Vibes are all that matter. As soon as you start measuring it, that measurement becomes a target and vendors start optimizing for it at the expense of the general usefulness of the model. We’ve seen plenty of models with great benchmark scores flop when people start using them.
If benchmarks didn’t exist we would have to invent them because “vibes” is a ridiculous idea: oh I know I’ll be super unscientific and horrendously biased and that’s far better than a team of experts carefully AND CONTINUALLY developing a variety of benchmarks of varying quality that…hmm all point to the same thing.

You can’t benchmaxx an eval that comes after your model release.

Consider also that benchmaxxing makes no sense given the incentive structure: the quality of these models is directly correlated with how well you can measure true performance in the wild. If they were just stupidly benchmaxxing they would be unable to do trustworthy ablations or know how well the model will perform in their product.

Remember the famous case of asserted benchmaxxing from llama 4? The entire org was gutted and the ceo spent billions hiring better people. Every lab takes evaluations extremely seriously.

> You can’t benchmaxx an eval that comes after your model release

Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

> Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

This is...just incredibly conspiratorial and a bit silly. You can make a benchmark right now and run it on the models. They'll have a benchmaxxed model on your...previously non-existent benchmark? I mean: if models really were overfit to benchmarks, which no lab is doing because it's idiotic, against their incentive structure, and easy to detect, then why would we see a slow ascension of performance on, say, Humanity's Last Exam, to take one benchmark as an example? You could trivially get those numbers close to 100% if you wanted to.

Yeah, nobody's ever silently changed a model while it was deployed. That would be illegal!
What does this have to do with what I’m saying? Of course the models are updated. I’m saying a new benchmark isn’t public and the model wouldn’t know it is being evaluated on a new benchmark.

Not to mention: thinking that the api behind the scenes is literally swapping to overfit models to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.

I'm not suggesting anyone is doing anything, just stating the objective fact that it is definitely possible for closed-weight model developers, and would be super hard to detect outside of this limit scenario you posit, where it is provably impossible for the provider to have seen the benchmark before it was run (which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking).

To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.

It's not a limit scenario, is my point: these models are evaluated constantly, new benchmarks both public and proprietary are in constant development, and benchmarks are not always static either; they are often living benchmarks that update over time.

You are making a technical point. I am pointing out that while this is _technically_ possible for _some_ benchmarks, it's not possible for plenty of benchmarks that all agree with the others.

> which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking

yes this is incredibly common. I'm not talking about hypothetical scenarios.

> To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.

Even if you believe this, you're doing some mental gymnastics if you think this is really the most likely explanation for what we're seeing. It's absolutely possible to benchmark proprietary models when you don't have access to the weights or control over the API, even if they are adversarially trying to combat this, which they aren't. Doing what you're describing would be easy to detect: you'd see extremely high benchmark scores for established benchmarks and then poor scores for new benchmarks as they come out. It would be relatively easy to figure this out and not subtle.
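
The detection argument above can be made concrete. A minimal sketch, assuming you have per-benchmark scores for one model and know when each benchmark was published relative to that model's release; the function name and all numbers are hypothetical illustrations, not real results.

```python
from datetime import date
from statistics import mean


def contamination_gap(scores, model_release):
    """Mean score on benchmarks published on or before the model's release
    minus the mean score on benchmarks published after it.

    scores: list of (benchmark_release_date, score in [0, 1]) tuples.
    A large positive gap is the pattern described above: strong on
    established benchmarks, weak on new ones.
    """
    old = [s for d, s in scores if d <= model_release]
    new = [s for d, s in scores if d > model_release]
    if not old or not new:
        raise ValueError("need benchmarks from both before and after the release")
    return mean(old) - mean(new)


# Made-up numbers, purely for illustration.
example_scores = [
    (date(2024, 1, 10), 0.90),
    (date(2024, 6, 1), 0.80),
    (date(2025, 3, 15), 0.50),
    (date(2025, 9, 1), 0.40),
]
gap = contamination_gap(example_scores, model_release=date(2025, 1, 1))
```

A gap on its own proves little, since newer benchmarks are often simply harder. The useful signal is one model's gap being much larger than the gaps of its peers on the same set of benchmarks.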

> This is...just incredibly conspiratorial and a bit silly.

Do you think? Have you seen the insane valuations at which the AI companies are going to do their IPOs? They surely leave no idea off the table when hundreds of billions of USD are on the line. You could even say they'd be negligent if they'd not at least explore those avenues.

They don't have control over measurement. Consider also that it's easy to figure this out, and it creates a scandal. Like I said, consider Llama 4, which a lot of people pointed out used a custom model in LMArena to inflate its scores; it's never been clear what the true underlying story was, but regardless, that model release spurred billions of dollars of spending on new talent and a complete gutting of that org.

These companies have to care about good measurement frameworks because the quality of their models depends on it. Any PR department can polish a turd, but an army of smart researchers far outside the control of these companies are going to figure it out if they are gaming metrics.

Vibes is just UX. There's whole careers, teams, and even industries dedicated to it, and yeah it isn't easy because you need aggregate data from people.
Um kind of but not really, it’s a mix of UX and actual measurements of what tasks it can do. Also UX is virtually the same thing: scaled quantitative surveys and preference metrics. It’s again, just benchmarking, and it’s done carefully and with best practices.
Imagine unironically starting your comment with "Um" in 2026.
As opposed to your incredibly useful contribution to this thread, thanks!
You don't have to imagine!
ya gotta have a vibe for everything if you want to compare vibes, though. you can't just have a vibe for fable 5 alone AND say that it's better than anything out there. there's no weight in that verdict at all, no meaning. it's like reviewing a book without reading it.

throw the same prompt at multiple models and see how far each one gets. change the prompt used in the benchmark every day so models can't be optimized for that one prompt. use your vibe glands all you want, but don't issue model judgements without any ability to compare apples to apples.
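
The "same prompt for every model, different prompt every day" idea could look roughly like this. A sketch only: the models are hypothetical callables standing in for real API clients, and the prompt pool is an assumption of mine.

```python
import hashlib
from datetime import date


def prompt_of_the_day(prompt_pool, day):
    """Pick one prompt deterministically per calendar day, so every model
    gets the same prompt on a given day and the choice rotates over time."""
    digest = hashlib.sha256(day.isoformat().encode()).hexdigest()
    return prompt_pool[int(digest, 16) % len(prompt_pool)]


def run_comparison(models, prompt_pool, day):
    """models: dict of name -> callable(prompt) -> output.
    Returns the chosen prompt and each model's output for it."""
    prompt = prompt_of_the_day(prompt_pool, day)
    return prompt, {name: model(prompt) for name, model in models.items()}


# Hypothetical stand-ins; a real harness would call each provider's API here.
models = {
    "model_a": lambda prompt: prompt.upper(),
    "model_b": lambda prompt: prompt[::-1],
}
pool = ["write a CSV parser", "fix the flaky test", "add pagination to the API"]
prompt, outputs = run_comparison(models, pool, date(2026, 1, 15))
```

Rotation only helps if the pool itself stays private; a published pool can be optimized for just like a single fixed prompt.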

You are literally describing a benchmark
100% agree on this! These new models' best performance is always experienced in the first hour of communicating with them. If you have a specific problem with a clear goal in mind, then you have one hour to get the best out of any AI model. Personally, every time I took an AI suggestion, I walked through a wall sideways. AI is hands down a smart technology that throws dictionary vibes!
Benchmaxxing isn’t the only problem. Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

That’s why students are evaluated by teachers with more knowledge and experience than them. It follows that any mechanical evaluation scheme is hopelessly inadequate for measuring the true capabilities of a frontier language model.

> students are evaluated by teachers with more knowledge and experience than them

This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.

> This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration)

I can't speak to the humanities, but this estimation is just not true at most universities in the sciences. (EDIT: As cycomanic emphasizes below (https://news.ycombinator.com/item?id=48477683), the part of the original comment pertaining to graduate education is more reasonable. I am speaking here only of undergraduate education.)

It certainly is true in physics and engineering that a PhD student at least halfway through their PhD should know more than their supervisor about their topic (and usually much earlier). Even a Masters thesis project student should understand the intricacies of their project better than their supervisor. I'm speaking as someone who has supervised a significant number of both PhD and Masters students.
The original post said “in college”. It might be true for PhD candidates halfway through their program, but that’s like 0.5% of college students. The vast majority of students are leagues behind their instructors in domain knowledge.
Yes, I intentionally left out the next part of the quote about graduate school, since that seems more accurate. I was disputing only the part that I took to be pertaining to undergraduate education. The full quote is:

> This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.

A grad student is evaluated on how well they can follow scientific procedures and communicate their results, and on whether they have a sufficiently broad knowledge foundation. All that can easily be verified by a professor in a related field since they are very experienced in all those things. They don't actually need to be experts in the specific narrow topic the student has become the world expert in.
> Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??

> How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??

That is what benchmarks and intelligence tests are, and they are vulnerable to benchmaxxing etc. You won't be able to do this by gut feel, though; you can create a personal benchmark.

But the point was that personal judgement of intelligence requires high intelligence. Creating a benchmark doesn't require as much but is more vulnerable.

Yet human judgement isn’t subject to side effects like fluency and persuasiveness? It’s like everyone in this thread dismisses benchmarks and then…describes a crappy benchmark.

Sure you can create a personal benchmark. Who will evaluate it, you? How many tasks will it have? How will you evaluate success? Will you know which model is which or will you be blind? Which one will you do first? Ah right, benchmarking.
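
For the "will you be blind?" question, a blind pairwise comparison is not much code. A rough sketch, assuming two models' outputs for the same task are already collected; the function names and example outputs are hypothetical.

```python
import random


def blind_pair(outputs, rng):
    """outputs: dict with exactly two entries, model name -> output.

    Returns (shown, key). `shown` maps the neutral labels "A" and "B" to the
    outputs in random order. `key` maps each label back to its model name and
    should stay hidden until the rater has voted.
    """
    if len(outputs) != 2:
        raise ValueError("expects exactly two model outputs")
    names = list(outputs)
    rng.shuffle(names)
    key = dict(zip("AB", names))
    shown = {label: outputs[name] for label, name in key.items()}
    return shown, key


def win_rates(votes):
    """votes: list of winning model names, unblinded via `key` after voting."""
    if not votes:
        return {}
    return {name: votes.count(name) / len(votes) for name in set(votes)}


rng = random.Random()  # unseeded, so label order differs between sessions
shown, key = blind_pair({"model_a": "output one", "model_b": "output two"}, rng)
```

Randomizing the order per task also covers the "which one will you do first?" question, since neither model is consistently seen first.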

Also, benchmaxxing isn’t possible when the benchmark and measurements come after the model is released, right?

I've been testing some models that score higher than Opus 4.6.

They:

- hallucinate constantly

- can't follow basic instructions

- think they're Claude for some reason ;)

The only one I see that thinks it is claude other than claude itself is the GLM series.
I have screenshots of Deepseek V4 doing this too - in a non-Claude-Code harness.
Also MiMo...
Lots of things in life are gut feelings. It would be really great if we could determine quantitatively forever whether Rust is a superior programming language to Go, but real life resists those kinds of measurements.
> real life resists those kinds of measurements

no it doesn't, there's just no single measurement that will answer everyone's "which is better" question.

Go is better for some stuff. Rust is better for other stuff. Perl is better for other things.

"better" can mean anything, but if you define it, then it has definition, and you can measure it. So, you have multiple definitions of "better" and you use them all when you compare.

zero people have the same weights of the various definitions of "better", even among programming languages; look at how much javascript is written today. JS is not a better language in any measure that is based on rational thought, but for some people "this is javascript and nothing else is javascript" is enough for them to know that javascript is the better choice for their project.

Don't you think this applies to LLMs too?
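
The "multiple definitions of better, weighted differently per person" point can be shown with a toy weighted score. This is only an illustration: the option names, metric names, scores, and weights are all made up.

```python
def weighted_score(metrics, weights):
    """Combine per-definition scores (each in [0, 1]) into one number using
    one person's weights. Different weights can reverse the ranking."""
    total = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total


# Hypothetical scores for two options under two definitions of "better".
option_x = {"speed": 0.9, "ease": 0.3}
option_y = {"speed": 0.5, "ease": 0.8}

# Two people who weight the same definitions differently.
cares_about_speed = {"speed": 3, "ease": 1}
cares_about_ease = {"speed": 1, "ease": 3}

speed_person_prefers_x = (
    weighted_score(option_x, cares_about_speed)
    > weighted_score(option_y, cares_about_speed)
)
ease_person_prefers_y = (
    weighted_score(option_y, cares_about_ease)
    > weighted_score(option_x, cares_about_ease)
)
```

Both people are measuring the same things; they disagree only on the weights, which is where the "which is better" arguments actually come from.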
> determine quantitatively forever whether Rust is a superior programming language to Go

Ha, of all examples you had to pick this :D I think we can very well determine that qualitatively.

So .. where can we read about the results?
ugghh, benchmarks?
Benchmarks about the superior programming language?

You mean benchmarks about the programming language that produces the fastest code?

That is not really the same.

There are tons of benchmarks in the announcement. But we also know that benchmarks are problematic.

So the best we can do right now seems to be to combine imperfect case studies like this with imperfect benchmarks to get some unreliable impression of where we are...

Yes, these are gut feelings. That said, I have lots of experience with Opus and I have lots of projects and contributions (all reviewed and tested) made with the help of it. Definitely useful, to me and to people whose project matters to them. :P

Adding "do not make mistakes" is silly, in my opinion. There is always a good chance it will make mistakes. You should rather be more specific about a thing rather than as broad as "do not make mistakes" is. It just does not work that way.

"Check your work for mistakes after the first draft" maybe :)
Ok but isn’t that true of all software development? It’s not like anybody’s done a rigorous test of writing their entire codebase in Python vs Java. It’s all vibes based there. People create post-hoc justifications for why they use certain technologies but the reality is a lot more vibes than anything else.
No, relative performance between Python and Java can absolutely be measured.
Yes, but performance is not the only factor in whether a specific language is better than another for a specific project.
I added "you can do anything if you believe" to my agent and it went from not even attempting things to just doing them effortlessly.

I know how stupid that sounds but it's true.

Well what do they say... "If it sounds stupid but it works, then it's not stupid!"

How do you measure the performance of people? This is subjective and biased every time.
I have a couple projects that have completely stalled because none of the frontier models could advance any further with them - I'm going to give fable a try at them this coming weekend.

I believe the "you are an expert software engineer" thing puts them into a "mindset" of cosplaying a software engineer - whereas I get astounding results by talking to them in the information-dense, jargon-heavy mode I use with my peers. I can't prove it but I believe that places my session in a better place in latent space.

ymmv

Yes, words matter.

My favourite example is that if you use "timestamp" when using an LLM to process video you get worse results than if you'd use "timecode".

AV professionals always say "timecode" - timestamp is a programming term.

Using the right word pushes the model closer to the correct spot in the cloud of vectors that is its "brain".

fwiw, I gave it the same vibecoding project I'd previously tried with Sonnet 4.5 and it took Fable 2 hours to go well beyond (like, 2x beyond) where I got in 8 hours with Sonnet 4.5. (beyond that idk, because past 8 hours with the Sonnet 4.5 version I hit the "vibe limit" where it becomes easier to just write/edit the code yourself than get the agent to do what you want; and past 2 hours with Fable I hit my usage limit.)
Addendum: Interestingly, it ended up taking me about the same amount of time - 8 hours or so - to hit the "vibe limit" with Fable. But in that amount of time I made about 5-10x as much progress. So my feelings are:

1. It's exponentially better

2. yet, somehow, hand coding still isn't dead, at least for me

How many $ do you guys spend when your session runs for 30min? What's the total budget?
I just have a regular Claude subscription and keep within its usage limits
But isn't running Claude models for 30min expensive? Or is Claude Code not expensive?

I use Cursor and if I ran Claude models for 30min I might exhaust my monthly budget! Maybe it's an API billing issue though

It's included free with subscription plans until June 22. I get about 2 hours a day of usage through Claude Code until I hit my usage limit. I just use it for 2 hours then wait for the next day.
Just treat it like an employee with infinite energy. You can never really measure the productivity or ability of employees, it’s just pretty obvious when one is better than another. You’re asking them to do things and they’re either coming up with the goods or they aren’t. You can’t really expect much more from agents either but I’m not sure why you need anything more.
That’s what evals are for.

And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.

I think (related to the threads below) properly running evals on state-of-the-art models is likely outside the budget of most individuals. It's undoubtedly the right thing to do.

It would be very useful for companies to isolate interesting programming challenges in their past and publish evals on them (without revealing the actual codebase). In theory companies adopting these models should already be doing this to evaluate cost/benefit for each model, so it would be a matter of publishing them on a regular basis.

IMO comparing different models is like comparing songs or paintings or modern art.

There is no true objective measure. Can you mathematically determine which song is the best for everyone, for example? Or which painting different people feel is the nicest to look at, or what emotion it gives them?

Yea, you can do the fucking strawberry tests or carwash trick questions, but that doesn't really measure anything useful.

You can also do benchmarks but how do you measure the output of those?

The easiest way is just to use them all and get the feels of which of them works best for you. For me it's Claude first, pi.dev + gpt5.5 second. Plain Codex is a distant third and Gemini exists - it's pretty good at finessing web UIs as it does aria labels and usability better than others, but I wouldn't write backend code with it.

> IMO comparing different models is like comparing songs or paintings or modern art.

I don't think this is that subjective or vague.

There are a couple of crisp metrics that can be used to evaluate a model:

- given a prompt, does it finish a task (times X tasks)

- how much did it cost to finish the task

- how long did it take?

If all models are able to handle a class of tasks, they perform equally well.

If a model costs much more to finish a task, it is worse than other models.

If a model takes longer to finish a task, it is worse than other models.
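
Those three metrics are straightforward to tabulate. A rough sketch, assuming each run is recorded as pass/fail against a definition of done, plus its cost and wall-clock time; the `Run` record, its field names, and the numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    model: str
    task: str
    finished: bool   # did the output meet the task's definition of done?
    cost_usd: float
    seconds: float


def summarize(runs):
    """Per model: completion rate over all runs, plus mean cost and mean
    duration over the finished runs only (None if nothing finished)."""
    by_model = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)
    summary = {}
    for model, model_runs in by_model.items():
        done = [r for r in model_runs if r.finished]
        summary[model] = {
            "completion_rate": len(done) / len(model_runs),
            "mean_cost_usd": mean([r.cost_usd for r in done]) if done else None,
            "mean_seconds": mean([r.seconds for r in done]) if done else None,
        }
    return summary


# Made-up runs, purely for illustration.
runs = [
    Run("model_a", "t1", True, 1.0, 60.0),
    Run("model_a", "t2", True, 3.0, 120.0),
    Run("model_b", "t1", True, 0.5, 30.0),
    Run("model_b", "t2", False, 0.7, 200.0),
]
summary = summarize(runs)
```

Averaging cost over finished runs only is one choice among several. Failed runs cost money too, so total spend per finished task may be the fairer number when completion rates differ.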

The ugly truth is that since the GPT4.1 days, new model releases have shown diminishing returns. Context windows were increased, reasoning steps help improve the usefulness of a user's prompt,... That's it. Even those are UX improvements, instead of huge breakthroughs.

"Diminishing returns", so are you claiming unironically that GPT4.1 can achieve anything Fable 5 can?

Or just that it's so much cheaper that the cost/benefit ratio is better?

Also "finish a task" is also subjective. I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?

> "Diminishing returns", so are you claiming unironically that GPT4.1 can achieve anything Fable 5 can?

I see you felt compelled to use the weasel word "anything" to put together an argument. That suggests you are very well aware that the difference between older models and the latest and greatest is not that significant, as you need to resort to coming up with a single example, any example at all no matter how far fetched, to try to put together a case.

And that says it all.

> Or just that it's so much cheaper that the cost/benefit ratio is better?

That too is another definition of quality, isn't it?

If you have two tools and one does the same job but is both cheaper and faster, it means it is objectively better.

> Also "finish a task" is also subjective.

No, it isn't. If you supply a prompt and you have a definition of done, and a model executes it and delivers what you asked then it finished the task successfully.

> I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?

Nonsense. If you feel the need to put up strawmen then it's up to you to justify them. Please define "quality" and prove that a model such as fable has such a radically different output that in comparison the output of older models is "shitty".

I understand you feel the need to keep the hype bus going, but you need more than strawmen, weasel words, and hand waving to keep that hype afloat.

And the truth of the matter is that the models introduced in the past year don't introduce any breakthrough and struggle to show significant improvements over older models.

The first thing in the release page is benchmark results...

https://www.anthropic.com/news/claude-fable-5-mythos-5

The benchmarks are now the equivalents of SAT/ACT/other standardized exams for humans. They are directionally quite predictive, but with plenty of outcome variance on the margins
Yeah, if the jump is big, then we should be able to see the qualitative improvements, or see where Opus was tripped up in a task and Fable did succeed
It’s almost like they’re interchangeable. We need to start asking these models to solve extremely difficult, contrived DSA coding questions before deciding which ones we employ
I believe there is hard evidence that role-playing prompts are effective at leading it towards particular strategies and trains of thought. Not sure that SWE has been specifically studied, but proper science is very slow in the context of rapid change and broad context. It's good to stay grounded in the science that has been done, but we're going to have to do our best in uncharted territory for a while.

"Don't make mistakes" does seem dumb. It's not guidance.

> These comparisons are all gut feelings.

https://simonwillison.net/about/#disclosures

"I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events."

But I'm totally unbiased on my gut-feeling posts, trust me bro.

-- AI influencers.

Anthropic didn't give me early access to this model, shouldn't that bias me against it?
You kinda proved the point...
How?
If you're that easily biased then why trust your assessment?
Where did I say I was biased?
This isn't some random dipshit, this is Simon Willison[1]. He has a bit more cred than some "AI influencer".

[1]https://en.wikipedia.org/wiki/Simon_Willison