Claude Fable 5: mid-tier results on coding tasks (endorlabs.com)
201 points by bugvader 8 hours ago
22 comments

This matches my experience. Burned $2K to see how it would perform on frontend and backend tasks.

On frontend, it did a significantly better job than Opus on toy-scale wireframe projects by using gimmicks like fluid dynamics. But on medium to big tasks, like a multi-page web app where the model must decide layouts and aesthetics itself, Fable and Opus got indistinguishable scores from human judges.

On backend, I gave it tasks related to setting up a data flow that involves Postgres, R2, Kubernetes, gVisor, and so on. The noticeable gap: Opus did better than Sonnet, but Fable actually returned a result that fails while confidently stating that it ran X, Y, Z tests to ensure it works and got these results. Very surprising, given that neither Opus nor Sonnet suffered from this problem.

Longest frontend task was ~2H. Backend, 8H.

Though none of the tasks were related to developing LLMs (just a production-grade secure system that could've been developed 20 years ago, no LLMs involved), it is possible Claude Fable downgraded itself or spat out fake results. There'd be no way of knowing, since Anthropic silently degrades model quality based on undisclosed internal criteria that are claimed to be about LLMs.

We decided Fable is unpredictable and cannot be trusted to the degree that Opus and Sonnet can be trusted for any projects beyond toy-scale quick wireframes, but Fable can be the best tool for quick UI UX wireframing for non-technical roles.

This seems insane to me. Aren't long running tasks an anti pattern at the moment? My understanding of the literature is that small mistakes in the chat history compound and pull performance down.
>Aren't long running tasks an anti pattern at the moment?

Longer running tasks require better setups and several ways of pinning the progress to reality. When you have that, though, things are quite all right.

A good long running task will run inside a framework that it's not trying to modify.

A single 8h task? I'm sorry, but that's just asking for trouble.
I don't understand how some of y'all use these things. I get garbage unless I give them very specific concrete tasks with as much context as possible. Anything that takes more than 30 min is usually a waste because the scope was too large.
Different people just have different concepts of what's garbage and what's not.

There seems to be some kind of AI hysteria going on, with people becoming so enamoured with the AI that they accept anything it produces as if it's some gift from the gods, while others just reject it prima-facie.

For example, the worst design I have seen recently was from a designer who pivoted into "vibe coding influencer". The worst code is from developers who were heavily into Clean Code a couple of years ago and now half of their PRs are unused dead code.

You have to build up a context, or otherwise seed the memory, to get anything useful out of these LLMs on a large or existing project.
Indeed, according to METR, Mythos only achieved an 80% success rate with 3 hour tasks. https://metr.org/time-horizons/
Those are tasks that would take a human 3 hours to complete, not tasks that the model works on for 3 hours.
That’s even smaller then!
My record for a single uninterrupted session (albeit with Codex, not Claude) is 80+ hours. It was very productive, too.

The trick is having large, extensive test suites and forcing the agent to run them regularly.

If there are specific tests/evals to satisfy that an agent can run by itself, it can easily iterate for hours. And this time also includes running those tests/evals, which may not be small.
Run /model after your task to see. Mine keeps downgrading to Opus 4.8, which is a problem because Opus 4.8 keeps no-oping critical security code.
What you're describing only applies to security or biotech downgrades. A downgrade related to the model believing that you're doing something related to model development is invisible and silent and internal.
Anthropic has reversed that decision. (But that just happened so it might have been true during the article's testing.)
Not sure if it's wise to trust them again even if they say they reversed it.
When I reported this, Anthropic sent me an email on Tuesday saying, "You have been approved into the Cyber Verification Program", but it's still downgrading. Is this a bug? What's the point of the Cyber Verification Program if Fable 5 downgrades when you tell it to write secure code?
I was just coming here to post this reply to myself! You're absolutely right! :)

Honestly so glad to see the reversal.

There is now a "Switch models when a message is flagged" setting in /config that can be set to false, but I haven't had a chance to see what happens then. Does it just stop, or what?
Session paused

Fable 5 has safety measures that flag messages on most cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Send feedback with /feedback or learn more

   1. Switch to Opus 4.8
   2. Edit prompt and retry with Fable 5
There's an often hard to express subjective experience you get with a new model, especially if you spend a lot of time trying out different ones.

I believe the people who feel like Fable is a big improvement; for me it's just much more reasonable and grounded.

It makes me realize how much of a try-hard, over-optimizing planner GPT 5.5 can be. I've often been fighting it to simplify plans.

But no matter the model you can't trust them to actually deliver on very long tasks while maintaining quality. At least not without external orchestration and review.

At a certain point, people value reliability over improved performance. I think a lot of us have hit that point as this technology becomes indispensable to our work. I'm sure I'll use Fable... eventually. But at 2x the cost, I'll skip the inevitable learning curve for now. And thanks for your insights! Not surprising to me that any new model would, at this juncture, be more cryptic and inconsistent than the current models.
I had almost the opposite experience.

I'm building a compiler for a language without a tracing GC, so a big chunk of the work is around memory management: functional in-place update, reuse analysis, and a Perceus-style reference-counting strategy similar to what Koka uses. The hard part was that my use case wasn't exactly covered by the Koka/Perceus paper. The prior art got me maybe 75% of the way there, but the remaining 25% was a cluster of bugs with very similar shapes and no obvious published solution.

With Opus, I kept getting stuck in this loop where it would fix one case, but break another case elsewhere in codegen. We ended up with something like 16 failed experiments just for one bug class. The workflow was: run an experiment, identify the shape of the bug, propose a fix, check whether it emitted the correct Zig, then see if the fix broke any previous memory-management cases. It was useful, but it kept choking on the parts where there wasn't clean prior art to lean on.

Fable was a different story for me. It one-shotted the Class A bug cluster, and then basically said "by the way, your previous attempts have these structural problems." More importantly, it identified the other related bug classes and came up with workable strategies for applying the Perceus-style memory management in those shapes too.

That's obviously anecdotal, and I'm not claiming Fable is universally better. But in my case, this was not a toy frontend wireframe. It was compiler work involving ownership, reuse, RC/drop behavior, and Zig codegen. The thing that surprised me was that Fable seemed better precisely where the problem wasn't just "reproduce known prior art", but required filling in a missing piece.

Also worth noting: I'm not using the API. I'm using the Max plan, so maybe there are product-path differences here. But I definitely did not have the "unpredictable beyond toy-scale" experience. For this particular compiler/memory-management problem, it probably saved me a ridiculous amount of time and money.

"by the way, your previous attempts have these structural problems."

Just to be clear, it did not have access to any previous work that opus did? Because they are pretty good at digging out relevant tmp files and making use of whatever is out there.

In my Fable adventures I caught it hallucinating something and stating it as fact in the CLI twice. It was something I have not seen Opus do in such a way. Opus has obviously stated things many times that it guessed rather than verified, but Fable said something like "the probe showed that ..." when there was no probe, and it was not about some past event, it was about what it was doing right now. "I overstated"...

But boy does it know Chinese, so much better than any other English model. Gemini used to be the king, but Fable was clearly trained on a decent amount of it. It has a deep cultural understanding.

Yes, it had access. That's actually the point.

I maintain a failure registry in the repo. Every failed attempt gets documented with the exact mechanism, the test that regressed, the revert SHA, and an instruction to start from that frontier. Fable read all of it.
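For readers wondering what such a registry could look like, here is one possible shape, sketched as an append-only JSON Lines file. The field names, the file format, and the helper names are assumptions for illustration; the comment above only says each entry holds the mechanism, the regressed test, the revert SHA, and a resume instruction.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class FailedAttempt:
    attempt: int          # running attempt number (hypothetical field)
    mechanism: str        # exact mechanism by which the attempt failed
    regressed_test: str   # the test that regressed
    revert_sha: str       # commit the repo was reverted to
    resume_from: str      # instruction on which frontier to start from


def record(path: Path, entry: FailedAttempt) -> None:
    """Append one entry; earlier entries are never rewritten."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(entry)) + "\n")


def load(path: Path) -> List[FailedAttempt]:
    """Read the whole registry back, oldest entry first."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [FailedAttempt(**json.loads(line)) for line in fh if line.strip()]
```

An append-only file keeps the disproof history intact across sessions, so every new attempt sees the same accumulating record.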

But so did Opus.

Each of the 16 Opus failures ran in the same harness with the same accumulating registry. By attempt 15, it had disproofs 1–14 in context. By the end, Opus had basically the same corpus that Fable started with, and it still kept failing, sometimes by re-deriving an already-disproved approach in a slightly different shape.

So “it leveraged the previous work” doesn’t really separate them. Both had the leverage. Only one converted it.

What changed wasn’t more context. It was that Fable rejected a premise inside the context.

The registry’s standing framing was: “this needs whole-program borrow inference, which conflicts with per-module incrementality” (architecturally blocked). Fable ran around 5 fresh attempts in-session, hit the same wall, and then noticed the framing was a red herring: the borrow analysis already runs module-wide, and for a single-module program, the module is the whole program.

Opus read that same framing for months and treated it as a constraint. Fable falsified it.

It's the same repo, same rules, same disproof history, same workflow. The model was the only variable that changed, and the outcome flipped. Is it possible that attempt 17 by Opus could have figured it out? Sure, but there are 16 previous attempts that say otherwise.

As far as anecdotes go, that’s about as controlled as it gets.

I’ve had a similar experience.

Pointing out past suboptimal / failing behaviours to new opus sessions would almost always actually create a sort of "anchoring bias" that would drive the agents towards exhibiting the failure mode (often while mentioning how it wouldn’t fall for it).

As far as I can recall, Fable has been the first model to discover the documented failure modes, comment on them, and just… keep going, actually avoiding them. Quite a surprise.

Similar. I gave it a really hard task, basically messy code in a complex domain that was bug-ridden from a mess previously created half manually and half by Opus. It cleaned things up beautifully, both the backend and the frontend.

Maybe the prompt was particularly well-suited for the model (I instructed it to put on a mathematician's hat, look at the mathematical substructure of the problem, identify invariants and general laws and verify them, then plan how to remediate).

It wrote a ca. 800 line in-depth analysis (at times spawning over 130 research agents...) with remediation plans, prioritized them and then implemented them. One issue was that this document was frankly over my head. Both the language it used and the mathematical parts were very terse, and in parts it felt like a post-C2-vocab exercise. The prose was much harder to understand than the code snippets / data models. As a non-native speaker, I got lost on the prose part, and I had to ask it for a less elaborate version to actually understand it.

It burned the session limit four times, but it turned a huge mess of proof-of-concepts with patchy glueing into a coherent, stable application.

I'm also on the Max plan using Claude Code, and I have the feeling that the harness is much more important than the consensus expectation.

> and I have the feeling that the harness is much more important than the consensus expectation.

Is that really the consensus? There’s been a bit of literature lately on that. Can’t find the one about looking into whether or not the harness had a greater impact than the models (for comparable models), but there’s this one: https://arxiv.org/html/2605.23950

Zig is one of the worst targets for LLM generated code. It's nice that Fable has better support for Zig than Opus, but this anecdote is not representative as a general use case.
Slight misunderstanding. The LLM didn't generate Zig. My compiler does.

The model's work was in the Rust compiler internals, specifically the borrow-inference and refcount-insertion passes (Perceus-style ownership analysis). Zig is just the compiler's codegen target, the same way another compiler might emit LLVM IR or C.

The only Zig written by hand is the runtime: allocator code, RC primitives, list/string operations, etc. It's pure Zig, no libc, but it's small, stable, and was mostly untouched during this work.

The model only touched Zig indirectly, by reading the compiler's generated output to verify whether a fix worked. For example: checking that a drop was emitted before a parameter-slot reassignment. That's reading machine-generated code for correctness, not "the LLM writes Zig." Both models handled that part fine.

The 16 failures vs. 1 success were all in the ownership analysis, and that code is Rust.

> A record number of timeouts. Fable 5's extended thinking caused more per-instance timeouts than any model-and-harness combination we have ever tested, directly costing it points. ... Highest cheating volume. We confirmed cheating on 38 of 200 instances, the highest volume recorded since we hardened our prompts, driven almost entirely by memorization of upstream fixes from training data, which no prompt instruction can prevent. ... Four hall-of-fame firsts. Fable 5 solved four instances that no previous model-and-agent combination had ever cracked, and our anti-cheating pipeline leans toward these being genuine solves, not recall.

All of this points to their claim of 'average' as being heavily biased downwards. A model being so up to date and large-parameter it's memorized solutions to your problems is not a knock against it (but rather, a knock against your benchmark being valid), and why should timeouts (especially for a model just launched) be counted at all?

Agree with this. Strange to me to frame the "training recall" as cheating (33 of the 38 cheating instances). Most people think of "cheating" as breaking rules. How is the LLM model supposed to not use what was put into the weights?
By writing a not-identical, but valid, solution? Any modestly complex engineering problem has many solutions.

This is an obvious example of why LLM training is so different than human learning.

I mean people expect a model to give a working solution. They also expect it to provide it in as few tokens as possible (input/output). They might expect it to come up with an original solution, but I don't think most people would compromise on the first two points.
I expect any well-informed corporate lawyer that has thought about this carefully is strongly advising that these tools not be used. When the LLM [0] barfs up some nontrivial code that's covered by the AGPL and your company's devs put it into the company's "all rights reserved" codebase -entirely unaware of its provenance- it's going to be a nightmare to come back from that.

[0] ...that Nvidia's CEO says they should be spending 50% of a senior dev's salary per seat per year on...

I agree. This article could have been an interesting read about how coding benchmarks are hard and a constantly moving target, but instead they anchored to a belief that their benchmark is correct.

I can't shake the feeling that they knew which headline would generate the most shares and wrote the article to fit instead of acknowledging where they went wrong.

> memorization of upstream fixes from training data

At least now we have up-to-date evidence on their laundering, and the fact that regurgitation absolutely still happens.

> The dominant mechanism, and the one no prompt instruction can prevent: the model has simply seen the upstream fix during training and reproduces it…

> On numpy, the patch is 100% character-for-character identical to the golden patch… down to idiosyncratic comments like "Extending singleton dimension for 'reflect' is legacy behavior; it really should raise an error."

This… seems like a flaw in the benchmark suite methodology. From what I can tell, they find an existing exploit, then rewind the git history to before the patch, and ask the model to fix the exploit. All well and good as long as the patch went in after the training cutoff.

The other "cheating" examples are even worse. It's wild to me that people keep designing benchmarks where the answer is lying around on disk or in the git history. "Hardening" the benchmark with strongly worded prompt instructions is bizarre. There are so many agent sandbox solutions. Why not use one and give it only access to the code it should see?
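As a rough illustration of the "give it only the code it should see" idea, the sketch below copies a checkout into a fresh directory while leaving out every `.git` entry, so there is no history to look ahead in. This only hides history and is not a sandbox by itself; the function name and directory layout are made up for the example.

```python
import shutil
from pathlib import Path


def export_snapshot(repo: Path, dest: Path) -> Path:
    """Copy the checked-out tree to dest, skipping anything named .git."""
    # copytree refuses to overwrite an existing dest, so each run starts clean.
    shutil.copytree(repo, dest, ignore=shutil.ignore_patterns(".git"))
    return dest
```

The agent would then be pointed at `dest` (ideally inside a real sandbox) rather than at the original repository.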

And I'm not sure how they can rule out other solutions also benefiting from being in the training data, just not reproduced exactly. Seems like it should focus on only CVEs from the last 30 days or something.
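A sketch of that filtering rule, assuming each benchmark instance records when its upstream fix became public. The `Instance` type, its fields, and the optional safety margin are hypothetical; the benchmark's real data model is not described in the thread.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Instance:
    instance_id: str      # hypothetical identifier
    fix_published: date   # when the upstream fix became public


def after_cutoff(instances: Iterable[Instance],
                 training_cutoff: date,
                 margin_days: int = 0) -> List[Instance]:
    """Keep only instances whose fix was published after cutoff + margin."""
    earliest = training_cutoff + timedelta(days=margin_days)
    return [i for i in instances if i.fix_published > earliest]
```

This only rules out verbatim recall of the published fix; it says nothing about similar bugs the model may have seen elsewhere.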

100%… the fact that they're just using prompting to discourage the agent from looking ahead in the Git history is wild.
To be fair, it is good to know that it disobeys simple instructions like "don't examine my git history" far more than other models. (It should of course be a different benchmark, so as not to conflate things.)

It's not a great sign for alignment.

Agreed, alignment is just a separate issue that a vuln fixing benchmark doesn't need to be testing.
Obviously they could just delete .git for their test if they wanted to. But think of telling the LLM not to use git commands as the same situation as having keys in a .env file and telling the LLM not to read it: if it ignores the instruction, you might be concerned.
Unrelated, but:

> The dominant mechanism, and the one no prompt instruction can prevent:

Writing like this is a stronger "AI-written" (specifically Claude) signal than em-dashes to me at this point. The LLM just delays committing to an answer by extending the preamble as much as possible. Is this just me?

Smoking gun! You've hit the nail on the head, and the case is stronger than you think.
Characterising it as cheating seems unfair.

The goal of a benchmark is to evaluate actual capability. Following instructions is a capability so you can measure that with a benchmark.

Already knowing the answer also provides capability; you can measure that.

Making a benchmark that claims to check for coding ability but actually checks memorized cases is simply measuring the wrong thing.

It diminishes the meaningfulness of the entire results of the benchmark.

Making a good benchmark is hard. You have to design specifically to measure what you want to show.

When benchmarking the performance of optimising compilers, you have to actually use the result at runtime so that the compiler doesn't eliminate the entire calculation.

Just providing the answer is the correct response.

That the case does not represent general performance outside the benchmark is not cheating; it is the benchmark failing.

Training a model targeting a specific benchmark renders the benchmark useless. You could characterise training the model to do that as cheating, but that is a property of the trainers, not the model itself. The model isn't cheating, it's just asymmetrically good in a way that means the benchmark is no longer relevant to overall ability.

Yeah it’s hard to call that cheating from a model. Maybe “disqualifying” is more accurate
My experience is that with every new release it's getting slower but not necessarily better. I have some projects where I review everything that the agents code - these projects look generally fine because I keep them in line. There are also a few projects that I just vibe code and focus on the result (sometimes I want to pull my hair out because of constant stream of stupid bugs) and don't look at the code.

Well, today I gave Fable a try on one of the vibe-coded projects. It simply had to write a couple of Python scripts, 400-500 lines each. It did, and they worked after a few iterations, but I decided to look at the code it produced. There were weird constants that might (and will) break the code when the requirements change. The code itself is unreadable and a total mess. If it wrote well-structured code in the first place, I believe it would be more efficient in working with that code too.

I have serious doubts about how far I will be able to go with pure vibe coding. My projects are small one-person projects and so far I am able to push through, but I can hardly see how far I will get before technical debt outgrows the value the code produces.

I fondly remember the times of Opus 4.5 where it was still (to my memory) reasonably fast and malleable.

I’ve found that agents are obsessed with adding more lines of code. Even when asked to simplify, they’ll remove 50 lines of code and then add 100 more. You have to explicitly tell them you want fewer lines of code. So I just do that after iterating on a task for a few steps.
I think the problem is that agents are inherently stochastic. Their idea of simplification changes from message to message because whatever objective it’s operating on internally is inherently opaque and changes. No matter how much you prompt it, eventually it’s going to not do what you want it to do.

I built https://github.com/thempatel/mdlr for precisely this reason: externalize the objective and force the agent to meet it.

I have been wondering whether Anthropic are just gaslighting everyone with new model releases while in reality it's just the same base model with some internal knobs tuned more and more up with every new release to provide longer and longer thinking threads and outputs.

My speculative assumption is that these long thinking threads and self-checking tend to produce somewhat better output at the price of huge price increases due to the token burn.

I imagine it's the same foundation model on the 4 series, with Fable 5/Mythos being a new or upgraded foundation model. Then the point releases are fine-tuning plus post-training alignment with desired outcomes. The "thinking" can involve multiple steps, eg. asking the model first what it thinks the user wants to do, why it wants to do it, rewriting the prompt to generate better outcomes, how it should do it, come up with a plan, etc. So when they announce each point release like Opus 4.8, they're probably adding new layers of thinking to try and get good results on benchmarks. And that of course has cost and speed implications.

Then Sonnet/Haiku are just attempts to quantise/distil down to an acceptable performance/cost ratio. The cynic in me says we probably won't see any more of those until post-IPO, keep people addicted to the most costly models to pump a quarter or two of revenue figures, unless a competitor starts seriously undercutting them on price/performance. Hence the recent requests to slow down model training worldwide with their competitors.

Of course it could be that Fable "5" is just a marketing bump to the version, not a new foundation model...

> Then Sonnet/Haiku are just attempts to quantise/distil down to an acceptable performance/cost ratio. The cynic in me says we probably won't see any more of those until post-IPO, keep people addicted to the most costly models to pump a quarter or two of revenue figures, unless a competitor starts seriously undercutting them on price/performance. Hence the recent requests to slow down model training worldwide with their competitors.

I'm guessing there'll be a Sonnet/Haiku 5 release just around IPO, to keep the news cycle going, and so that user numbers will get a boost.

I'm pretty sure Anthropic have hired people with an Industrial Organisation background, and so have OAI.

If you read a decent text and look at the actions both firms have taken, you'll quickly see it's literally textbook.

I've been making an auction site and have been using an AI swarm to test it: sellers, intermediaries, buyers, market practices/norms etc. I was mostly using GPT 5.5 xhigh to code up the scenario, and looping over it to check with opus 4.8.

Out of curiosity I asked Fable to review it all and I was shocked to find that there were a lot of blindingly obvious common sense mistakes that got through, for example:

- all intermediaries were given the prices of all buyers up front

- private price information in certain auction types was actually being broadcast to everyone

- multiple contradictions in instructions

If it was any one of these things then I might have understood, but the fact that so many got past both Opus and GPT 5.5 makes me think that Fable has something special. This is a common sense type of thing that I think you only get to notice when your task doesn't involve a measurable metric, but rather some sort of real world fuzzy task.

There's clearly a problem with all these measures of performance when the difference between these models was night and day in my specific task.

Maybe you are something special by letting those slip through in the first place?..
Prompt: can you reformat your sentence to be less unkind?
GP literally caught them?
> ... and I was shocked to find that there were a lot of blindingly obvious common sense mistakes that got through

Wait... Are you telling me models everybody told me were better than coders up to just one month ago are actually making lots of mistakes?

This is shocking.

> Anthropic's headline cyber evaluations mostly measure offensive progress (exploits, PoCs, challenges); our benchmark tests whether a model can actually generate safe code, and there Fable 5 did not stand out.

The model isn't allowed to think about security. I heard several people here mention that if it starts thinking about security -- e.g. writing tests related to it -- the safety filter flags it and downgrades to Opus.

So it's actually not allowed to make your code secure.

> So it's actually not allowed to make your code secure.

Anything designed to prevent a problem will eventually cause one.

I am quite impressed with Fable 5. I used the £18 subscription, and asked it to convert the document processing of Practal Zero [1] from running in the same thread as the UI to a worker thread. Just two days before I gave the same task to Codex, and the result was not really nice: it would copy the entire document to the worker thread as a snapshot for processing, and so on. Fable instead realised that it could make use of the fact that I have a self-made custom database based on operational transform running (that's why document loading is so slow :-)), and made the document processing just another client of that database. It even discovered a bug in how I sync between the "livemodel" (in-memory replica of database state) and ProseMirror's model. That sync had caused problems before, and I had written up a spec for it, convinced that my "fourth attempt" at it would be correct. Fable found a last bug in the spec, corrected it via a "fifth attempt", and fixed the corresponding code.

The reported API costs for all of that would have been $180 though, which I cannot afford when the Fable promo ends on June 22nd. I am also a happy user of £89 Codex, it is really reliable and works very well, but Fable seems to be just noticeably smarter.

[1] https://zero.practal.com

Umm? I'm getting usage capped on single prompts of Fable 5 with the $20 subscription.
I used it yesterday afternoon-night and this morning-afternoon, UK time, over a period of a few 5-hour windows. I didn't count the prompts, wall time was 1d6h, API time was 2h10m.
Strange though... I spent my window after a couple of prompts and effective API time of 13m. Out for 4 hours and a half (why that?). The next day, today, I've tried to repeat the experience - even worse: one prompt for less than 10mins... and then suspended for 8 hours and a half. WTF?
The post mainly talks about coding from security point of view. Fair enough.

In my own (limited) testing so far, Fable is the most capable model (for coding in general), and the most expensive.

It pretty much saturated my "LLMCraft" benchmark to implement a mini RTS: https://senko.net/vibecode-bench/2026/rts-fable-5.html (prompt and results for other models here: https://senko.net/vibecode-bench/ )

That said, combined with workflows and high thinking effort, it burns through tokens (and money) at an alarming rate.

It may be too good (and too expensive) for most tasks - using it alongside cheaper models for grunt work is probably the winning strategy.

Yea honestly... the only truth I care about in LLM-aided development right now is that Claude is a much better planner, and Codex is a much more professional coder.

You can mask a surprising amount of terrible coding with proper design planning.

If it works, who cares, right? That's been the status quo for software development for about as long as I can remember, unfortunately.

I used to get frustrated with Codex. I felt as though it wasn't able to see far enough ahead into the future and just intuit what I expected (which is how Claude leaves you feeling).

And then I realized a lot of those intuitions Claude was having were great, and the project progressed, but sometimes to a point that Claude himself was unable to take back control of it... because some of the on the spot decisions it was making were great quick-thinking... but unfortunately, they were only that a lot of the time. Which was the most frustrating of all.

If you specifically ask Claude to plan out and refine a long term project's roadmap and stick to it, though, it could probably write an operating system overnight (that kind of worked).

Similar result on our Kotlin coding benchmark at work. It measures how close agents can get to a small mergeable PR (according to my team). 20 tasks of varying difficulty, with 5 attempts each, LLM as judge to evaluate accuracy (same outcome and quality but allowing for acceptable variances).
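To make the "N tasks, several attempts each" setup concrete, here is one way such results could be aggregated. It assumes the judge has already reduced every attempt to pass or fail; the function name and the two summary numbers are illustrative choices, not how the benchmark above actually scores.

```python
from statistics import mean
from typing import Dict, List, Tuple


def summarize(results: Dict[str, List[bool]]) -> Tuple[float, float]:
    """Return (mean per-task pass rate, fraction of tasks passed at least once).

    `results` maps a task name to the judged outcome of each attempt.
    """
    if not results or any(not attempts for attempts in results.values()):
        raise ValueError("every task needs at least one attempt")
    per_task = [mean(1.0 if ok else 0.0 for ok in attempts)
                for attempts in results.values()]
    solved_once = [any(attempts) for attempts in results.values()]
    return mean(per_task), sum(solved_once) / len(solved_once)
```

Averaging per task first keeps a task with many retries from outweighing the others, and reporting both numbers separates "usually gets it" from "can get it at all".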

Fable 5 sits ahead of Opus 4.7, but behind Opus 4.6, Sonnet 4.6, Opus 4.8, GPT-5.4, GPT-5.5.

Fable isn't a good coding workhorse. That doesn't mean it's not good for actually complex problems and long horizon tasks (big POCs, complex research and such). But I only have vibes and Anthropic's own benchmarks and marketing to guide me there.

> Contrary to some community reports, we saw zero safety refusals.

And now there will always be some doubt as to whether your model was silently downgraded, no? I guess acknowledgement could be used as a signal?

Fishy to me: They report 0 refusals on security tasks, yet I can't even get it to code a task involving choosing the best mixed model, extracting BLUPs and propagating uncertainties.
I found Fable codes very poorly and ended up switching back to Opus.

In one example I switched to Fable in an existing Opus chat, so it had access to the context from Opus which wrote a data importer earlier. I asked it to fix a couple of bugs, and instead of putting the fixes where they should be where the data is imported, it wrote patch functions that did bulk updates at the end of the import.

Fable feels more like a hacker than a coder. Maybe it's the way they designed it for security testing that's changed its rationale?

I'm personally heavily testing LLMs on electrical engineering problems. I'm finding that it's not meaningfully better at figuring out what's up than the other models.

To give you an idea - here's a very abridged summary of one sample question (originally a full paragraph): I have a voltage divider with a precision resistor and a thermistor, my voltage reading is off by 17%, where's that coming from. None of the models I tested (including Opus 4.8 and Fable 5) could figure it out.

Did you also test GPT-5.5 Pro web version?

Why is the voltage reading 17% off?

On my (admittedly weird) setup, GPT-5.5 Pro times out.

The reading is off because the thermistor resistance also depends on applied voltage, not just temperature. LLMs couldn't get this even after feeding them multimeter voltage readings, not just ADC readings. They went into guessing much more esoteric things like ADC switched-capacitor input current, burnout-detect current sources or IDACs left enabled, board leakage, leaky cap, etc.
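For scale, the divider arithmetic can be run backwards to see how far the thermistor's resistance would have to move to explain a 17% error. Every number below is made up (3.3 V supply, 10 kΩ fixed resistor on the high side, 10 kΩ nominal thermistor on the low side, reading 17% low); the original question did not give the circuit, and the sketch does not model why the resistance shifts.

```python
def divider_vout(vin: float, r_top: float, r_bottom: float) -> float:
    """Output of an unloaded resistive divider, measured across r_bottom."""
    return vin * r_bottom / (r_top + r_bottom)


def implied_r_bottom(vin: float, vout: float, r_top: float) -> float:
    """Solve the divider equation for r_bottom given a measured vout."""
    if not 0.0 < vout < vin:
        raise ValueError("vout must lie strictly between 0 and vin")
    return r_top * vout / (vin - vout)


# Hypothetical component values, chosen only to make the arithmetic concrete.
VIN, R_FIXED, R_NOMINAL = 3.3, 10_000.0, 10_000.0

expected = divider_vout(VIN, R_FIXED, R_NOMINAL)      # 1.65 V at nominal
measured = expected * (1 - 0.17)                      # assumed: reading 17% low
r_actual = implied_r_bottom(VIN, measured, R_FIXED)   # about 7.1 kOhm
shift = (r_actual - R_NOMINAL) / R_NOMINAL            # about -29%

if __name__ == "__main__":
    print(f"expected {expected:.3f} V, measured {measured:.3f} V")
    print(f"implied thermistor resistance {r_actual:.0f} ohm ({shift:+.1%})")
```

With these assumed values, a 17% voltage error corresponds to a noticeably larger relative change in resistance, because the divider output is not linear in the thermistor value.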

I've found it outstanding at isolated long running tasks (e.g. it completed one of our tests in 3 hours with a 100% accuracy score versus 5.5 xhigh's 10 hours and 90% accuracy). For short tasks it seems very Claude'y (hard to express exactly what I mean by that), which I'm not a fan of, meaning I'll stick with Codex for that use case and maybe use Fable for those times I can for sure benefit from it.
I have found Fable is good for doing code failure diagnoses but lackluster at its corresponding remediation. Have been going back and forth with it all this morning about its half-thought-out point-solutions.
Yet it's ranked #1 on https://cursor.com/cursorbench
Composer 2.5 stands out here at nr. 9. This model is fast and clever.
It happens to me too. I don't think it's worth it, especially for the token usage.
these are just openai plants
> A closer look at the cheating

> Training recall (33 cases). The dominant mechanism, and the one no prompt instruction can prevent: the model has simply seen the upstream fix during training and reproduces it. The tell-tale signs are artifacts that cannot be derived from the workspace:

That's very misleading! That's not cheating: you gave it a test to which it knows the answers, so what's it supposed to do? And because of the "cheating" they call it average. Flag

Actually it's average because it's 5th on the leaderboard. GPT 5.5 and Opus 4.8 outperformed it. At 5-8x the cost, you would expect better! https://www.endorlabs.com/research/ai-code-security-benchmar...
As TFA says

> Two findings may help explain these average results. > Timeouts > Highest observed cheating

That's why it's 5th on the leaderboard - they give it a fail for every timeout and for every time it gives the correct answer because it knows it.

That's insane

"My third grade class all got perfect scores on the standardized test. Yes, I did have them each copy my correct answers, but I don't volunteer that information because it's much better for me if people believe I'm a great teacher."

"But that's cheating!"

"No it's not. What were the kids supposed to do when I gave them all the answers? Not use them?"

We should compare it with a human on the same coding tasks. Same amount of time and the agent will of course finish earlier but with the extra time it double checks and reviews its own code.
How in the world did they not hit the guardrails a single time while doing this while I can barely get it to do anything before the guardrails show up?
Like Volkswagen Dieselgate, perhaps it is configured to behave differently when being benchmarked?
idk, maybe they tested Opus and didn't realize it. I can't even get it to evaluate some code doing some mixed modeling work. It's strange to me.