

by weatherlight 48 days ago

I had almost the opposite experience.

I'm building a compiler for a language without a tracing GC, so a big chunk of the work is around memory management: functional in-place update, reuse analysis, and a Perceus-style reference-counting strategy similar to what Koka uses. The hard part was that my use case wasn't exactly covered by the Koka/Perceus paper. The prior art got me maybe 75% of the way there, but the remaining 25% was a cluster of bugs with very similar shapes and no obvious published solution.
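For readers who haven't met the idea, here is a minimal sketch of functional in-place update under reference counting. It assumes a toy `Cell` type with an explicit `rc` field; every name is hypothetical, and it is not the compiler's actual implementation. A Perceus-style compiler inserts the dup/drop operations and the uniqueness check during compilation, whereas this toy models them by hand at runtime.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """A toy heap cell with an explicit reference count (hypothetical)."""
    value: int
    rc: int = 1


def dup(cell: Cell) -> Cell:
    """Take one more reference to the cell."""
    cell.rc += 1
    return cell


def drop(cell: Cell) -> None:
    """Give up one reference to the cell."""
    cell.rc -= 1


def increment(cell: Cell) -> Cell:
    """Return a cell holding value + 1, consuming the caller's reference."""
    if cell.rc == 1:
        # Unique owner: nobody else can observe the mutation, so reuse the cell.
        cell.value += 1
        return cell
    # Shared: release our reference and allocate a fresh cell instead.
    drop(cell)
    return Cell(cell.value + 1)


unique = Cell(1)
reused = increment(unique)   # same object, updated in place

shared = Cell(10)
alias = dup(shared)          # a second reference exists
fresh = increment(shared)    # new object; the alias still sees 10
```

The hard cases described here are the ones where the compiler has to decide statically which of the two branches applies.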

With Opus, I kept getting stuck in this loop where it would fix one case, but break another case elsewhere in codegen. We ended up with something like 16 failed experiments just for one bug class. The workflow was: run an experiment, identify the shape of the bug, propose a fix, check whether it emitted the correct Zig, then see if the fix broke any previous memory-management cases. It was useful, but it kept choking on the parts where there wasn't clean prior art to lean on.

Fable was a different story for me. It one-shotted the Class A bug cluster, and then basically said "by the way, your previous attempts have these structural problems." More importantly, it identified the other related bug classes and came up with workable strategies for applying the Perceus-style memory management in those shapes too.

That's obviously anecdotal, and I'm not claiming Fable is universally better. But in my case, this was not a toy frontend wireframe. It was compiler work involving ownership, reuse, RC/drop behavior, and Zig codegen. The thing that surprised me was that Fable seemed better precisely where the problem wasn't just "reproduce known prior art", but required filling in a missing piece.

Also worth noting: I'm not using the API. I'm using the Max plan, so maybe there are product-path differences here. But I definitely did not have the "unpredictable beyond toy-scale" experience. For this particular compiler/memory-management problem, it probably saved me a ridiculous amount of time and money.

5 comments

comboy 48 days ago

> "by the way, your previous attempts have these structural problems."

Just to be clear, it did not have access to any previous work that Opus did? Because these models are pretty good at digging out relevant tmp files and making use of whatever is out there.

In my Fable adventures, I twice caught it hallucinating something and stating it as fact in the CLI. That was something I had not seen Opus do in quite this way. Opus has obviously stated many things that it had guessed rather than verified, but Fable said something like "the probe showed that ...", and there was no probe. It was not about some past event; it was about what it was doing right now. "I overstated"...

But boy does it know Chinese, so much better than any other English model. Gemini used to be the king, but Fable was clearly trained on a decent amount of it. It has a deep cultural understanding.


Al-Khwarizmi 47 days ago

If you have some spare time, I'd be interested in knowing what kind of questions you use to test models on understanding of Chinese culture.


comboy 45 days ago

I'm creating hanzirama.com

I generate explanations for characters and words like so: https://hanzirama.com/character/%E6%9D%A5#explain

But I don't want to mislead learners and I want to provide some cultural depth, so I have a whole sophisticated pipeline: multiple models generate the explanation, then multiple models look for issues in it, and each issue goes through a panel of judges (basically trying to squash any hallucinations). The explanation is then fixed, and it goes through this cycle a few times over.
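A rough sketch of that generate / find issues / judge / fix loop, with the models replaced by plain callables. Everything here is an assumption made for illustration: the function names, the single generator (the real pipeline uses several), the majority-vote rule for the judges, and the fixed cycle limit.

```python
from typing import Callable, List

Generator = Callable[[str], str]           # item -> explanation
IssueFinder = Callable[[str], List[str]]   # explanation -> suspected issues
Judge = Callable[[str, str], bool]         # (explanation, issue) -> is it real?
Fixer = Callable[[str, List[str]], str]    # (explanation, issues) -> revised text


def confirmed_issues(explanation: str, finders: List[IssueFinder],
                     judges: List[Judge]) -> List[str]:
    """Keep only the issues that a strict majority of judges accept."""
    issues: List[str] = []
    for find in finders:
        for issue in find(explanation):
            if issue in issues:
                continue
            votes = sum(judge(explanation, issue) for judge in judges)
            if votes * 2 > len(judges):
                issues.append(issue)
    return issues


def refine(item: str, generate: Generator, finders: List[IssueFinder],
           judges: List[Judge], fix: Fixer, max_cycles: int = 3) -> str:
    """Generate once, then repeat find/judge/fix until clean or out of cycles."""
    explanation = generate(item)
    for _ in range(max_cycles):
        issues = confirmed_issues(explanation, finders, judges)
        if not issues:
            break
        explanation = fix(explanation, issues)
    return explanation


# Toy stand-ins for the models, for demonstration only.
def toy_generate(item: str) -> str:
    return f"{item}: draft explanation [unsupported claim]"


def toy_finder(explanation: str) -> List[str]:
    marker = "[unsupported claim]"
    return [marker] if marker in explanation else []


def toy_fix(explanation: str, issues: List[str]) -> str:
    for issue in issues:
        explanation = explanation.replace(" " + issue, "")
    return explanation


toy_judges = [lambda e, i: True, lambda e, i: True, lambda e, i: False]
result = refine("来", toy_generate, [toy_finder], toy_judges, toy_fix)
```

With these stubs, two of the three judges confirm the flagged claim, the fixer strips it, and the second cycle finds nothing left to fix.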

I've been at it for some months now, so I have dozens of different probes that I needed in order to evaluate prompt and method changes. Plus, for some items I have generated so many explanations through different means that I can tell a lot about a given model just by looking at one.

Plus I'm doing some statistics, so I can see how, for example, some models correlate heavily with others when working as judges of issues... Fun fact: during some testing runs, basically just testing providers, I stumbled upon Qwen introducing itself as made by Google. And also Anthropic's Sonnet saying that it was made by OpenAI :)

At this point all my evaluation frameworks and pipeline stuff are much bigger than the site itself. I'm having lots of fun though.


weatherlight 48 days ago

Yes, it had access. That's actually the point.

I maintain a failure registry in the repo. Every failed attempt gets documented with the exact mechanism, the test that regressed, the revert SHA, and an instruction to start from that frontier. Fable read all of it.

But so did Opus.

Each of the 16 Opus failures ran in the same harness with the same accumulating registry. By attempt 15, it had disproofs 1–14 in context. By the end, Opus had basically the same corpus that Fable started with, and it still kept failing, sometimes by re-deriving an already-disproved approach in a slightly different shape.

So “it leveraged the previous work” doesn’t really separate them. Both had the leverage. Only one converted it.

What changed wasn’t more context. It was that Fable rejected a premise inside the context.

The registry’s standing framing was: “this needs whole-program borrow inference, which conflicts with per-module incrementality” (in other words, architecturally blocked). Fable ran around 5 fresh attempts in-session, hit the same wall, and then noticed the framing was a red herring: the borrow analysis already runs module-wide, and for a single-module program, the module is the whole program.

Opus read that same framing for months and treated it as a constraint. Fable falsified it.

It's the same repo, same rules, same disproof history, same workflow. The model was the only variable that changed, and the outcome flipped. Is it possible that attempt 17 by Opus could have figured it out? Sure, but there are 16 previous attempts that say otherwise.

As far as anecdotes go, that’s about as controlled as it gets.
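For anyone wanting to try the same workflow, the failure registry could be modeled roughly like this. It is a sketch under assumptions: the class and field names are hypothetical, the entries are placeholders, and the real registry is documentation kept in the repo, not Python.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailedAttempt:
    """One documented failure (field names are hypothetical)."""
    number: int
    mechanism: str         # the exact mechanism by which the fix failed
    regressed_test: str    # the test that regressed
    revert_sha: str        # the commit that reverted the attempt
    next_instruction: str  # e.g. "start from this frontier"


class FailureRegistry:
    """An append-only log of disproved approaches."""

    def __init__(self) -> None:
        self._attempts: List[FailedAttempt] = []

    def record(self, attempt: FailedAttempt) -> None:
        self._attempts.append(attempt)

    def context_for(self, attempt_number: int) -> List[FailedAttempt]:
        """Every earlier disproof that a new attempt gets to read."""
        return [a for a in self._attempts if a.number < attempt_number]


registry = FailureRegistry()
for n in range(1, 15):
    registry.record(FailedAttempt(
        number=n,
        mechanism=f"placeholder mechanism {n}",
        regressed_test=f"placeholder_test_{n}",
        revert_sha="<sha>",
        next_instruction="start from this frontier",
    ))

# Attempt 15 starts with disproofs 1 through 14 in context.
context = registry.context_for(15)
```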


ElFitz 47 days ago

I’ve had a similar experience.

Pointing out past suboptimal / failing behaviours to new Opus sessions would almost always create a sort of "anchoring bias" that would drive the agents towards exhibiting the failure mode (often while mentioning how it wouldn’t fall for it).

As far as I can recall, Fable has been the first model to discover the documented failure modes, comment on them, and just… keep going, actually avoiding them. Quite a surprise.


cmenge 48 days ago

Similar. I gave it a really hard task, basically messy code in a complex domain that was bug-ridden from a mess previously created half manually and half by Opus. It cleaned things up beautifully, both the backend and the frontend.

Maybe the prompt was particularly well-suited for the model (I instructed it to put on a mathematician's hat, look at the mathematical substructure of the problem, identify invariants and general laws and verify them, then plan how to remediate).

It wrote a ca. 800-line in-depth analysis (at times spawning over 130 research agents...) with remediation plans, prioritized them and then implemented them. One issue was that this document was frankly over my head. Both the language it used and the mathematical parts were very terse, and in parts it felt like a post-C2-vocab exercise. The prose was much harder to understand than the code snippets / data models. As a non-native speaker, I got lost in the prose and had to ask it for a less elaborate version to actually understand it.

It burned the session limit four times, but it turned a huge mess of proof-of-concepts with patchy glueing into a coherent, stable application.

I'm also on the Max plan using Claude Code, and I have the feeling that the harness is much more important than the consensus expectation.


ElFitz 47 days ago

> and I have the feeling that the harness is much more important than the consensus expectation.

Is that really the consensus? There’s been a bit of literature lately on that. Can’t find the one about looking into whether or not the harness had a greater impact than the models (for comparable models), but there’s this one: https://arxiv.org/html/2605.23950


selimthegrim 47 days ago

whoa, my university!


miroljub 48 days ago

Zig is one of the worst targets for LLM-generated code. It's nice that Fable has better support for Zig than Opus, but this anecdote is not representative of the general use case.


queuebert 47 days ago

Why is that?


weatherlight 48 days ago

Slight misunderstanding. The LLM didn't generate Zig. My compiler does.

The model's work was in the Rust compiler internals, specifically the borrow-inference and refcount-insertion passes (Perceus-style ownership analysis). Zig is just the compiler's codegen target, the same way another compiler might emit LLVM IR or C.

The only Zig written by hand is the runtime: allocator code, RC primitives, list/string operations, etc. It's pure Zig, no libc, but it's small, stable, and was mostly untouched during this work.

The model only touched Zig indirectly, by reading the compiler's generated output to verify whether a fix worked. For example: checking that a drop was emitted before a parameter-slot reassignment. That's reading machine-generated code for correctness, not "the LLM writes Zig." Both models handled that part fine.
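To make that kind of check concrete, here is a small sketch that scans emitted lines and confirms a drop appears before each reassignment of a slot. The output shape is assumed purely for illustration: the `rc_drop(...)` call name and the `slot = ...;` assignment form are hypothetical, not what the compiler actually emits.

```python
import re
from typing import List


def drop_precedes_reassignment(lines: List[str], slot: str,
                               drop_call: str = "rc_drop") -> bool:
    """True if every reassignment of `slot` follows a drop of its old value.

    Assumes (hypothetically) that drops look like `rc_drop(slot)` and
    reassignments look like `slot = ...;` at the start of a line.
    """
    drop_re = re.compile(
        rf"\b{re.escape(drop_call)}\(\s*{re.escape(slot)}\s*\)")
    assign_re = re.compile(rf"^\s*{re.escape(slot)}\s*=[^=]")
    dropped = False
    for line in lines:
        if drop_re.search(line):
            dropped = True
        elif assign_re.search(line):
            if not dropped:
                return False   # old value overwritten without a drop
            dropped = False    # the new value needs its own drop later
    return True


good_output = ["rc_drop(xs);", "xs = next;"]
leaky_output = ["xs = next;"]
ok = drop_precedes_reassignment(good_output, "xs")
leak = drop_precedes_reassignment(leaky_output, "xs")
```

Here `ok` is true and `leak` is false, which is the whole point of the check: the second output overwrites the slot while it still owns a reference.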

The 16 failures vs. 1 success were all in the ownership analysis, and that code is Rust.


discardable_dan 47 days ago

You should consider doing the hard work yourself here. I sat down and reasoned through a Perceus-style RC mechanism a few years ago, made difficult by the presence of one-shot delimited continuations, and actually sorting it all out was not hard. Handing the correct semantics to Claude will produce the correct results if you take the time to understand the actual work you are attempting.


59nadir 47 days ago

Do you have a docs page for your language? What is it called?
