by radarsat1 101 days ago
> Is it speed?

> Is it that you can backprop through this computation? Do you do so?

With respect, I feel that you may not have read the article.

> Because the execution trace is part of the forward pass, the whole process remains differentiable: we can even propagate gradients through the computation itself. That makes this fundamentally different from an external tool. It becomes a trainable computational substrate that can be integrated directly into a larger model.

and,

> By storing points across nested convex hulls, this yields a decoding cost of O(k + log n).

and,

> Regardless of their eventual capability ceiling, they already suggest a powerful systems primitive for speeding up larger models.

So yes, and yes.
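
For anyone wondering why convex hulls show up at all, here is a minimal sketch of my reading, which is an assumption and not the article's actual construction: if keys are 2D points and the score is a dot product with the query, the best-scoring key always sits on the convex hull, so a hard lookup never has to consider interior points. The helper names (`convex_hull`, `best_score`) and the random keys are made up for illustration, and the hull is scanned linearly here rather than searched in logarithmic time.

```python
import random

def cross(o, a, b):
    # Z-component of (a - o) x (b - o); positive means a counter-clockwise turn.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    # Andrew's monotone chain; returns hull vertices in counter-clockwise order.
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def best_score(query, keys):
    # Highest dot-product score of the query against any key.
    return max(query[0] * k[0] + query[1] * k[1] for k in keys)

random.seed(0)
# Hypothetical 2D keys with integer coordinates, so comparisons are exact.
keys = [(random.randint(-50, 50), random.randint(-50, 50)) for _ in range(200)]
hull = convex_hull(keys)
query = (3, -7)

full_best = best_score(query, keys)  # scan every key
hull_best = best_score(query, hull)  # scan only the hull vertices
```

Both scans return the same best score because a linear function over a point set is maximized at a hull vertex. Getting from there to a logarithmic lookup needs a search over the ordered hull, which this sketch does not attempt.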

> Where are the benchmarks?

It's not clear what they should benchmark it against. They do compare speed to a normal KV cache. As for performance: if it's actually executing a Sudoku solver with a 100% success rate, it seems pretty trivial to find a model with a success rate below 100%. Sure, it would be nice to see the data here; I agree with you there.

Personally I think it would be really interesting to see if this method can be combined with a normal model, MoE-style. It is likely possible: the router module should pick up quite quickly that it predicts the right tokens for some subset of problems deterministically. I like the idea of embedding all sorts of general solvers directly into the model, like a Prolog solver for example. In fact it never would have occurred to me to just go straight for WASM; it's a pretty interesting choice to directly embed a VM. But it makes me wonder what "smaller" interpreters could be useful in this context.
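
To make the router intuition concrete, here is a toy sketch under loudly made-up assumptions: two "experts", one exact executor that is always right on problems it can run and always wrong otherwise, and one ordinary learned model that is right 60% of the time everywhere. The accuracies, the single `is_executable` feature, and the plain gradient-ascent loop are all hypothetical. None of it comes from the article.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical per-expert accuracies, chosen only for illustration.
LEARNED_ACC = 0.6

def exact_acc(is_executable):
    # The exact executor is right only on problems it can actually run.
    return 1.0 if is_executable else 0.0

def route_prob(w, b, is_executable):
    # Probability that the router sends the problem to the exact executor.
    return sigmoid(w * is_executable + b)

w, b = 0.0, 0.0
learning_rate = 1.0
for _ in range(500):
    for x in (1, 0):  # one executable problem, one non-executable problem
        p = route_prob(w, b, x)
        # Expected accuracy is p * exact_acc + (1 - p) * LEARNED_ACC,
        # so its gradient with respect to the router logit is:
        grad_logit = (exact_acc(x) - LEARNED_ACC) * p * (1.0 - p)
        w += learning_rate * grad_logit * x
        b += learning_rate * grad_logit

p_executable = route_prob(w, b, 1)
p_other = route_prob(w, b, 0)
```

In this toy setup the router ends up preferring the executor for executable problems and the learned model for everything else. A real MoE router would have to discover the equivalent of `is_executable` from token representations, which is the hard part this sketch skips.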

3 comments

I read the article and had the same question. It's written in such a way that it feels like it's answering these questions without actually doing so.

The right thing to benchmark against isn't a regular transformer, it's a transformer that writes programs that are then interpreted. They have a little visual demo where it looks faster but only because they make Python absurdly slow, and it's clearly not meant to be a real benchmark.

I spent the whole article thinking, wow, cool, but also ... how is this better than an LLM steering a regular computer? The closest we get is a statement about the need to "internalize what computation is" which doesn't say anything to me.

Fundamentally, running actual instructions on a real CPU is always going to be faster than running them via a neural network. So the interesting part is where they say you can backprop through it. But backprop is for cases where we don't know how to encode a function using strict logic. Why would you try to backprop through a Sudoku solver? It's probably just that my imagination is limited, but I could have used more on that.

Benchmark it against a fast Python interpreter optimized for AI tool calling, like Monty: https://github.com/pydantic/monty

Did you read the post you are responding to? It says:

> What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so?

The correct parsing of this is: "What's the benefit? [...] Is it [the benefit] that you can backprop through this computation? Do you do so?"

There are no details about training nor the (almost-certainly necessarily novel) loss function that would be needed to handle partial / imperfect outputs here, so it is extremely hard to believe any kind of gradient-based training procedure was used to determine / set weight values here.

> There are no details about training

My understanding was that they are not training at all, which would explain that. They are compiling an interpreter down to a VM that has the shape of a transformer.

I.e., they are calculating the transformer weights needed to execute the operations of the machine they are generating code for.
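
A rough sketch of what "calculating the weights" can mean, as opposed to training them: below, a single attention-style read implements an exact memory lookup because the query and keys are written down by hand as one-hot vectors. This is only my illustration of the general idea. The function name `attention_read`, the `sharpness` constant, and the memory contents are hypothetical and are not taken from the article.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]

def attention_read(address, memory, sharpness=50.0):
    # Keys are fixed one-hot address codes; the query is the one-hot code of
    # the address we want, scaled so that softmax is effectively an argmax.
    n = len(memory)
    query = [sharpness * x for x in one_hot(address, n)]
    keys = [one_hot(i, n) for i in range(n)]
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # The output is the attention-weighted sum of the stored values.
    return sum(w * v for w, v in zip(weights, memory))

memory = [7.0, 3.0, 9.0, 4.0]  # hypothetical memory cells
value = attention_read(2, memory)
```

Nothing here was learned: the "weights" follow directly from the operation being implemented (read the cell at a given address), yet the computation is still a softmax attention step.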

This is my interpretation as well.

EDIT: Actually, they do make this clear(ish) at the very end of the article, technically. But there is a huge amount of vagueness and IMO outright misleading / deliberately deceptive stuff early on (e.g. about potential differentiability of their approach, even though they admit later they aren't sure if the differentiable approach can actually work for what they are doing). It is hard to tell what they are actually claiming unless you read this autistically / like a lawyer, but that's likely due to a lack of human editing and too much AI assistance.