by aesthesia 63 days ago
I'm not totally convinced by this:

> It might appear that this is an argument against scale, and the Bitter Lesson. That is not the case. I see this as a move that lets scale do its work on the right object. As with chess, where encoding the game rules into training produces a leap that no amount of inference-time search can today match, the move here is to encode the programming language itself into the training, and apply scale on a structure that actually reflects what we’re trying to produce.

One way to think of the bitter lesson as it applies to generative models is that ~all data carries some information about the structure of reality, and architectures that let you train on more data are better because they learn better underlying world models. Knowledge transfers: LLMs are good at writing code partly because they've seen a lot of code, but also because they understand (at least to some extent) the relationship between that code and the rest of the world. Constraining a model's output structure also constrains the data that is available to train it. So the big question is whether you can actually meaningfully scale training with these kinds of strictly structured outputs.

2 comments

At the same time, treating everything as tokens and next-word prediction will never produce any real understanding like what humans develop when they learn how to program. The bitter lesson is an admission that we still have no clue what is at the core of human learning and reasoning, so we have to brute-force it with tons of data generated by humans. I also don't know if expert systems and ML techniques like feature extraction are really any worse in practice, or if we just didn't have enough engineering resources or a proper way to organize and scale their development. They seemed to work quite well in a lot of cases, with more predictable results and several orders of magnitude less compute. And LLMs still suffer from the long-tail problem despite their insane amounts of data.

If we're at the end of the data and most new data is now produced by LLMs with little human oversight, where do we go? Seems like figuring out ways to mix LLMs with more structured models that can reliably handle important classes of problems is the next logical step. In a way that is what programming languages and frameworks/libraries are doing, but they've massively disincentivized work on those by claiming that LLMs will do everything.

The chess example is a good one: it's effectively solved, so why shouldn't an LLM have a submodule that it can use to play chess and save some energy?

Author here - thanks for engaging.

> One way to think of the bitter lesson as it applies to generative models is that ~all data carries some information about the structure of reality

Completely agree. It might not have come across, but what I'm pointing out in the post is that the data as it is currently encoded in the models is needlessly lossy. Tokens do not reveal all the information we have at our disposal. In natural language, that's fine, because it's quite loose in structure.

But if our domain is heavily structured (like modern programming languages are), why reveal only snippets of linearised syntax of that structure to the model? Why not reveal the full structure we have at our disposal?
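To make the contrast concrete, here is a rough sketch (not taken from the post) of the two views of one line of code, using Python's standard `tokenize` and `ast` modules. The sample statement and variable names are made up for illustration, and Python's own lexer tokens are only a stand-in for an LLM's subword tokens, which are a different thing.

```python
import ast
import io
import tokenize

# Hypothetical one-line program, chosen only for illustration.
source = "total = price * (1 + tax)\n"

# Linearised view: a flat sequence of token strings.
# Grouping is only implied by the parentheses appearing in the stream.
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type not in (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER)
]

# Structured view: the same statement as a tree.
# Nesting is explicit, so the parentheses no longer need to exist.
tree = ast.parse(source)
assign = tree.body[0]

print(tokens)
print(ast.dump(assign, indent=2))  # indent= needs Python 3.9+
```

In the flat list, the fact that `1 + tax` is evaluated before the multiplication has to be inferred from the bracket tokens. In the tree, it is simply the right child of the multiplication node.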

> and architectures that let you train on more data are better because they learn better underlying world models.

By this argument, wouldn't we conclude that training on chess using the game structure wouldn't work either, since that'd be a model that uses less data?

Less data is the point, isn't it?