I don't think the Bitter Lesson applies to ASTs.
From the Bitter Lesson:
"Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better."
Those models are taking advantage of inductive biases. Every model has them, including the massive language models. They are not the same as engineered features (such as SIFTs) or heuristics.
Using the AST is just another way of looking at the code already in your dataset. For the model to understand what it is writing, it needs to map the text sequences to ASTs anyway. It can attempt to learn this mapping, but the 12B model still produces invalid Python, so it clearly hasn't.
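One cheap way to see whether a generated snippet maps to an AST at all is to hand it to Python's own parser. This is a minimal sketch using the standard `ast` module; `is_valid_python` is just an illustrative helper name, and it only checks syntax, not whether the code does anything sensible.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses into a Python AST (syntax check only)."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# A well-formed function parses; one with an unclosed parenthesis does not.
print(is_valid_python("def add(a, b):\n    return a + b\n"))
print(is_valid_python("def add(a, b:\n    return a + b\n"))
```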
"The Bitter Lesson" is very interesting, thank you! However, I wonder if AST vs. text analysis is fully comparable to the examples given in the post. Applying human concepts to chess, Go, image processing, etc. lost out to statistical methods, but I don't think AST vs. text is exactly the same argument. IMO, using an AST is simply a more accurate representation of a program and doesn't necessarily imply an attempt to bring in human intuition/concepts.
I mean, the AST doesn't help at all with comments, which are potentially the most valuable part of the code to an AI like this. Formatting is also ignored by the AST but may play a role in understanding, just as it can for humans.
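For Python specifically, this is easy to check with the standard library. The sketch below round-trips a small made-up snippet through `ast.parse` and `ast.unparse` (the latter needs Python 3.9+) and, for contrast, pulls the comment back out of the raw token stream with `tokenize`. Other languages and parsers may behave differently; some syntax-tree libraries do keep comments.

```python
import ast
import io
import tokenize

# Made-up example input: one trailing comment and some unusual spacing.
source = "x = 1  # the answer\ny   =   [1,\n        2]\n"

# Round-trip through the AST: the comment and the original layout are lost.
roundtrip = ast.unparse(ast.parse(source))

# The token stream, by contrast, still carries the comment text.
comments = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type == tokenize.COMMENT
]

print(roundtrip)
print(comments)
```

So a text or token view keeps information that the plain AST view throws away, even though both describe the same program.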
The model can clearly already generate large amounts of code with no syntax errors in one shot. It's probably better at that than I am; I always need to fix something after typing a bunch of code without calling the compiler. I think that instead of adding a bunch of language-specific AST stuff, it would be far better to simply give the model the ability to iterate on its solution the way humans do, fixing any syntax errors or logic bugs discovered by the compiler or at runtime. That could potentially work in a generic way for any language. It seems like the obvious next step, though figuring out how to train it is not obvious.
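At inference time, the iterate-on-feedback idea might look roughly like the loop below. This is only a sketch under stated assumptions: `repair_loop` and `fake_model` are hypothetical names, the "model" is a hard-coded stand-in that fixes its output once it sees an error message, and only syntax errors from Python's built-in `compile` are fed back. It says nothing about how one would train a model to use the feedback, which is the hard part.

```python
from typing import Callable, Optional


def repair_loop(
    generate: Callable[[str, Optional[str]], str],
    prompt: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Ask `generate` for code, feeding compiler errors back until it compiles."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = generate(prompt, feedback)
        try:
            compile(candidate, "<candidate>", "exec")  # syntax check only, nothing is run
        except SyntaxError as err:
            feedback = f"line {err.lineno}: {err.msg}"
            continue
        return candidate
    return None  # gave up after max_rounds attempts


def fake_model(prompt: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for a code model: broken first, fixed after feedback."""
    if feedback is None:
        return "def add(a, b:\n    return a + b\n"
    return "def add(a, b):\n    return a + b\n"


print(repair_loop(fake_model, "add two numbers"))
```

Swapping `compile` for a call to another language's compiler or a test runner would keep the loop itself language-agnostic.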