by lstmemery 1768 days ago
I have to disagree with you here. In the Codex paper[1], they have two datasets that Codex got correct about 3% of the time. These are interview and code competition questions. From the paper:

"Indeed, a strong student who completes an introductory computer science course is expected to be able to solve a larger fraction of problems than Codex-12B."

This suggests to me that Codex really doesn't understand anything about the language beyond syntax. I have no doubt that future systems will improve on this benchmark, but they will likely take advantage of the AST and could use unit tests in an RL-like reward function.
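To make the unit-test idea concrete, here is a minimal sketch of what such a reward could look like. This is not from the Codex paper; the function name `unit_test_reward`, the `(args, expected)` test format, and the entry-point name `solution` are all assumptions for illustration.

```python
def unit_test_reward(candidate_source, tests, func_name="solution"):
    """Return the fraction of test cases passed by the generated code.

    `tests` is a list of (args, expected) pairs (a hypothetical format).
    Code that fails to parse or does not define `func_name` scores 0.0.
    """
    namespace = {}
    try:
        # NOTE: exec on untrusted model output is unsafe; a real system
        # would run this inside a sandbox with time and memory limits.
        exec(candidate_source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0

    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            # A crash on one test counts as a failure for that test only.
            pass
    return passed / len(tests) if tests else 0.0
```

A partial score like this gives a denser signal than a plain pass/fail, though whether that helps training in practice is an open question here.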

[1] https://arxiv.org/abs/2107.03374


> but they will likely take advantage of the AST

In the end, a more general approach with more compute always wins over applying domain knowledge, such as taking advantage of the AST. This is called “the bitter lesson”. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

I don't think the bitter lesson applies to ASTs.

From the Bitter Lesson:

"Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better."

Those models are taking advantage of inductive biases. Every model has them, including the massive language models. They are not the same as engineered features (such as SIFTs) or heuristics.

Using the AST is just another way of looking at the code already in your dataset. For the model to understand what it is writing, it needs to map the text sequences to ASTs anyways. It can attempt to learn this, but the 12B model still makes illegal Python code so it clearly hasn't.
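For what it's worth, checking whether a sample is legal Python is cheap with the standard library. A minimal sketch (the helper name `is_legal_python` is made up for this example):

```python
import ast


def is_legal_python(source):
    """Return True if `source` parses into a Python AST, False otherwise."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        # SyntaxError covers malformed code; ValueError can be raised for
        # source containing null bytes on some Python versions.
        return False
```

This only says whether the text maps to an AST at all; it says nothing about whether the program does what was asked.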

"The bitter lesson" is very interesting, thank you! However, I wonder if AST vs. text analysis is fully comparable to the examples given in the post. Applying human concepts to chess, Go, image processing, etc. lost out to statistical methods, but I don't think AST vs. text is exactly the same argument. IMO, an AST is simply a more accurate representation of a program and doesn't necessarily imply an attempt to bring in human intuition/concepts.
I mean, the AST doesn't help at all with comments which are potentially the most valuable part of the code to an AI like this. Formatting is also ignored by the AST but may play a role in understanding, just as it can for humans.
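A quick way to see this with Python's own `ast` module (just an illustration; the two snippets are made up): two sources that differ only in a comment and in spacing produce the same tree.

```python
import ast

# Same statement, once with a comment and spaces, once without.
with_comment = "x = 1  # number of retries\n"
without_comment = "x=1\n"

# ast.dump omits line/column attributes by default, so this compares
# only the structure of the two trees.
same_tree = ast.dump(ast.parse(with_comment)) == ast.dump(ast.parse(without_comment))
```

Here `same_tree` comes out `True`: the comment and the formatting never reach the AST, so a model that only saw the tree would lose both.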

The model can clearly already generate large amounts of code with no syntax errors in one shot. It's probably better at that than I am; I always need to fix something after typing a bunch of code without calling the compiler. I think that instead of adding a bunch of language-specific AST stuff, it would be far better to simply give the model the ability to iterate on its solution the way humans do, fixing any syntax errors or logic bugs discovered by the compiler or at runtime. That could potentially work in a generic way for any language. It seems like the obvious next step, though figuring out how to train it is not obvious.
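Roughly, the inference-time half of that loop could be sketched like this. `propose_fix` is a hypothetical stand-in for the model (it takes the current source plus the error message and returns a new attempt), and only syntax errors are handled; none of this addresses the training question.

```python
def iterate_until_compiles(source, propose_fix, max_rounds=5):
    """Feed compiler errors back to `propose_fix` until the code compiles.

    `propose_fix(source, error_message)` is a hypothetical callable standing
    in for the model. Returns the first version that compiles, or None if
    no version compiles within `max_rounds` fixes.
    """
    for attempt in range(max_rounds + 1):
        try:
            # compile() only checks syntax here; nothing is executed.
            compile(source, "<candidate>", "exec")
            return source
        except SyntaxError as err:
            if attempt == max_rounds:
                break
            feedback = f"line {err.lineno}: {err.msg}"
            source = propose_fix(source, feedback)
    return None
```

The same shape would extend to runtime failures by swapping the `compile` call for a sandboxed test run and passing the traceback back as feedback.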

12B, though. What about 1.2T?
You need to scale the amount of data to take advantage of the increase in parameters. I'm not sure where we would find another 100 GitHubs' worth of data.