by IdiocyInAction 2227 days ago
How does this do compared to other models? Is this a totally cutting edge result? On the surface, it seems quite impressive, but sans an environment to try it out with, I cannot be entirely sure. Still, this does make me question whether I chose a safe career, haha.

The thing is, I'd really need to see a live demo to see how good this is. Making mistakes is actually kind of a big issue; as most people know, debugging code is harder than writing it. And a lot of the language models which can write impressive-seeming text also generate masses of garbage. There's no way to know whether this was cherry-picked or not.

The mere fact that it can extract meaning from text like this is already really impressive though.


I've read a fair number of papers on neural program synthesis lately. To me, these seemed to be obviously cherry-picked examples, so you can't really evaluate the whole system based on them.

However, this is fairly impressive for a couple of reasons. First, the system constructs programs from natural language descriptions, rather than from input-output examples or a formal specification, which are the most common settings for program synthesis. Second, they're generating full-blown Python, not a smaller, domain-specific language.

Finally, and this is pretty mind-blowing, there's the seamless, idiomatic use of loops, branches, and function calls. I haven't seen previous program synthesis tools able to generate such complex code. They're typically limited to simple linear programs of fewer than about 100 lines. Complex control flow and function calls are still beyond their reach for the most part.

I'm not an active researcher in neural program synthesis, so my statements may not reflect the current state of the art.

I honestly thought that the most promising route forward for program synthesis would be a model that incorporated knowledge of the syntax and semantics of code. Most likely, a model that manipulated, or at least had some view of, the program's AST. This seems to be just throwing a giant Transformer model at GitHub.
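For anyone who hasn't looked at one: here is a minimal sketch of what "a view of the program's AST" means, using Python's standard `ast` module. The one-line `source` string is made up for illustration, and this says nothing about how any particular synthesis model consumes the tree.

```python
import ast

# Hypothetical one-line program, used only to illustrate the tree structure.
source = "total = price * quantity + tax"

# Parse the text into an abstract syntax tree.
tree = ast.parse(source)

# Collect the node types a structure-aware model would see instead of raw tokens.
node_types = [type(node).__name__ for node in ast.walk(tree)]

# Print the assignment statement as a nested tree of nodes.
print(ast.dump(tree.body[0]))
print(node_types)
```

The point is that operator precedence and nesting are explicit in the tree (the multiplication sits inside the addition), whereas a plain language model has to infer that from the token sequence.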

Fine tuning a vanilla language model on a giant corpus of code feels like a dead end for the field, long-term. It seems obvious to me that humans are doing something more than just statistical pattern recognition and generation when we write and reason about code.

Then again, it's hard to argue with results. I'm sure lots of pre-neural-network voice recognition researchers were in love with the elegance of their hidden Markov models.

Edit: Also, everyone should go try the FlashFill feature in Microsoft Excel. As far as I know, it's the only example of program synthesis shipped in a consumer-facing production system, and it works shockingly well.

> Fine tuning a vanilla language model on a giant corpus of code feels like a dead end for the field, long-term. It seems obvious to me that humans are doing something more than just statistical pattern recognition and generation when we write and reason about code.

Yeah, this is the main reason why I would be interested in more examples. But, if this thing was trained on all of GitHub, I could imagine that it could come up with decent-looking code for a lot of examples; a beefy, smarter Google with some rudimentary contextual understanding, if you will. Still, the presence of any mistakes is a no-go and I'd be really interested in how it reacts to more realistic, specific requirements.

But yeah, I'd figure a model for code generation would have to have some kind of knowledge of syntax and semantics, rather than doing pure statistical pattern matching, to be of any real use. It would have to not only generate its code but also debug it (I wonder whether you could do that purely with statistical pattern recognition). I might be wrong, of course, but I would be surprised if that is enough to write complex code.

Five years ago we were already here: https://karpathy.github.io/2015/05/21/rnn-effectiveness/

Calling the field "statistical pattern matching" might be underselling it a bit, even if technically accurate on some level. I mean, syntax/semantics are clearly not the problem, those are the easiest to learn (see the post linked above). If anything, I'm scared of it writing syntactically correct nonsense (or even worse, subtly-flawed-but-correct-looking code).
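To make the "correct-looking but subtly flawed" worry concrete, here is a small sketch. The `generated` snippet is a made-up stand-in for model output, not anything the demo actually produced: it passes a syntax check and reads naturally, but it has an off-by-one bug that only a behavioral test catches.

```python
import ast

# Hypothetical "generated" snippet: parses fine, but the loop skips the last element.
generated = '''
def total(xs):
    result = 0
    for i in range(len(xs) - 1):
        result += xs[i]
    return result
'''

def is_valid_syntax(source):
    """Return True if the source parses as Python, False otherwise."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# Load the snippet into an isolated namespace so it can be called.
namespace = {}
exec(generated, namespace)

syntax_ok = is_valid_syntax(generated)           # the syntax check passes
behaves = namespace["total"]([1, 2, 3]) == 6     # the behavioral check fails
print(syntax_ok, behaves)
```

A syntax check alone would wave this through, which is why the lack of a live demo or test-based evaluation matters.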

>> Edit: Also, everyone should go try the FlashFill feature in Microsoft Excel. As far as I know, it's the only example of program synthesis shipped in a consumer facing production system, and it works shockingly well.

And it's not a giant language model trained on a gigantic dataset. Rather, if memory serves, it's a bunch of task-specific DSLs and rules, all hand-written from scratch.

I don't know how FlashFill works in 2020, but from [1] I learn that the original implementation was a brute-force enumeration (with clever heuristics along the lines of CDCL (= conflict-driven clause learning in SAT solvers) for speeding up common cases) of a small DSL for string manipulation. This was (and still is) the state-of-the-art approach to programming-by-example program synthesis.

[1] O. Polozov, S. Gulwani, FlashMeta: A Framework for Inductive Program Synthesis. https://www.microsoft.com/en-us/research/wp-content/uploads/...
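For readers unfamiliar with the idea, here is a toy sketch of brute-force enumeration over a small string DSL, driven by input-output examples. Everything in it is an assumption made for illustration: the `PRIMITIVES` table, the `run` and `synthesize` helpers, and the example pairs are all hypothetical, and it has none of the pruning heuristics described in [1]. It is not how FlashFill itself is implemented.

```python
from itertools import product

# Hypothetical toy DSL: each primitive maps a string to a string.
PRIMITIVES = {
    "lower": str.lower,
    "upper": str.upper,
    "strip": str.strip,
    "first_word": lambda s: s.split()[0] if s.split() else "",
    "last_word": lambda s: s.split()[-1] if s.split() else "",
    "initial": lambda s: s[:1],
}

def run(program, text):
    """Apply the primitives named in `program` from left to right."""
    for name in program:
        text = PRIMITIVES[name](text)
    return text

def synthesize(examples, max_len=3):
    """Return the first shortest pipeline consistent with every example, or None."""
    for length in range(1, max_len + 1):
        # Enumerate every pipeline of this length, shortest programs first.
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, inp) == out for inp, out in examples):
                return program
    return None

# Two input-output examples, as a user might give them in a spreadsheet column.
examples = [("Ada Lovelace", "lovelace"), ("Alan Turing", "turing")]
program = synthesize(examples)
print(program, run(program, "Grace Hopper"))
```

The search space grows exponentially with program length (6 primitives give 6^n pipelines of length n), which is why real systems need a carefully designed DSL and aggressive pruning rather than plain enumeration.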

That's a nice, formal way of putting it, thank you :)

(Sorry, I really should have refreshed my memory on Gulwani et al. I think I've even linked the paper in an HN comment before.)

Oh, btw, I doubt they're doing this with a language model nowadays. Unless FlashFill has suddenly started filling cells for email addresses with haikus etc...

I am also hedging my hopes of this working in "more realistic" scenarios. It does produce code that looks natural to us, but I expect it to show clear "seams" where its understanding of something isn't deep enough.

But maybe this is just a question of how much compute (and network size/"depth") you invest. On a certain level we're also just some recurrent LSTM :)

Ha. You hit the nail on the head. To my knowledge, there is no rigorous way to measure AI-generated anything. So every demo is "ooh, look at this" and actual performance is not scientifically evaluated, because we don't know how. This includes images, text, etc.