I don't think this is a good test. Every CS graduate student has to write a lambda-calculus parser. There must be thousands of implementations on the web. It really is not strange that GPT-4 can reproduce this.
Frankly, if ChatGPT is only good at reproducing commonly written code, I don’t think it will impact the profession much, given that code reuse is already a thing and lots of code is distributed for free on the internet.
Even if that were the case, it'd still be massively useful because it would allow code to be transposed between languages idiomatically. For example, certain things like game engines were only implemented in very messy C++ code. If GPT understands how these libraries work, could it recreate all these game engines as cleaned-up, elegant Haskell code?
Bilingual LLMs are already excellent at translating human languages.
I imagine we'll find that GPT can translate between existing programming languages very well.
We could soon see a future where "Damn this useful paper/code is in x language. Have to wait for someone to port the code over to y language" is a thing of the past.
They are excellent. Not quite human level, but very, very close. I was curious about the Chinese-English translation capabilities of the latest crop of models: on more difficult texts, a bilingual model like GLM-130B makes several errors per page, while GPT-4 is down to probably just around one.
Interested to see how that plays out for programming languages.
A sizable portion of software devs ultimately work on something that, at its core, follows the basic CRUD pattern. The day-to-day stuff also involves a lot of boilerplate -- "public static void main" has paid a lot of mortgages over the years.
I sort of agree, but it still amazes me that it can get the code correct, even though I asked for a highly specific style (i.e., use of recursion, representing "None" as null, the format of the JSON, making local functions, etc.). So, even though it has never seen that exact implementation, it still assembles a function that just works. If it were just mixing up different code it recalled from memory, it would likely have a bunch of silly errors here and there that I'd have to fix manually, but no, it just works. That's what impressed me.
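For readers who have not seen the exercise, here is a rough sketch of the kind of function being described: a recursive parser built from local functions that returns a JSON-friendly structure, with Python's None standing in for null. The original prompt is not reproduced in this thread, so the surface syntax (`\x.body`), the JSON field names, and the name `parse_lambda` are all assumptions made for illustration, not what was actually requested or generated.

```python
import json
import re

# Assumed identifier shape for variables; not taken from the original prompt.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def parse_lambda(source):
    """Parse an untyped lambda-calculus term into nested dicts.

    Returns None (serialised as JSON null) when the input is empty.
    """
    # Identifiers become one token; any other non-space character is its own token.
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\S", source)
    pos = 0  # index of the next unread token

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expect(expected):
        token = advance()
        if token != expected:
            raise SyntaxError(f"expected {expected!r}, got {token!r}")

    def is_ident(token):
        return token is not None and IDENT.fullmatch(token) is not None

    def require_term():
        term = parse_term()
        if term is None:
            raise SyntaxError("expected a term")
        return term

    def parse_atom():
        # Returns None when no atom starts at the current position.
        token = peek()
        if token == "(":
            advance()
            term = require_term()
            expect(")")
            return term
        if token == "\\":
            advance()
            param = advance()
            if not is_ident(param):
                raise SyntaxError(f"expected a parameter name, got {param!r}")
            expect(".")
            # The body extends as far to the right as possible.
            return {"type": "lam", "param": param, "body": require_term()}
        if is_ident(token):
            advance()
            return {"type": "var", "name": token}
        return None

    def parse_term():
        # Application is left-associative: f x y parses as (f x) y.
        term = parse_atom()
        if term is None:
            return None
        while True:
            arg = parse_atom()
            if arg is None:
                return term
            term = {"type": "app", "fn": term, "arg": arg}

    result = parse_term()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result


if __name__ == "__main__":
    print(json.dumps(parse_lambda("(\\x.x) y"), indent=2))
```

Calling `json.dumps(parse_lambda(""))` gives `null`, which is one way to read the "None as null" requirement. The point of the sketch is only that each stylistic constraint (recursion, local functions, a fixed JSON shape) is a separate thing the model has to get right at once.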
Let's invent a better test then. What could be an example task that isn't widely available on the internet already? I'm having a hard time coming up with anything that isn't reproduced in many places.
Ha, I gave Bing Chat something from Advent of Code and it correctly identified that it came from Advent of Code (without that being anywhere in the prompt). It provided a solution, but given that it identified the source of the question I don't think it was a good test. As you say, maybe changing some values will help.
Give it the Synacor Challenge, just the spec, and see if it can pull it off. There are fewer of those solutions out there. It just went offline recently, but Aneurysm9 has preserved the problem spec, their binary, and the checksum of the codes for their binary to check against.
The program will likely need to be amended to get the last code (last 2 codes? it's been a while), so you can see how it would handle updating for the new requirements.