Hacker News
by boxed 954 days ago
> We designed a test for this theory: we reran the benchmarks without showing the models the instructions to each exercise. Instead, we just told them that they were Exercism exercises, and gave them the exercise names and function stubs.

This summarizes all my skepticism against the AI field. It's pretty clear that the models aren't solving the problems; they have them memorized.

13 comments

Memorization often gets a bad rap as the underachiever's shortcut. However, it's a fundamental component of any learning process! Our ability to reason, solve problems, and innovate is all built upon a foundation of memorized information. In fact, it's precisely the reason humans have thrived for so long; we were able to memorize and pass down knowledge culturally long before the written word, not because we were 100 times smarter than our nearest cousins. Without memorization, be it in our brains or AI algorithms, there's no foundation to build upon for higher reasoning.
It's hard for me to decide without seeing the data. Even if you don't know the exact exercise, seeing the title and the function name/parameters is often enough for me to guess what the challenge is. I checked the public questions on Exercism, and almost all of those (that I spot checked) that contained the function name were extremely obvious. Knowing it's a programming challenge would also improve my guessing chances.

For example the function stubs I can find are "value_of_card(<card>)" in exercise "Black Jack", or "generate_seat_letters(<number>)" in exercise "Plane Tickets". I think I could guess those without seeing the rest of the question.
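
To make that concrete, here is the kind of guess someone might write from the stub name "value_of_card" alone. The card encoding (strings such as "K" or "7") and the ace counting as 1 are assumptions made for illustration, not details taken from the exercise text, so read this as a guess rather than the reference solution.

```python
def value_of_card(card: str) -> int:
    """Guess at a blackjack card value, inferred from the stub name only."""
    # Assumption: face cards arrive as the strings "J", "Q", "K".
    if card in ("J", "Q", "K"):
        return 10
    # Assumption: an ace counts as 1 here; the real exercise may differ.
    if card == "A":
        return 1
    # Assumption: number cards arrive as digit strings such as "7" or "10".
    return int(card)
```

If a guess like this happens to match the hidden instructions, the tests pass without the solver ever having seen the problem statement.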

You can call it whatever you want. All I know is that I used to write programs a line at a time, then in blocks of code at a time, spit out by LLMs.

Using GPT-4 Turbo yesterday, I feel like I'm moving to pages of code at a time now.

Taking the ideas in my head and turning them into reality is so easy now

So how can it solve novel problems? The internet does not have all combinations of every possible task with any random programming language, library, or constraint. It can even solve problems in nonexistent programming languages and libraries if you describe them. If that's just memorization, then I don't know what isn't.
If that's your takeaway from this, then you really missed the point. The implication here is that going from GPT-4 to GPT-4 Turbo represents a leap away from memorization and toward better reasoning with a more complete world model.
"They memorized all the problems" is not what was found here, and it is still a wrong overcorrection.
"GPT-4 has more problems memorized than GPT-4 Turbo" was exactly what was found here.

That doesn’t mean it’s only able to solve problems in its training set (though it’s much better at that, obviously).

If you are shown only the title of a coding problem and the name of the site it's from, and you manage to solve it, you are showing that you either cheated or knew the answer.
On the contrary, it could mean you were able, with some probability, to guess what the problem is, and then, with some further probability, to solve it.

The key is, can you guess the problem from the title and the function name? I'd argue, sure, at least half the time? Why not...
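
The guess-then-solve argument amounts to multiplying two probabilities. A minimal sketch, with made-up numbers: the 0.5 comes from the "at least half the time" guess, and the 0.8 solve rate is purely hypothetical, not a figure from the benchmark.

```python
def blind_solve_rate(p_guess: float, p_solve_given_guess: float) -> float:
    """Chance of solving a problem when shown only its title and stub.

    p_guess: probability of correctly guessing the task from the title/stub.
    p_solve_given_guess: probability of solving it once the guess is right.
    """
    return p_guess * p_solve_given_guess


# Hypothetical inputs: guess right half the time, then solve 80% of those.
rate = blind_solve_rate(0.5, 0.8)
print(f"{rate:.0%}")
```

Under those assumed numbers, a solver with nothing memorized would still pass a sizable share of the exercises, so a nonzero blind score does not by itself prove memorization.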

I mean sure, it memorized some of the answers. I'm not denying that. Clearly, it didn't memorize all of them.
When people say "oh look how amazing, it can solve programming problems!" when in fact they have only seen the models CHEAT, that is an enormous problem.

For cases where finding the answer is all that matters, that's perfectly fine, but it's not fine for claims that it can code. There's a huge difference.

It can generate never-before-seen strings of comprehensible language. It can react to the inherent logic embedded in words and text and provide a brute forced version of what a human could. That it can “solve” a problem only through “cheating” is an anthropomorphism that betrays the magic that is evident to anyone who has used these things.
I've seen it code on completely novel tasks, so I'm not sure what you're suggesting here. The model can unquestionably code.
Almost 2024 and people still can't accept that LLMs can code...
Of course they can't. And self-driving cars also don't exist, they're like 10 years away at best.
Okay... Funny how forcing it to not CHEAT did not increase apparent ability.

"It can code" and "it has memorized some coding questions" are not mutually exclusive.

Though this is exactly what happened. The initial test was run on a model that "cheated" (i.e., had memorized the answers). The second test was run on a model that didn't "cheat" as much, yet still scored only 2% lower. So the question is not really resolved. How much did the first model cheat, and how much did the second? If the second model "cheats" less, then it wins.

Also, I don't understand your obsession with the word cheating. If you have solved a problem before on a different website and solve it again, did you cheat? Or did you just use your brain to store the solution for later?

> Okay... Funny how forcing it to not CHEAT did not increase apparent ability.

The article did the opposite. It forced the models to cheat to solve the problems, which they did happily. They should have stated "there is no actual problem to solve here, you must supply a problem for me to solve".

> "It can code" and "it has memorized some coding questions" are not mutually exclusive

This I will give you. Many humans try to cheat at basic math because they are lazy, so will this model. Maybe that's a sign of intelligence :P

OK, but you understand there's a body of literature that shows that LLMs don't "just" memorize.
+100 to that. My biggest scepticism is about people actually creating a new problem while thinking they are solving a problem. Don't get me wrong, translating natural-language ideas into code is fun and all, but the truth is that the prompt is also code, just in an ambiguous language format given to the machine.

When did natural language become better for expressing development ideas than code? I know – when you don't know how to code in the first place. Then you have to bet on all of the ambiguities of the language, cultural and metaphysical, that words carry in order to hack your thing together, instead of expressing yourself directly and explicitly.

Finally, what is beautiful about the strict code format we are so used to is that it is truly the fastest and shortest path to getting your thing done, provided you possess the knowledge needed.

Natural language isn't superior to computer languages. NL allows you to describe a software concept in a language- and framework-neutral way. The LLM generates the code. The real benefit is when you work across languages and frameworks. It is difficult to keep all of the details of all of the framework calls in your head all of the time.
Where is the evidence for that? Is there any real-world application built and running by describing software concepts to an LLM?

It is what it is – a novel search engine, lossy and non-credible. Effectively useless on codebases that extend beyond its fairly limited context.

That sounds a lot like gatekeeping.

These tools will empower folks who aren’t developers to build stuff and maybe learn a bit more about how programming works.

They will enable folks who have ideas, but can’t express them, to actually be able to create what they are imagining.

That’s awesome.

Code isn’t beautiful (except for a few rare exceptions). Creating something with code is.

I agree it is a great tool for learning, but I don't believe anything more complex or of real use can be made AND maintained with it.
I think we’re probably way too early in the AI lifecycle to really form any strongly held beliefs yet.

In the 11 months since ChatGPT was released, things have come a long way. Who knows where we’ll be in another 11 months.

What I'm trying to say is that the problem of efficiently generating code by describing what you want is not approachable this way at all. When you compress what you want into a prompt, you lose the details, and in order to restore all of them you will need a prompt much bigger than the code it generates. Code itself compresses an idea, but no idea can compress the code well enough. In another 11 months it will be in exactly the same spot: by the nature of the task, it will not be able to be more efficient at it.
From a black-box point of view, and from one angle, GPT is a web filter that tries to find you the exact thing you are looking for, but from memory. With Google, by contrast, you have to distill all the info into what you need yourself.
"Memorize" implies they can only recite things verbatim, and that ignores the massive leap in being able to synthesize disjoint "memories" in new ways.
Even if it's not true AI, or even an architecture with the potential to become AI, LLMs are already good enough to provide real-world value. Obviously "super autocomplete" isn't as sexy as true AI, but it's still very useful.
If the benchmark is meant to replicate most people's experience of taking technical interviews, then this is a spot-on approach and serves the potential user right.
LLMs are lossy compression
All models are, including the human brain.
The human brain is a model?
It models the world around it, so it's fairly similar to what GPT does, especially with the newly-added image capabilities and stuff.
But the brain itself is not a model.
Consciousness itself is a model of the world.

Our experience of the world is a model executing.

Compare the latest neuroscience to the latest neural networks: they look and behave very similarly.