Hacker News
by Twey 632 days ago
My favourite litmus test for ‘can LLMs reason about code?’ is to make up a programming language with familiar syntax but weird semantics. E.g.:

- all variables contain signed integers

- all variable names have block scope

- there is no variable declaration syntax: all variables are implicitly initialized at first use with the value 5

- all integer literals are expressions

- the expression `a + b` means to subtract the value on the left from the variable on the right, returning the previous value of the variable

- a program is a block

- a block is a sequence of statements enclosed in braces and separated by semicolons, and executed from bottom to top

- conditionals are introduced by the keyword `while`, followed by an expression, followed by a block that is executed only if the expression evaluates to 4

- loops are done by simply prefixing a block with an expression; if the expression evaluates to 0, the block will run indefinitely, otherwise the block will run a number of times indicated by the negation of the value

Et cetera. Then I ask the LLM to write a simple program (e.g. FizzBuzz). Even with a lot of hand-holding, I've yet to get an LLM to do this successfully, or even to answer questions about a program written in the language.
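For concreteness, the rules listed above can be sketched as a small Python evaluator over a hand-built syntax tree (no parser). This is only one possible reading, not a reference implementation. The class names (`Lit`, `Var`, `Add`, `While`, `Loop`) and the helpers `lookup`, `eval_expr` and `run_block` are made up for illustration. The rules leave several points open, so the sketch assumes: a variable is created in the innermost block where it is first used, inner blocks can see variables of enclosing blocks, the left operand of `+` is evaluated before the right, a loop count is evaluated once, and a positive loop count runs the block zero times.

```python
from dataclasses import dataclass


@dataclass
class Lit:            # integer literal
    value: int


@dataclass
class Var:            # variable reference
    name: str


@dataclass
class Add:            # written `left + right` in the made-up language
    left: object      # any expression
    right: Var        # must be a variable, since it gets modified


@dataclass
class While:          # the conditional, despite the keyword
    cond: object
    body: list


@dataclass
class Loop:           # an expression prefixed to a block
    count: object
    body: list


def lookup(scopes, name):
    """Return the scope holding `name`, creating it with value 5 if unseen."""
    for scope in reversed(scopes):
        if name in scope:
            return scope
    scopes[-1][name] = 5          # implicit initialisation at first use
    return scopes[-1]


def eval_expr(expr, scopes):
    if isinstance(expr, Lit):
        return expr.value
    if isinstance(expr, Var):
        return lookup(scopes, expr.name)[expr.name]
    if isinstance(expr, Add):
        left = eval_expr(expr.left, scopes)       # assumption: left first
        scope = lookup(scopes, expr.right.name)
        previous = scope[expr.right.name]
        scope[expr.right.name] = previous - left  # subtract left from the variable
        return previous                           # result is the old value
    raise TypeError(f"not an expression: {expr!r}")


def run_block(stmts, scopes):
    """Run a block bottom to top in a fresh scope; return that scope."""
    scope = {}
    scopes = scopes + [scope]
    for stmt in reversed(stmts):
        if isinstance(stmt, While):
            if eval_expr(stmt.cond, scopes) == 4:  # runs only on exactly 4
                run_block(stmt.body, scopes)
        elif isinstance(stmt, Loop):
            count = eval_expr(stmt.count, scopes)
            if count == 0:
                while True:                        # 0 means run forever
                    run_block(stmt.body, scopes)
            for _ in range(-count):                # negated value = repetitions
                run_block(stmt.body, scopes)
        else:
            eval_expr(stmt, scopes)                # expression statement
    return scope


# { while x { 2 + x }; 1 + x }
program = [
    While(Var("x"), [Add(Lit(2), Var("x"))]),
    Add(Lit(1), Var("x")),
]
print(run_block(program, []))    # {'x': 2}

# { -3 { 1 + n }; 0 + n }
loop_demo = [
    Loop(Lit(-3), [Add(Lit(1), Var("n"))]),
    Add(Lit(0), Var("n")),
]
print(run_block(loop_demo, []))  # {'n': 2}
```

In the first program the bottom statement runs first: `x` is created as 5 and reduced to 4, so the `while` condition holds and the inner block reduces `x` to 2. Reading the same text top to bottom with ordinary semantics gives a completely different answer, which is the point of the test.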

2 comments

I actually had pretty good results from taking a new language that was posted here and having GPT-4 try to interpret it. I don't remember what it was called, but it was APL-like, very symbol-dense, but not using standard symbols. It was too new to be included in any training data at the time, but GPT-4 did a good job of figuring out what each symbol meant.

I think it's not impossible for LLMs to write code like you're wanting. Maybe it's actually harder to redefine common idioms, but to be fair that happens with people too:

https://en.m.wikipedia.org/wiki/Stroop_effect

My favorite test is to ask the LLM to approximate the mental processes going on in my brain, and based on that, divine what food I had for dinner last Thursday. /s

I’m honestly quite tired of reading people’s favorite ways to break the LLM, like it’s some kind of an achievement. Always in the context of “See? It doesn’t really reason/know/understand X!”.

Yes, it breaks when asked to do complicated stuff. GPT-4 was worse at it than o1, GPT-3 broke on trivial queries, and GPT-2 couldn’t get anything done. I don’t even interact with LLMs often, and I find this whole topic to be breathlessly obvious, boring and unproductive, and yet every single conversation about LLMs devolves into it. Sorry about the rant, but it needed to come out at some point.