by chromaton
89 days ago
I did something very similar last year, but with programming languages that were REALLY out of distribution: they were generated specifically for the benchmark. I call it TiānshūBench (天书Bench): https://jeepytea.github.io/general/introduction/2025/05/29/t... Some models were OK at solving very simple problems, but nearly all of them would, for example, hallucinate control structures that did not exist in the target language.
Sometimes I think LLMs are unbelievably, amazingly good at things. And sometimes I’m deeply suspicious that they’re really not very smart, and this was an example of the latter.
[0] Python calling to C, passing a callback function pointer and a void *opaque that C will pass back to the callback. Short of writing an extension module, this is pretty much forced to go through an inherently nasty JIT codegen process in libffi, which is sort of tolerable, but you really don’t want to redo it for each object that gets opacified to void*. Codex passed a lambda, which did the nasty JIT thing every time. I wrote a little shim using weakref. Apparently no one has done this before, so Codex wasn’t trained on it, and it couldn’t make itself call the function. Maybe I should post it to PyPI.
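For anyone curious what a shim along those lines might look like, here is a rough sketch. It is not the shim described above, just one plausible shape using ctypes: build the callback thunk once, and pass a small integer handle as the void * instead of a fresh closure per object. The callback signature `int (*)(int, void *)` and the names `CALLBACK`, `register`, `trampoline`, and `Doubler` are all made up for illustration.

```python
import ctypes
import itertools
import weakref

# Hypothetical C callback type: int (*cb)(int value, void *opaque).
CALLBACK = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int, ctypes.c_void_p)

# Maps integer handles to Python objects without keeping them alive.
_handles = weakref.WeakValueDictionary()
_next_handle = itertools.count(1)  # start at 1 so a handle is never NULL


def register(obj):
    """Return an integer handle that can be passed to C as the void *."""
    handle = next(_next_handle)
    _handles[handle] = obj
    return handle


@CALLBACK
def trampoline(value, opaque):
    # The libffi thunk for this function is created exactly once, here.
    # ctypes hands a c_void_p argument to Python as an int (or None if NULL).
    target = _handles.get(opaque)
    if target is None:
        return -1  # object already collected, or unknown handle
    return target.on_event(value)


class Doubler:
    """Stand-in for whatever object the C side should call back into."""

    def on_event(self, value):
        return value * 2
```

In use, you would pass `trampoline` as the function pointer and `register(obj)` as the opaque argument to the C function. The caller still has to keep `obj` alive for as long as C might call back, since the registry only holds a weak reference.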