| I've recently been writing some Nix modules for personal use (mostly flake-parts) and using o3 to help, since o3 is the first OpenAI model I've used that seems good enough for many tasks. It's useful for answering some questions about Nixpkgs conventions, but it's not much faster than just looking in the manuals or reading the source. When it comes to looking at my actual code and answering questions about it, though, it's extremely hit and miss. It's good at constructing simple functions, but the module code it writes sometimes inappropriately imports idioms from other languages. It hallucinates often. And it's absolutely worse than useless for debugging Nix module system issues. Its answers to questions about infinite recursion are nonsense, yet plausible enough that I wasted a lot of time thinking about them before I learned more details of the module system. After getting burned by following it down its rabbit holes, I unfortunately find myself ignoring most of its output related to this project, even as I continue to reflexively ask it things. I have often noticed afterward that it turned out to have been right in these cases, but I'm still left with the empty feeling, not that I shouldn't have bothered to figure out my own answer, but that I shouldn't have bothered to ask. All of that is to say: I think working in a language/ecosystem LLMs are "bad at" is a useful sanity check. The ways that LLMs suck at the languages they suck at are instructive because they remind you of the nature of these things. The failure modes are still what you'd expect from a stochastic parrot, even as the models get "smarter". The massive training data pools for more popular programming ecosystems make it too easy to fool yourself into believing that these things can reason. The unevenness of their performance tells you what hasn't actually "generalized". |