| there are "a few" problems with this. if you know how you want something done, tough luck. LLMs, even the "really smart" ones, still often do it "their way". they use "their style" (whatever the most common way to write something might be) and "their preferred packages" (whatever the most common ones for the language are). i remember someone told me "hey dude try vercel's v0 it's so good" and i asked it for some basic Svelte code. it spat out React. if you are modifying an existing, non-AI codebase, it's really annoying for the same reason. if you have a preference for specific design patterns or code style, an LLM is unlikely to work well without substantial prompting and re-trying. they still can't really fix bugs. syntax errors sure, but actual time-consuming logic bugs? figuring out lifetimes in Rust? forget about it. all they do is add freaking print statements and say "try these things to fix it." no. you're the robot, you work for me, you do it. they suck at functional languages like Haskell. like they're really just bad. lastly, they're interns, not employees. interns require hand-holding, supervision, and verbal abuse to get anything done right. bots are, for now, the same. they impose a cognitive load when you want something of any importance done: you can't actually trust anything they output, at all. you have to go re-check everything they do. i remember a few days ago i wanted to parse a bunch of UDP packets from 10-20GB daily pcap dumps. I gave an LLM the spec for the message format as a PDF and said "write this in rust", along with the existing (working but slow) Python implementation. this should be a simple case to apply an LLM: simple, routine, boilerplate code that can be next-token-predicted fairly easily, but still takes an annoying amount of time to type out. unfortunately it screwed up multiple times. it failed to use the pcap parsing crate (even when i supplied docs), probably because the crate wasn't frequent in its training corpus. 
more importantly, it just miswrote constants. like it would get the constant for length-checking a certain message type wrong despite it being plainly specified in both the spec and the Python version. LLMs are cool research tech and I have friends who have used them to learn to write Python scripts and React webshit. in my opinion, they are of little value for "serious" programming. i realize that's an annoying and vaguely conceited term but it's the best one I can think of at the moment. i look forward to when they actually work well. in my opinion, a good improvement would be focusing on writing in at least somewhat-verifiable languages, or writing such that the pieces are verifiable. robot 1 translates your request into rules, robot 2 writes the code from the rules, a SAT solver checks the checkable chunks for validity, while robot 3 is specialized in checking the unverifiable "connection points", use of side effects, etc. the "intern" problem is by far the biggest of what I've listed and this is probably the best way to solve it. once that's done, we can hopefully let these chug for a while until they get it right rather than giving users crappy output. oh, and they MUST be tuned to be capable of saying, "I don't know." |
Have you tried the reasoning models? I find that o1 really is able to find logic bugs, even tricky ones!