by _bin_ 481 days ago
there are "a few" problems with this.

if you know how you want something done, tough luck. LLMs, even the "really smart" ones, still often do it "their way". they use "their style" (whatever the most common way to write something might be) and "their preferred packages" (whatever the most common ones for the language are). i remember someone told me "hey dude try vercel's v0 it's so good" and i asked it for some basic svelte code. it spat out react.

if you are modifying an existing, non-AI codebase, it's really annoying for the same reason. if you have a preference for specific design patterns or code style, it's unlikely to work well without substantial prompting and re-trying.

they still can't really fix bugs. syntax errors sure, but actual time-costing logic bugs? figuring out lifetimes with rust? forget about it. all they do is add freaking print statements and say "try these things to fix it." no. you're the robot, you work for me, you do it.

they suck at functional languages/haskell. like they're really just bad.

lastly, they're interns, not employees. interns require hand-holding, supervision, and verbal abuse to get anything done right. bots are, for now, the same. they impose a cognitive load when you want something of any importance done: you can't actually trust anything they output, at all. you have to go re-check everything they do.

i remember a few days ago i wanted to parse a bunch of UDP packets from 10-20GB daily pcap dumps. I gave it the spec for the message format as a PDF and said "write this in rust", along with the existing (functional but slow) python implementation. this should be a simple case to apply an LLM: simple, routine, boilerplate code that can be next-token-predicted fairly simply, but still takes an annoying amount of time to type out. unfortunately it screwed up multiple times. it failed to use the pcap parsing crate (even when i supplied docs) because it probably wasn't frequent in its training corpus. more importantly, it just miswrote constants. like it would get the constant for length-checking a certain message type wrong despite it being plainly specified in the spec and the python version.
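a minimal sketch of one way to make the miswritten-constant failure harder to hit, whoever (or whatever) writes the parser. the message types and field layouts here are hypothetical, since the actual spec isn't shown in this thread. the idea is only that the expected length is derived from the declared layout rather than typed in by hand as a separate number:

```python
import struct

# HYPOTHETICAL layouts: stand-ins for whatever the real spec defines.
# Maps the message type byte to the layout of the body that follows it.
# "!" means network byte order with no padding.
MESSAGE_FORMATS = {
    0x01: struct.Struct("!IQ"),   # assumed: 32-bit sequence number + 64-bit timestamp
    0x02: struct.Struct("!IHH"),  # assumed: 32-bit id + two 16-bit fields
}


def parse_message(payload: bytes):
    """Parse one UDP payload: a type byte followed by a fixed-size body."""
    if not payload:
        raise ValueError("empty payload")

    msg_type = payload[0]
    layout = MESSAGE_FORMATS.get(msg_type)
    if layout is None:
        raise ValueError(f"unknown message type {msg_type:#04x}")

    body = payload[1:]
    # The expected length comes from the layout itself, so there is no
    # separate length constant that can drift away from the field list.
    if len(body) != layout.size:
        raise ValueError(
            f"type {msg_type:#04x}: expected {layout.size} body bytes, got {len(body)}"
        )
    return msg_type, layout.unpack(body)
```

this doesn't fix a wrong layout, but it does mean there's one place to compare against the spec instead of two.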

LLMs are cool research tech and I have friends who have used them to learn to write Python scripts and react webshit. in my opinion, they are of little value for "serious" programming. i realize that's an annoying and vaguely-conceited term but it's the best one I can think of at the moment. i look forward to when they actually work well.

in my opinion, a good improvement would be focusing on writing in at least somewhat-verifiable languages or writing s.t. pieces are verifiable. robot translates your request into rules, robot 2 writes the code from rules, SAT solver checks the check-able chunks for validity while robot 3 is specialized in checking unverifiable "connection points", use of side effects, etc. the "intern" problem is by far the biggest of what I've listed and this is probably the best way to solve it. once that's done, we can hopefully let these chug for a while until they get it right rather than giving users crappy output.
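a rough sketch of the "let it chug until it gets it right" loop, under heavy assumptions: `generate` is a hypothetical stand-in for whatever produces a candidate (a model call in practice), and `checks` are plain executable predicates rather than a real SAT solver or a second model. it only shows the control flow: retry with the failed checks as feedback, and return nothing rather than unverified output when the budget runs out.

```python
from typing import Any, Callable, List, Optional, Tuple

# A check is a (name, predicate) pair; the predicate returns True on success.
Check = Tuple[str, Callable[[Any], bool]]


def generate_until_verified(
    generate: Callable[[List[str]], Any],
    checks: List[Check],
    max_attempts: int = 5,
) -> Optional[Any]:
    """Call `generate` until a candidate passes every check.

    `generate` receives the names of the checks that failed on the previous
    attempt (empty on the first call). Returns None if no candidate passes
    within `max_attempts`, instead of handing back unverified output.
    """
    feedback: List[str] = []
    for _ in range(max_attempts):
        candidate = generate(feedback)
        failures = [name for name, check in checks if not check(candidate)]
        if not failures:
            return candidate
        feedback = failures
    return None
```

the caller treats `None` as "couldn't produce something that passes" and never sees the failed candidates. anything the checks don't cover is still unverified, which is the "connection points" part of the problem.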

oh, and they MUST be tuned to be capable of saying, "I don't know."


> they still can't really fix bugs. syntax errors sure, but actual time-costing logic bugs? figuring out lifetimes with rust? forget about it. all they do is add freaking print statements and say "try these things to fix it." no. you're the robot, you work for me, you do it.

Have you tried the reasoning models? I find that o1 really is able to find logic bugs, even tricky ones!

i'm glad they seem to work better for you. i sometimes seem to be the only person out there who can't get the same level of utility out of these models as others.

my guess is for some applications they can, but even reasoning models (o3-mini-high, grok3, sonnet 3.7, o1, deepseek, etc.) often fail to fix logic bugs. note that this isn't necessarily a form validation logic bug I'm referring to but, say, a pretty in-the-weeds tool for cleaning and pre-processing data for ML purposes. my guess is basic business-logic-y type stuff is much more doable.

i haven't really found a good way around cases where it either just adds printfs or loops through the same, non-working fixes repeatedly. they keep getting better, they're just not yet below my epsilon for unreliability.

You're not alone. Fixing bugs is usually really easy anyway, and it takes me more effort to feed the context to the LLM than to fix them myself.

The real complicated "bugs" often come from unclear requirements and the hard part is clearing up the requirements. It's more about design than logic errors in the code. And LLMs suck hard at this.

> even reasoning models [...] often fail to fix logic bugs.

I think "often" is the key word here. To be clear, they often fail for me too! But they also often work.

the problem is something that has gone from working 10% of the time to working 50% of the time still requires me to thoroughly review everything it does 100% of the time. hence my comment about "the intern problem".