| HN Mirror


by godelski 264 days ago

(Adding to your comment, not disagreeing)

  > The argument is that this technology leads people to be careless.

And this will always be a result of human preference optimization. There's a simple fact: humans prefer lies that they don't know are lies over lies that they do know are lies.

We can't optimize for an objective truth when that objective truth doesn't exist. So while doing our best to align our models, they must simultaneously optimize their ability to deceive us. There's little to no training in that loop where outputs are deeply scrutinized, because we can't scale that type of evaluation. We end up rewarding models that are incorrect in their output.

We don't optimize for correctness, we optimize for the appearance of correctness. We can't confuse the two.
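To make that concrete, here's a toy sketch. It is not how any real preference-tuning pipeline is implemented; the `Output` class, its fields, and `evaluator_reward` are all made up for illustration. The only assumption is that the evaluator can score what they perceive and nothing else:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    text: str
    is_correct: bool     # ground truth, hidden from the evaluator (hypothetical field)
    looks_correct: bool  # what the evaluator can actually perceive (hypothetical field)


def evaluator_reward(output: Output) -> int:
    # The evaluator scores only the appearance; the hidden truth never enters.
    return 1 if output.looks_correct else 0


candidates = [
    Output("right and convincing", True, True),
    Output("wrong but convincing", False, True),
    Output("wrong and obviously wrong", False, False),
]

# Everything that looks right is rewarded, including the convincing error.
rewarded = [o for o in candidates if evaluator_reward(o) == 1]
rewarded_errors = [o for o in rewarded if not o.is_correct]
```

In this toy setup the only wrong output that gets filtered out is the one that is easy to spot. The convincing error earns exactly the same reward as the correct answer.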

The result is: when LLMs make errors, those errors are difficult for humans to detect.

This results in a fundamentally dangerous tool, does it not? Good tools, when they error or fail, do so safely and loudly. Instead this one fails silently. That doesn't mean you shouldn't use the tool but that you need to do so with an abundance of caution.

  > I could slow down and review it line-by-line, picking all the nits, but that moves against the grain of the tool.

Actually the big problem I have with coding with LLMs is that it increases my cognitive load, not decreases it. Being overworked results in carelessness. Who among us does not make more mistakes when they are tired or hungry?

That's the opposite of lazy, so hopefully answers OP.


johnisgood 264 days ago

I use LLMs for coding and I like it the way I am using it. I do not outsource thinking, and I do not expect it to know what I want without giving it context to my thoughts with regard to the project. I have written a 1000 LOC program in C using an LLM. It was a success. I have reviewed it "line by line" though, I do not know why I would not do this. Of course it did not spit out 1000 LOC from the get go, we started small and we built upon our foundations. It has an idea of my thinking and my preferences with regard to C and the project because of our interactions that gave it context.

godelski 264 days ago

  > I have written a 1000 LOC program in C using an LLM. 
  > I have reviewed it "line by line" though, I do not know why I would not do this.

1k LOC is not that much. I can easily do this in a day's project.

But it's pretty rare you're going to be able to review every line in a mature project, even if you're developing that project. Those can contain hundreds or even thousands of files with hundreds (hopefully not thousands) of LOC. While it's possible to review every line it's pretty costly in time and it's harder since the code is changing as you're doing this...

Think of it this way, did you also review all the lines of code in all the libraries you used? Why not? The reasoning will be pretty similar. This isn't to say we shouldn't spend more time exploring the code we work with nor that we likely wouldn't benefit from this, but that time is a scarce resource. So the problem is when the LLM is churning out code faster than you can review.

While coding you are hopefully also debugging and thinking. By handing coding over to the LLM you decouple this. So you reduce your time writing lines of code but increase time spent debugging and analyzing. There will be times where this provides gains but IME this doesn't happen in serious code. But yeah, my quick and dirty scripts can be churned out a lot faster. That saves time, but not 10x. At least not for me.

johnisgood 263 days ago

So when people talk about safety, it does matter in Rust, right? Because "1k LOC is not that much. I can easily do this in a day's project.". Why should we choose Rust over anything below 100k LOC if it is nothing?

I am just asking. Everyone says 1k LOC is nothing, yet they want to replace 1k LOC in C with 1k LOC in Rust. You can do it in a day. You are a professional!

Or what is your point? That 1k LOC projects are useless or pointless? Because if so, I seriously beg to differ.

> So the problem is when the LLM is churning out code faster than you can review.

I start small, I can review just fine.

godelski 263 days ago

  > when people talk about safety, it does matter in Rust, right?

I'm not sure I would make this about languages. Different languages have different advantages, but there's always a trade-off, right? For example, this `cp` issue is a bit of a problem for the coreutils rewrite[0]. I think you gotta ask the question: what benefit does rewriting in rust provide? Potentially more safety, but also something like coreutils has been heavily investigated for the past few decades. Rewriting also comes with the chance of introducing new bugs. So is it safer? Hard to say, right? Especially since Rust is still new and there's not a lot of major software written in it.

  > Everyone says 1k LOC is nothing
  > Or what is your point?

The point we're trying to make is that lines of code are not the bottleneck. Probably one of the big problems with our industry right now is an over reliance on metrics (KPIs). But what can you measure in coding? Lines? Commits? Tickets? Is any of that meaningful?

I said in another comment[1] that I've spent hours or days to write /one line/ of code, or even partial. Does that mean I was doing a bad job? Was I just slacking off? I think this is something many developers have experienced. Were we all lazy? Dumb?

I'd argue that you can't answer that question from the information alone. Sometimes a single line of code is crazy hard to figure out. If you haven't seen this before, allow me to introduce you to some old coding lore[2]

  //When I wrote this, only God and I understood what I was doing
  //Now, God only knows

The thread has other examples of where people have wasted time trying to understand some "magic". Or maybe you know Carmack's Fast Inverse Square Root Algo[3]. Look at that one. It's 7 LOC (5?) yet those lines are so powerful. That is not the type of code someone writes in a flow state, off the top of their head. That is the type of code you write because you used a profiler[4], found the bottleneck, and optimized the crap out of it. Writing 7 lines takes no time, but I'm sure that code took at least a week to write.
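For anyone who hasn't seen it, here's a rough Python transliteration of the widely circulated routine. To be clear, this is a sketch and not the original C: the function name is mine, `struct` stands in for the pointer cast, and the Newton step runs in Python's double precision. The constant `0x5F3759DF` is the one from the well-known version.

```python
import struct


def fast_inverse_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for positive x using the bit-level trick."""
    # Reinterpret the 32-bit float's bits as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The "magic" line: constant minus half the bit pattern gives a first guess.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the integer bits as a 32-bit float again.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson iteration refines the guess.
    return y * (1.5 - 0.5 * x * y * y)


approx = fast_inverse_sqrt(4.0)  # close to, but not exactly, 0.5
```

Typing that out takes a minute. Figuring out that the subtraction line works at all is the part that takes the week.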

The point here is that it is really hard to measure the quality and effectiveness of a programmer. The context of the problem is not something that can be abstracted away when evaluating them. Unfortunately, this means to evaluate them you also need to be an expert programmer AND have meaningful context to understand the specific problems they are working on. There's not a thing you can do from a spreadsheet. The truth is that if you optimize from the spreadsheet you'll only introduce more Jira tickets. There's a joke that there's 2 types of 10x programmers. The other one is the programmer that introduces 100x the Jira tickets while completing them 10x as fast. The problem is that that programmer doesn't see the bigger scope and makes mistakes that lead to new tickets. This might be like your new rockstar junior dev. They fill out tickets but are solving the problems in isolation, not in context of the codebase. This leads to more complexity and bugs later on, but that lag in effect is hard to measure/identify, so it is easy to think they are a rockstar when they are actually a problem.

  > I start small, I can review just fine.

Yes, and this is how you should do it. I mentioned Unix Philosophy[5] previously. But the thing is that projects continue. Scope expands. If you want to keep writing small programs and integrate them together then you actually need to think quite carefully about the design and implementation of them (again, see Unix Philosophy).

So the point is that everything is highly context driven. That's what matters. You need nuance and care. It is not easy to say what makes good code or even identify it. So... LGTM

[0] https://github.com/uutils/coreutils/issues/7092

[1] https://news.ycombinator.com/item?id=45409430

[2] https://stackoverflow.com/questions/184618/what-is-the-best-...

[3] https://betterexplained.com/articles/understanding-quakes-fa...

[4] https://news.ycombinator.com/item?id=45060059

https://news.ycombinator.com/item?id=44416817

[5] https://en.wikipedia.org/wiki/Unix_philosophy

johnisgood 262 days ago

In that case, I agree with you with everything and I do actually try to do it the way you mentioned.

And I am an expert programmer (I would like to believe) and I use LLMs just to get some refresher of my options and whatnot, and I choose where the project goes, with my knowledge. All my prompts are very specific, which requires knowledge.

kiitos 264 days ago

nobody in this or any meaningful software engineering discussion is talking about software projects that are 1000, or even 10000, SLoC. these are trivial and uninteresting sizes. the discussion is about 100k+ SLoC projects.

johnisgood 264 days ago

I do not see how this is always necessarily implied. And should I seriously always assume this is the case? Where are you getting this from? None of the projects people claim to have successfully (or not) written with the help of an LLM have 10k LOC, let alone >100k. Should they just be ignored because LOC is not >100k?

Additionally, why is it that whenever I mention success stories accomplished with the help of LLMs, people rush to say "does not count because it is not >100k LOC"? Why does it not count, why should it not count? I would have written it by hand, but I finished much faster with the help of an LLM. These are genuine projects that solve real problems. Not every significant project has to have >100k LOC. I think we have a misunderstanding of the term "significant".

> nobody in this or any meaningful software engineering discussion is talking about software projects that are 1000, or even 10000, SLoC.

Why?

> these are trivial and uninteresting sizes.

In terms of what exactly?

Jensson 264 days ago

> Why?

Because small programs are really quick and easy to write, there was never a bottleneck making them and the demand for people to write small programs is very small.

The difficulty of writing a program scales superlinearly with size. An experienced programmer in his current environment easily writes a 500 line program in a day, but writing 500 meaningful lines to an existing 100k line codebase in a day is not easy at all. So almost all developer time in the world is spent making large programs; small programs are a drop in the ocean, and automating that doesn't make a big difference overall.

Small programs can help you a lot, but that doesn't replace programmers since almost no programmers are hired to write small programs, instead automatically making such small programs mostly helps replace other tasks like regular white collar workers etc whose jobs are now easier to automate.

godelski 264 days ago

  > but writing 500 meaningful lines to an existing 100k line codebase in a day is not easy at all.

I've had plenty of instances where it's taken more than a day to write /one line/ of code! I suspect most experienced devs have also had these types of experiences.

Not because the single line was hard to write but because of the context in which it needed to be written.

Typing was never the bottleneck and I'm not sure why this is the main argument for LLMs (e.g. "LLMs save me from the boilerplate"). When typing is a bottleneck it seems like it's more likely that the procedure is wrong. Things like libraries, scripts, and skeletons tend to be far better solutions for those problems. In tough cases abstraction can be extremely powerful, but abstraction is a difficult tool to wield.

The bottleneck is the thinking and analyzing.

bccdee 263 days ago

> Things like libraries, scripts, and skeletons tend to be far better solutions for those problems.

My feelings exactly.

LLM code generation (at least, the sort where people claim they're being 10X-ed) feels like it competes with frameworks. "An agent built this generic CRUD webapp on its own with only 30 minutes of input from me!"—well, I built an equivalent webapp in 30 minutes with Django. These are off-the-shelf solutions to solved problems. Yes, a framework like Django requires up-front learning, but in the end it leaves you with fewer lines of code to maintain, as opposed to custom-generated LLM code.

wild_egg 264 days ago

There's an argument to be made that this gap is actually highlighting design issues rather than AI limitations.

It's entirely possible to have a 100k LOC system be made up of effectively a couple hundred 500 line programs that are composed together to great effect.

That's incredibly rare but I did once work for a company who had such a system and it was a dream to work in. I have to think AIs are making a massive impact there.

godelski 264 days ago

  > It's entirely possible to have a 100k LOC system be made up of effectively a couple hundred 500 line programs that are composed together to great effect.

I'm confused. Are you imagining a program with 100k LoC is contained in a single file? Because you'd be insane to do such a thing. It's normally a lot of files with not many LoC each, which de facto meets this criterion.

You may also wish to look at UNIX Philosophy. The idea that programs should be small and focused. A program should do one thing and do it well. But there's a generalization to this philosophy when you realize a function is a program.

I do agree there's a lot of issues with design these days but I think you've vastly oversimplified the problem.

bccdee 263 days ago

> It's entirely possible to have a 100k LOC system be made up of effectively a couple hundred 500 line programs that are composed together to great effect.

To me, this sounds like a nightmare; I'm sure anyone who's worked at a shop with way too many microservices would agree. It's trivial to right-click a function call and jump to its definition; much harder to trace through your service mesh and find out what, exactly, is running at `load-balancer.kube.internal:8080/api`.

startupsfail 264 days ago

> There's a simple fact: humans prefer lies that they don't know are lies over lies that they do know are lies.

As an engineer and researcher, I prefer lies (models, simplifications), that are known to me, rather than unknown unknowns.

I don't need to know exact implementation details, knowledge of aggregate benchmarks, fault rates and tolerances is enough. A model is a nice to have.

This approach works in science (physics, chemistry, biology, ...) and in engineering (including engineering agentic and social systems: social engineering).

godelski 264 days ago

  > As an engineer and researcher, I prefer lies (models, simplifications), that are known to me, rather than unknown unknowns.

I think you misunderstood.

I'll make a corollary to help:

  ~> There's a simple fact: humans prefer lies that they believe are truths over lies that they do know are lies.

I'm unsure if you: misread "lies that they don't know are lies", conflated unknown unknowns with known unknowns, or (my guess) misunderstood that I am talking about the training process which involves a human evaluator evaluating an LLM output. That last one would require the human evaluator to preference a lie they know is a lie over a lie that they do not know is actually a lie. I think you can see how we can't expect such an evaluation to occur (except through accident). For the evaluator to preference the unknown unknown they would be required to preference what they believe to be a falsehood over what they believe is truth. You'd throw out such an evaluator for not doing their job!

As a researcher myself, yes, I do also prefer known falsehoods over unknown falsehoods but we can only do this from a metaphysical perspective. If I'm aware of an unknown then it is, by definition, not an unknown unknown.

How do you preference a falsehood which you cannot identify as a falsehood?

How do you preference an unknown which you do not know is unknown?

We have strategies like skepticism to help deal with this, but this doesn't make the problem go away. It ends up with "everything looks right, but I'm suspicious". Digging in can be very fruitful but is more frequently a waste of time for the same reason: if a mistake exists we have not identified the mistake as a mistake!

  > I don't need to know exact implementation details, knowledge of aggregate benchmarks, fault rates and tolerances is enough.

I think this is a place where there's a divergence in science and engineering (I've worked in both fields). The main difference in them is at what level of a problem you're working on. At the more fundamental level you cannot get away with empirical evidence alone.

Evidence can only bound your confidence in the truth of some claim but it cannot prove it. The dual to this is a much simpler problem, as disproving a claim can be done with a singular example. This distinction often isn't as consequential in engineering as there are usually other sources of error that are much larger.

As an example, we all (hopefully) know that you can't prove the correctness of a program through testing. It's a non-exhaustive process. BUT we test because it bounds our confidence about its correctness and we usually write cases to disprove certain unintended behaviors. You could go through the effort to prove correctness but this is a monumental task and usually not worth the effort.
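A minimal illustration of that asymmetry, using a deliberately buggy leap year check. The function and the test cases are made up for the example:

```python
def is_leap_year(year: int) -> bool:
    # Deliberately buggy: ignores the century rules of the Gregorian calendar.
    return year % 4 == 0


# A plausible but non-exhaustive test suite. Every case passes.
test_cases = {2000: True, 2020: True, 2023: False, 2024: True}
all_tests_pass = all(
    is_leap_year(year) == expected for year, expected in test_cases.items()
)

# One counterexample is enough to disprove correctness:
# 1900 was not a leap year, yet the function says it was.
counterexample_found = is_leap_year(1900) is not False
```

Four green tests raised our confidence but proved nothing. A single input, 1900, is all it takes to show the function is wrong.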

But right now we're talking about a foundational problem and such a distinction matters here. We can't resolve the limits of methods like RLHF without considering this problem. It's quite possible that there's no way around this limitation since there are no objective truths for the majority of tasks we give LLMs. If that's true then the consequence is that a known unknown is "there are unknown unknowns". And like you, I'm not a fan of unknown unknowns.

We don't actually know the fault rates nor tolerances. Benchmarks do not give that to us in the general setting (where we apply our tools). This is a very different case than, say, understanding the performance metrics and tolerances of an o-ring. That part is highly constrained and you're not going to have a good idea of how well it'll perform as a spring, despite those tests having a lot of related information.