Hacker News
by skydhash 207 days ago
If a calculator gives me 5 when I do 2+2, I throw it away.

If a PC crashes when I use more than 20% of its soldered memory, I throw it away.

If a mobile phone refuses to connect to a cellular tower, I get another one.

What I want from my tools is reliability. Which is a spectrum, but LLMs are very much on the lower end.

6 comments

You can have this position, but the reality is that the industry is accepting it and moving forward. Whether you'll embrace some of it and use it to improve your workflow is up to you. But exaggerating the problem to this point is kinda funny.
"You exaggerate, and the evidence is PMs are pushing it. PMs can't be wrong, can they?" Somebody really has to know what makes developers tick to write ragebait this good.

I can't even get the most expensive model on Claude to use "ls" correctly, with a fresh context window. That is a command that has been unchanged in Linux for decades. You exaggerate how reliable these tools are. They are getting more useless as more customers are added because there is not enough compute.

I’m not sure what you’re talking about, because I have a completely different experience.
Sorry you're being downvoted even though you're 100% correct. There are use cases where the poor LLM reliability is as good or better than the alternatives (like search/summarization), but arguing over whether LLMs are reliable is silly. And if you need reliability (or even consistency, maybe) for your use case, LLMs are not the right tool.
Honestly, LLMs are about as reliable as the rest of my tools are.

Just yesterday, AirDrop wouldn't work until I restarted my Mac. Google Drive wouldn't sync properly until I restarted it. And a bug in Screen Sharing file transfer used up 20 GB of RAM to transfer a 40 GB file, which used swap space so my hard drive ran out of space.

My regular software breaks constantly. All the time. It's a rare day where everything works as it should.

LLMs have certainly gotten to the point where they seem about as reliable as the rest of the tools I use. I've never seen it say 2+2=5. I'm not going to use it for complicated arithmetic, but that's not what it's for. I'm also not going to ask my calculator to write code for me.

What I want from my tools is autonomy/control. LLMs raise the bar on being at the mercy of the vendor. Anything you can do with an LLM today can silently be removed or enshittified tomorrow, either for revenue or ideological reasons. The forums for Cursor are filled with people complaining about removed features and functional regressions.
Except it's more a case of "my phone won't teleport me to Hawaii sad faec lemme throw it out" than anything else.

There are plenty of people manufacturing their expectations around the capabilities of LLMs inside their heads for some reason. Sure there's marketing; but for individuals susceptible to marketing without engaging some neurons and fact checking, there's already not much hope.

Imagine refusing to drive a car in the 60s because they hadn't reached 1k bhp yet. Ahaha.

> Imagine refusing to drive a car in the 60s because they hadn't reached 1k bhp yet. Ahaha.

That's very much a false analogy. In the 60s, cars were already very reliable (though not as much as today's) and an established mode of transportation. 60s cars are much closer to today's cars than 2000s computers are to current ones.

It's even worse, because even with an unreliable 60s car you could at least diagnose and repair the damn thing when it breaks (or hire someone to do so). LLMs can be silently, subtly wrong and there's not much you can do to detect it let alone fix it. You're at the mercy of the vendor.
> What I want from my tools is reliability. Which is a spectrum, but LLMs are very much on the lower end.

"Reliability" can mean multiple things though. LLM invocations are as reliable (granted you know how to program properly) as any other software invocation; if you're seeing crashes, you're doing something wrong.

But what you're really talking about is "correctness" I think, in the actual text that's been responded with. And if you're expecting/waiting for that to be 100% "accurate" every time, then yeah, that's not a use case for LLMs, and I don't think anyone is arguing for jamming LLMs in there even today.

Where LLMs are useful is where there is no 100% "right or wrong" answer: think summarization, categorization, tagging and so on.

I’m not a native English speaker so I checked on the definition of reliability

  the quality of being able to be trusted or believed because of working or behaving well
For a tool, I expect “well” to mean that it does what it’s supposed to do. My linter is reliable when it catches the bad patterns I wanted it to catch. My editor is reliable when I can edit code with it and the commands do what they’re supposed to do.

So for generating text, LLMs are very reliable. And they do a decent job at categorizing too. But code is a formal language, which means correctness is the end result. A program may be valid and incorrect at the same time.

It’s very easy to write valid code. You only need the grammar of the language. Writing correct code is another matter and the only one that is relevant. No one hires people for knowing a language's grammar and verifying syntax. They hire people to produce correct code (and because few businesses actually want to formally verify it, they hire people who can write code with a minimal amount of bugs and who are able to eliminate those bugs when they surface).
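
To make the valid-versus-correct distinction concrete, here is a minimal Python sketch. The `median` function and the `is_valid` helper are made up for illustration and are not taken from any real codebase. The source parses fine and even gives the right answer for odd-length input, yet it is wrong for even-length input:

```python
import ast

# Hypothetical example: a median function that is syntactically valid
# and runs without errors, but does not implement the median correctly.
SOURCE = '''
def median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]
'''


def is_valid(source: str) -> bool:
    """Return True if the source parses as Python (a grammar check only)."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Load the function so it can be exercised.
namespace = {}
exec(SOURCE, namespace)
median = namespace["median"]

valid = is_valid(SOURCE)          # the grammar check passes
odd_case = median([3, 1, 2])      # 2, which happens to be correct
even_case = median([1, 2, 3, 4])  # 3, but the true median is 2.5
```

A parser, and often a linter too, is satisfied with this code. Only a check against the intended behaviour (here, the even-length case) exposes the bug.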

> For a tool, I expect “well” to mean that it does what it’s supposed to do

Ah, then LLMs are actually very reliable by your definition. They're supposed to output semi-random text, and whenever I use them, that's exactly what happens. Except for the times I create my own models and software, I basically never see any cases where the LLM did not output semi-random text.

They're obviously not made for producing "correct code", because that's a judgement only a human can make. What even is "correct" in that context? Not even us humans can agree on what "correct code" is in all contexts, so assuming a machine could do so seems foolish.

I'm a native English speaker. Your understanding and usage of the word "reliability" is correct, and that's the exact word I'd use in this conversation. The GP is playing a pointless semantics game.
It's not semantics, if the definition is "it does what it’s supposed to do" then probably all of the currently deployed LLMs are reliable according to that definition.
> "it does what it’s supposed to do"

That's the crux of the problem. Many proponents of LLMs overpromise the capabilities, and then deny the underperformance through semantics. LLMs are "reliable" only if you're talking about the algorithms behind the scenes and you ignore the marketing. Going off the marketing, they are unreliable, incorrect, and do not do what they're "supposed to do".

But maybe we don't have to stoop down to the lowest level of conversation about LLMs, the "marketing", and instead do what most of us here do best, focus on the technical aspects, how things work, and how we can make them do our bidding in various ways, you know like the OG hacker.

FWIW, I agree LLMs are massively over-sold for the average person, but for someone who can dig into the tech, use it effectively and for what it works for, I feel like there is more interesting stuff we could focus on instead of just a blanket "No and I won't even think about it".