by kbolino 459 days ago
There are two problems with this comparison. First, probabilistic prime generation has a mathematically proven lower bound on correctness that improves with each iteration. There is no comparably robust tuning parameter with an LLM. You can use a different model, you can use a bigger variant of the same model, etc., but these all have empirically determined and contextually sensitive reliability levels that are not otherwise tunable. Second, the prime generation function will always give you an integer, and never an apple, or a bicycle, or a phantasm. LLMs regurgitate and hallucinate, which means that a simple error rate is not the only metric that matters. One must also consider how egregiously wrong and even nonsensical the errors can be.

I think the better statement is that, if, say, you're running the Miller-Rabin test 10 times, you can be confident that an error in one test is uncorrelated with an error in the next test, so it's easy to dial up the accuracy as close to 1 as desired. Whereas with an LLM, correlated errors seem much more likely; if it failed three times parsing the same piece of data, I would have no confidence that the 4th-10th times would have the same accuracy rate as on a fresh piece of data. LLMs seem much more like the Fermat primality test, except that their "Carmichael numbers" are a lot more common.
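To make the Miller-Rabin versus Fermat contrast concrete, here is a minimal Python sketch. The function names are made up for illustration, and 561 is used because it is the smallest Carmichael number. It relies on the standard result that, for an odd composite n, at most 1/4 of the bases fool a single Miller-Rabin round, so k independent random rounds err with probability at most 4^-k (under 10^-6 for k = 10).

```python
import random
from math import gcd

def fermat_passes(n, a):
    """Fermat test for one base a: True means 'n looks prime to base a'."""
    return pow(a, n - 1, n) == 1

def miller_rabin_round(n, a):
    """One Miller-Rabin round for odd n > 2: True means 'probably prime'."""
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    x = pow(a, d, n)
    if x == 1 or x == n - 1:
        return True
    for _ in range(s - 1):
        x = pow(x, 2, n)
        if x == n - 1:
            return True
    return False  # a is a witness: n is definitely composite

def is_probable_prime(n, rounds=10, rng=random):
    """Run `rounds` independent Miller-Rabin rounds with random bases."""
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    return all(miller_rabin_round(n, rng.randrange(2, n - 1))
               for _ in range(rounds))

CARMICHAEL = 561  # 3 * 11 * 17, composite

# Every base coprime to 561 fools the Fermat test...
fermat_fooled = all(fermat_passes(CARMICHAEL, a)
                    for a in range(2, CARMICHAEL) if gcd(a, CARMICHAEL) == 1)

# ...while only a small minority of bases fool a Miller-Rabin round.
strong_liars = sum(miller_rabin_round(CARMICHAEL, a)
                   for a in range(2, CARMICHAEL - 1))
```

The point of the sketch is the difference in failure structure. For the Fermat test, retrying with another coprime base never helps on a Carmichael number. For Miller-Rabin, each fresh random base is an independent chance to catch the composite.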
I compare LLMs to a door with a slot where you put a piece of paper with a request on it and you get something back related to that request. If it's the same every time, great. But it might be different or completely wrong. You don't know what goes on behind the door, and measuring the error rate gives you little predictive power.
The general point is not that the feature currently exists to dial down the LLM parse error rate, it’s that the abstract argument “we can’t use LLMs because they aren't perfect” isn’t a realistic argument in the first place. You’re probably reading this on hardware that _probably_ shows you the correct text most all of the time but isn’t guaranteed to.
There's no such thing as a perfectly-watertight roof, therefore there's no qualitative difference between fixing the roof and buying a bigger bucket.
Precisely this. People dismiss the utility of LLMs because they don't give 100% reliability, without considering the basic facts that:

- LLMs != ChatGPT interface, they don't need to be run in isolation, nor do they need to do everything end-to-end.

- There are no 100% reliable systems - neither technological nor social. Voltages fluctuate, radiation flips bits, humans confabulate just as much if not worse than LLMs, etc.

- We create reliability from unreliable systems.

LLMs aren't some magic unreliability pixie dust that makes everything they touch beyond repair. They're just another system with bounded reliability, and can be worked into larger systems just like anything else, and total reliability can be improved through this.

EDIT: In fact, my example with probabilistic primality tests is bad because those tests are too nice - they let us compute tight bounds on the error rate in advance. LLMs are not like that. But then, a lot of systems we rely on in our daily lives also have this property - their reliability is established empirically, i.e. we improve them until they work reliably enough, and then we hope they'll keep on working, and deal with random failures when they occur. So that's nothing new, either.
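One way to put numbers on "we create reliability from unreliable systems" is the textbook majority-vote calculation, sketched below in Python. It rests on two assumptions that this thread disputes for LLMs: that each trial is correct with the same probability p, and that trials fail independently. The 0.7 figure is purely illustrative, and `majority_accuracy` is a hypothetical helper name.

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a strict majority of n independent trials is correct,
    when each trial is correct with probability p. Use odd n to avoid ties."""
    need = n // 2 + 1  # smallest number of correct trials that forms a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(need, n + 1))

# Illustrative only: a component that is right 70% of the time.
single = majority_accuracy(0.7, 1)       # one trial, no voting
triple = majority_accuracy(0.7, 3)       # best of three
eleven = majority_accuracy(0.7, 11)      # best of eleven
```

Under independence the vote improves as n grows whenever p > 0.5. If the errors are perfectly correlated (every retry fails on the same inputs), voting buys nothing and the accuracy stays at p, which is the concern raised about LLM retries earlier in the thread.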

No, LLMs do not have "bounded reliability". All reliability figures for LLMs are based upon empirical observation in specific contexts using artificial benchmarks. As they say in finance, "past performance is not indicative of future results".

Saying LLMs are no worse than random bit flips is, again, an unjustified comparison. We can control bit errors with ECC, we cannot control the output of an LLM except to shackle it into uselessness.
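For readers unfamiliar with how ECC controls bit errors, here is a minimal Python sketch of the textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. It is an illustration of the principle only, not the scheme any particular memory hardware uses, and the function names are made up for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.
    Layout by 1-indexed position: p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-indexed position of the flipped bit, or 0 if clean.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

def flip(codeword, position):
    """Return a copy of the codeword with one bit (0-indexed) flipped."""
    out = list(codeword)
    out[position] ^= 1
    return out
```

The correction works because the error model is known in advance: a fault flips a bit, and the code is designed so that every single-bit flip produces a distinct syndrome. That structural guarantee is what the comment above argues is missing for LLM output.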

I said bounded. I didn't say how tight. But all of science is about bounding empirical observations, so this is nothing new - nor is relying on systems with empirically established failure rates, which is a good chunk of what engineering is about.
The number of 9s that can be assigned to these "bounds" currently is zero. They are not even 90% reliable. And there is no straightforward way to get to 90%, never mind 95%, 99%, etc. The sliding scale of reliability you originally presented just does not exist.

Yeah, sure, we can hypothetically engineer a system that tolerates a key step in the process which has, say, a 30% chance of being wrong, including a 10% chance of being dangerously wrong (appears correct but is broken in subtle ways), and a 5% chance of being batshit insane, but why would we? The amount of training, vetting, and supervision of human operators necessary to make a working process here immediately raises the question of whether the machine serves man or the other way around.

The best uses of an LLM are those where engineering levels of precision are neither required nor useful.

I see people hallucinate on HN all the time. We tolerate it. Why should we? We should if the overall inclusion of unreliable things (humans) provides value. The error rate for LLMs doesn't matter. The net value does. So if the value is great enough to tolerate the error rate, we do. We don’t categorically dismiss the technology because it can fail really poorly. We design things all the time which can fail catastrophically. Seriously. So LLMs will appear anywhere where the net value is positive. Maybe you’re taking a more nuanced stance, but I see a lot of “if it can hallucinate even once we can’t use it” rhetoric here. And that’s simply irrational. Even “we can’t use it for important things” is wrong. Doctors are using LLMs today to help collate observed data and suggest diagnoses. A trained professional in the loop mitigates the “terrible failure”. So no, I don’t even agree that LLMs shall be relegated to non-important things.
Finance is an excellent analogy. Relying on LLM output is similar to relying on the stock market. You might come out ahead but it's always a gamble and the lower bound is always catastrophic failure.