by brookst 4 days ago
Odd take. So if it identified 17 real gaps and helped fix them, the fact it was wrong about one gap, and the appropriate humans caught it and no harm was done, the whole thing is useless?

Not saying that is the situation, I don’t know. But if “one error is too many” is your point of view… do you think the humans in these orgs are 100% perfect 100% of the time?


> So if it identified 17 real gaps and helped fix them, the fact it was wrong about one gap, and the appropriate humans caught it and no harm was done

How many gaps have humans not caught?

> But if “one error is too many” is your point of view

Yes, in regulated industries "one error is too many" is the only right approach.

Yes, humans also make errors, and there you have a range of options: from tracing and finding the causes of the error (and tightening processes) to literally jailing those responsible. Your hallucination machine will happily "identify" 17 gaps, and create 34 more. And no, there are no processes to make it better. The "make no mistakes" incantation will happily be ignored for obvious reasons, regardless of how many forms of it you throw at it.

It doesn't seem like you're engaging with the material circumstances described above. What does it mean for a human to not catch that a part of a codebase is actually compliant with regulations? What does it mean for the hallucination machine to create 34 more gaps when it doesn't appear to have more than read access? How would it not be useful to have a machine that identifies 17 real crimes that your highly regulated business is unintentionally committing even with a 90% false positive rate?

In a regulated industry 90% false positive rate is indistinguishable from 100% failure rate. Hell, in any industry.

You're basically saying "we need human review for literally everything AI outputs because we have no way of saying whether anything it produces is hallucination or not. And since it produces plausible-sounding things really fast, it puts enormous burden on human reviewers".

I just don't understand where your position is coming from here. You can't distinguish between a machine that says "here, look at these 170 results, 10% of them are highly serious problems that you should address, you should have some people look into that" and one that shrugs and says "I dunno, maybe just double check everything"? I assume you've come to this conclusion based on some reasoning, but you're not sharing it in this response AFAICT.

> You can't distinguish between a machine that says "here, look at these 170 results, 10% of them are highly serious problems that you should address

The machine doesn't say that. It says "Here are 170 completely correct and verified results".

You have to check and verify all of those results yourself, and on any given day it can be anywhere from 0% to 100% incorrect.

> I assume you've come to this conclusion based on some reasoning, but you're not sharing it in this response AFAICT.

The reasoning comes from actually working with AI tools. And the reasoning can be seen in the actual comment this thread started from: https://news.ycombinator.com/item?id=48434824

We're assuming, per the earlier hypothetical, that it has a 10% correctness rate. You had said:

> In a regulated industry 90% false positive rate is indistinguishable from 100% failure rate

So defending that position on the basis of it not actually being a 90% failure rate would mean you shouldn't have taken it in the first place. The fact that the LLM lies about its failure rate is nearly irrelevant; the machine could output the literal string "The following is 90% likely to be a false positive: " followed by the LLM output.

The reasoning in the comment that started the thread is "it's annoying to redo human review". Your position, as I understand it, is that there is no or negative business value in a tool that spits out a list of potential issues of which 10% are real issues with your business. This is what I fail to understand. This would be an incredibly useful first step towards any audit and would save loads of money. Why not?