by ToValueFunfetti 8 days ago
It doesn't seem like you're engaging with the material circumstances described above. What does it mean for a human to not catch that a part of a codebase is actually compliant with regulations? What does it mean for the hallucination machine to create 34 more gaps when it doesn't appear to have more than read access? How would it not be useful to have a machine that identifies 17 real crimes that your highly regulated business is unintentionally committing even with a 90% false positive rate?

In a regulated industry, a 90% false positive rate is indistinguishable from a 100% failure rate. Hell, in any industry.

You're basically saying "we need human review for literally everything AI outputs because we have no way of saying whether anything it produces is hallucination or not. And since it produces plausible-sounding things really fast, it puts enormous burden on human reviewers".

I just don't understand where your position is coming from here. You can't distinguish between a machine that says "here, look at these 170 results, 10% of them are highly serious problems that you should address, you should have some people look into that" and one that shrugs and says "I dunno, maybe just double check everything"? I assume you've come to this conclusion based on some reasoning, but you're not sharing it in this response AFAICT.

> You can't distinguish between a machine that says "here, look at these 170 results, 10% of them are highly serious problems that you should address

The machine doesn't say that. It says "Here are 170 completely correct and verified results".

You have to check and verify all of those results yourself, and on any given day it can be anywhere from 0% to 100% incorrect.

> I assume you've come to this conclusion based on some reasoning, but you're not sharing it in this response AFAICT.

The reasoning comes from actually working with AI tools. And the reasoning can be seen in the actual comment this thread started from: https://news.ycombinator.com/item?id=48434824

We're assuming, per the earlier hypothetical, that it has a 10% correctness rate. You had said:

> In a regulated industry 90% false positive rate is indistinguishable from 100% failure rate

So defending that position on the basis of it not actually being a 90% failure rate would mean you shouldn't have taken it in the first place. The fact that the LLM lies about its failure rate is nearly irrelevant; the machine could output the literal string "The following is 90% likely to be a false positive: " followed by the LLM output.

The reasoning in the comment that started the thread is "it's annoying to redo human review". Your position as I understand it is that there is no or negative business value to a tool that spits out a list of potential issues of which 10% are real issues with your business. This is what I fail to understand. This would be an incredibly useful first step towards any audit and would save loads of money. Why not?
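
For what it's worth, the arithmetic behind the hypothetical is small enough to write down. This is only a sketch using the thread's own numbers (170 flagged results, 10% of them real); the function name `triage_summary` and its fields are made up for illustration, and it assumes the 10% figure actually holds, which is the point under dispute.

```python
def triage_summary(flagged: int, precision: float) -> dict:
    """Summarize a flagged list, assuming a known fraction of it is real.

    `precision` is the assumed share of flagged items that are real issues
    (0.10 in the thread's hypothetical). It is an input, not something the
    tool is claimed to report reliably.
    """
    real = round(flagged * precision)
    false_positives = flagged - real
    # How many flagged items a reviewer reads, on average, per real issue found.
    reviews_per_real = flagged / real if real else float("inf")
    return {
        "real_issues": real,
        "false_positives": false_positives,
        "reviews_per_real_issue": reviews_per_real,
    }


summary = triage_summary(flagged=170, precision=0.10)
print(summary)
```

Under those assumptions a reviewer goes through 170 items to find the 17 real ones, i.e. 10 reviews per real issue. Whether that is cheap or expensive depends on what the alternative review would have cost, which the thread doesn't specify.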