Hacker News
by rustypotato 667 days ago
> But when developers put AI in consumer products, people expect it to behave like software, which means that it needs to work deterministically. If your AI travel agent books vacations to the correct destination only 90% of the time, it won’t be successful.

This is the fundamental problem that prevents generative AI from becoming a "foundational building block" for most products. Even with rigorous safety measures in place, there are few guarantees about its output. AI is about as solid as sand when it comes to determinism, which is great if you're trying to sell sand, but not so great if you're trying to build a huge structure on top of it.

8 comments

I've made this statement a bunch in other mediums: The reason AI software is always "AI software" and not just a useful product is because AI is fallible.

The reason we can build such deep and complex software systems is because each layer can assume the one below it will "just work". If it only worked 99% of the time, we'd all still be interfacing with assembly, because we'd have to be aware of the mistakes that were made and deal with them; otherwise the errors would compound until software was useless.

Until AI achieves the level of determinism we have with other software, it'll have to stay at the surface.

Recent work from Meta uses AI to automatically increase test coverage with zero human checking of AI outputs. They do this with a strong oracle for AI outputs: whether the AI-generated test compiles, runs, and hits yet-unhit lines of code in the tested codebase.

We probably need a lot more work along this dimension of finding use cases where strong automatic verification of AI outputs is possible.
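
A rough sketch of what such an automated oracle could look like, in Python. Everything here (the `Candidate` record, its fields, `accept`, `filter_candidates`) is hypothetical and not taken from Meta's tool; it only mirrors the three checks described above: the test builds, it passes on every run, and it hits lines that were not covered before.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record of one AI-generated test and what happened when it ran."""
    name: str
    builds: bool
    run_results: tuple    # pass/fail outcome of each repeated run
    lines_hit: frozenset  # source lines executed by the test


def accept(candidate, already_covered):
    """Keep a test only if it builds, passes on every run, and hits new lines."""
    if not candidate.builds:
        return False
    if not candidate.run_results or not all(candidate.run_results):
        return False  # never ran, or flaky
    return bool(candidate.lines_hit - already_covered)


def filter_candidates(candidates, already_covered):
    """Apply the oracle in order, growing the covered set as tests are kept."""
    covered = set(already_covered)
    kept = []
    for c in candidates:
        if accept(c, covered):
            kept.append(c)
            covered |= c.lines_hit
    return kept, covered
```

Note that nothing in this filter asks whether the asserted behavior is the intended one, which is the gap the replies point at.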

> with zero human checking of AI outputs

It can be hard enough for humans to just look at some (already consistently passing) tests and think, "is X actually the expected behavior or should it have been Y instead?"

I think you should have a look at the abstract, especially this quote:

> 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage. During Meta's Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which it was applied, with 73% of its recommendations being accepted for production deployment by Meta software engineers

This tool sounds awesome in that it generated real tests that engineers liked! "zero human checking of AI outputs" is very different though, and "this test passes" is very different from "this is a good test"

Good points regarding test quality. One takeaway for me from this paper is that you can increase code coverage with LLMs without any human checking of LLM outputs, because it’s easy to make a fully automated checker. Pure coverage may not be super-interesting but it’s still fairly interesting and nontrivial. LLM-based applications that run fully autonomously without bubbling hallucinations up to users seem elusive but this is an example.
You hit the nail on the head. It's been almost tragically funny how people have frantically tried to juggle 5 bars of wet soap over the past 2 years, solving problems that (from what I've seen so far) had already been solved in a (boring) deterministic way that consumes far fewer resources.

Going further, our predecessors put so much work into getting non-deterministic electronics to provide us with a stable and _correct_ platform that it looks ridiculous how people were trying to squeeze another layer of non-determinism in between to solve the same classes of problems.

The irony here is that there are many domains that use statistical methods and successfully bound their complexity and failure modes. A lot of people struggle with statistics, but in domains where the glove fits I think AI will slot in all across the stack really nicely.
But software works only 99% of the time. For some definition of work: 99% of days it's run, 99% of clicks, 99% of CPU time in given component, 99% of versions released and linked into some business' production binary, 99% of github tags, 99% of commits, 99% of software that that one guy says is battle-tested
If twenty components each work 99% of the time, then they only have a 0.99^20 ≈ 82% chance of working as a collective.

If your 5.1 GHz (billion instructions per second) CPU had a 0.00000001% chance of failing at a given instruction, you'd have a 40% chance of a crash every second.

If every flight had a 1% chance of killing everyone aboard, then 10 million passengers/day * 1% = 100,000 people would die in plane crashes every day.
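
For anyone who wants to check those three figures, here is a quick Python sketch. It assumes independent failures and, like the comment, one instruction per clock cycle; the variable names are mine.

```python
import math

# 20 independent components, each working 99% of the time
p_all_work = 0.99 ** 20

# 5.1e9 instructions per second, each failing with probability 1e-10 (0.00000001%)
instructions_per_second = 5.1e9
p_fail = 1e-10
# P(at least one failure) = 1 - (1 - p)^n; log1p keeps precision for tiny p
p_crash_per_second = 1 - math.exp(instructions_per_second * math.log1p(-p_fail))

# 10 million passengers per day, 1% of them on a flight that is lost
deaths_per_day = 10_000_000 / 100

print(f"all 20 components work: {p_all_work:.0%}")
print(f"crash within one second: {p_crash_per_second:.0%}")
print(f"deaths per day: {deaths_per_day:,.0f}")
```

The results come out at roughly 82%, 40%, and 100,000, matching the numbers quoted above.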

Gambler's fallacy
Software works so much more than 99% of the time that it's a rather deliberate strawman to claim otherwise.

Newly-"AI"-branded things that I have touched work substantially less than 90% of the time. There are like 3 orders of magnitude difference, even people who aren't paying any attention at all are noticing it.

Do you have to write your code presuming that sometimes 'a + b' will be wrong? I don't.

Software pretty much always "works" when you consider the definition of work to be "does what the programmer told it to". AI? Not so much.

It’s all about limits and edge cases. a+b may “fail” at INT_MAX and at 0.1+0.2. You don’t `==` your doubles, you don’t (a+b)/2 your mid, and you don’t ask ai to just book you vacation. You ask it to “collect average sentiment from `these_5k_reviews()` ignoring apparently fake ones, which are defined as <…>”. You don’t care about determinism because it’s a statistical instrument.
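
The edge cases named here fit in a few lines of Python. This is only an illustration: Python integers never overflow, so the INT_MAX case is simulated with a hypothetical `wrap_int32` helper that mimics 32-bit signed wraparound.

```python
import math

INT_MAX = 2**31 - 1  # largest 32-bit signed integer


def wrap_int32(x):
    """Simulate 32-bit signed wraparound (Python ints themselves never overflow)."""
    return (x + 2**31) % 2**32 - 2**31


def midpoint(lo, hi):
    """Overflow-safe midpoint: never forms lo + hi."""
    return lo + (hi - lo) // 2


# 0.1 + 0.2 is not exactly 0.3 in binary floating point, so compare with a tolerance
exact_equal = (0.1 + 0.2 == 0.3)             # False
close_enough = math.isclose(0.1 + 0.2, 0.3)  # True

# a + b wraps around at INT_MAX in fixed-width arithmetic
wrapped = wrap_int32(INT_MAX + 1)            # -2147483648

# (a + b) / 2 goes wrong once a + b has wrapped
lo, hi = INT_MAX - 2, INT_MAX
naive_mid = wrap_int32(lo + hi) // 2         # -2, nowhere near lo or hi
safe_mid = midpoint(lo, hi)                  # INT_MAX - 1
```

In each case the operation is fine; what matters is knowing where its limits are and writing around them.
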
> and you don’t ask ai to just book you vacation. You ask it to “collect average sentiment from `these_5k_reviews()` ignoring apparently fake ones, which are defined as <…>”.

That's exactly my point. You have to interact directly with the A.I. and be aware of what it's doing.

That's not true. If software works correctly today then users can expect it to work correctly tomorrow. If it doesn't work any more that's a bug.
structured outputs help, paired with regular old systems design I think you can get pretty far. it really depends what you're building though.

>If your AI travel agent books vacations to the correct destination only 90% of the time

that would be using the wrong tool for the job. an AI travel agent would be very useful for making suggestions, either for destinations or giving a list of suggested flights, hotels etc, and then hand off to your standard systems to complete the transaction.

there are also a lot of systems that tolerate "faults" just fine such as image/video/audio gen
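
a minimal sketch of the "structured output, then hand off to your standard systems" idea, in Python. the JSON shape, the `ALLOWED_DESTINATIONS` set and `parse_suggestion` are all made up for illustration; the point is only that the model's output gets checked by ordinary deterministic code before anything transactional happens.

```python
import json

# Hypothetical airport codes that the (deterministic) booking system supports
ALLOWED_DESTINATIONS = {"LIS", "MEX", "NRT"}


def parse_suggestion(raw):
    """Return a validated suggestion dict, or None if the model output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    dest = data.get("destination")
    nights = data.get("nights")
    if not isinstance(dest, str) or dest not in ALLOWED_DESTINATIONS:
        return None
    # bool is a subclass of int in Python, so reject it explicitly
    if not isinstance(nights, int) or isinstance(nights, bool):
        return None
    if not 1 <= nights <= 30:
        return None
    return {"destination": dest, "nights": nights}
```

anything that returns `None` gets retried or shown to the user as "no suggestion"; only validated suggestions reach the booking code.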

> that would be using the wrong tool for the job. an AI travel agent would be very useful for making suggestions

But that’s a recommendation engine and we have that already all over the place.

We have lists with shallowly gamed results all over the place, which work in the owners'/bots' favor, not yours. You can’t expect something not running on your device (or on a gpu rented from a third party) to work in your interest.
And hopefully a real recommendation engine won't be weirdly biased towards different answers depending on the exact phrasing, tone, and idiom of the request.
Yeah how am I supposed to raise millions of dollars without a working product by selling yesterday’s tech??
i 100% agree. people get so caught up on trying to do everything 90% right with AI, but they forget there's a reason most websites offer at least 2 9's of uptime.
I’m not really sure what your stance is here, because you say you agree with the GP but then throw out some figures that clearly disagree with the author's point (99% uptime is vastly greater than 90% accuracy).
> If your AI travel agent books vacations to the correct destination only 90% of the time, it won’t be successful.

Well, I don't agree. I think there are ways to make this successful, but you have to be honest about the limitations you're working with and play to your strengths.

How about an AI travel agent that gets your itineraries at a discount, with the caveat that you be ready for anything? Like old, cheap standby tickets where you just went wherever there was an empty seat that day.

Or how about an AI Spotify for way less money than current Spotify. It's not competing on quality, it can't. Occasionally you'll hear weird artifacts, but hey it's way cheaper.

That could work, imo

We've had good, free (non ai) media recommendation tools in the past and they got killed by licensing agreements.

AI is creating a post-scarcity content economy where quality is going to be the only driver of value.

If you are the rights holder of any premium human created media content you are not going to let a 'cheap' AI tool get access to recommend it out to people.

The AI travel agent is trivial to solve though. It's the same as the human travel agent. Put the plan and pricing together, then give it to the user to sign and accept. Do it in an app, do it in an email, do it on a piece of paper, whatever floats your boat, but give them something they can review and accept instead of trying to do everything verbally or in a basic chat interface.

I'm not disagreeing with the "needs to work deterministically" -- there is a need for that, but this is a poor example. "Hey robot, plan a trip to Mexico" might still save me time overall if done right, and that has value.

It just needs to beat all the other non-deterministic processes at accuracy.

Call centre workers are often dreadfully inaccurate as well. Same with support engineers.

Heck even for banking, there are enormous teams fixing every screw up made by some other employee.

I have a question for folks working heavily with AI blackboxes related to this - what are methods that companies use to test the quality of outputs? Testing the integration itself can be treated pretty much the same as testing around any third-party service, but what I've seen are some teams using models to test the output quality of models... which doesn't seem great instinctively
Take this with a grain of salt because I haven't done it myself, but I would treat this the same as testing anything that uses some element of randomness.

Say you're writing a random number generator that generates numbers between 0 and 100. How would you test it? Throw your hands up in the air and say nope, can't test it, it's not deterministic? Or maybe you can just run it 1000 times and make sure all the numbers are indeed between 0 and 100. Maybe count up the number frequencies and verify they're roughly uniform. There are lots of things you can check for.
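
Something like this, as a sketch in Python. `check_rng` and its tolerance are my own choices; I've also bumped the sample count well past 1000 so that each of the 101 possible values is expected about 1000 times and the frequency check isn't drowned out by noise.

```python
import random
from collections import Counter


def check_rng(generate, n=101_000, lo=0, hi=100, tolerance=0.2):
    """Check that every sample is in [lo, hi] and frequencies are roughly uniform."""
    samples = [generate() for _ in range(n)]
    if not all(lo <= s <= hi for s in samples):
        return False
    counts = Counter(samples)
    expected = n / (hi - lo + 1)
    # Every value must appear within +/- tolerance of its expected count
    return all(
        abs(counts[v] - expected) <= tolerance * expected
        for v in range(lo, hi + 1)
    )


rng = random.Random(0)  # fixed seed so the test run is repeatable
ok = check_rng(lambda: rng.randint(0, 100))
```

A generator that always returns 50 passes the range check but fails the frequency check, which is why you want both.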

So do the same with your LLMs. Test it on your specific use-cases. Do some basic smoke tests. Are you asking it yes or no questions? Is it responding with yes or no? Try some of your prompts on it, get a feel for what it outputs, write some regexes to verify the outputs stay sane when there's a model upgrade.
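
A sketch of that kind of smoke test for the yes/no case, in Python. `fake_model` is a hypothetical stand-in for whatever model call you actually make, and the regex is just one reasonable definition of "a bare yes or no".

```python
import re

# Accept "yes" / "no" in any case, with optional trailing "." or "!" and whitespace
YES_NO = re.compile(r"^\s*(yes|no)\b[.!]?\s*$", re.IGNORECASE)


def is_yes_no(reply):
    """True if the reply is nothing but a yes or a no."""
    return YES_NO.match(reply) is not None


def smoke_test(ask_model, prompts, runs=5):
    """Ask each prompt several times; return (prompt, reply) pairs that aren't yes/no."""
    failures = []
    for prompt in prompts:
        for _ in range(runs):
            reply = ask_model(prompt)
            if not is_yes_no(reply):
                failures.append((prompt, reply))
    return failures


def fake_model(prompt):
    # Hypothetical stand-in for a real model call
    return "Yes." if "Paris" in prompt else "It depends on what you mean."
```

Running `smoke_test(fake_model, your_prompts)` again after every model upgrade gives you a list of the prompts whose output format drifted.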

For "quality" I don't think there's a substitute for humans. Just try it. If the outputs feel good, add your unit tests. If you want to get scientific, do blind tests with different models and have humans rate them.

But a knowledgeable human can take the itinerary and run with it. I know I’ve done that with code enough from AI generated stuff, it’s basically boilerplate. You still run it through the same tests, reviews, and verification as you would have had to do anyway.
And yet, generative AI also seems to be poor at randomness. When I asked Google Gemini for a list of 50 random words, it gave me a list of 18 unique words, with 16 of them repeated exactly 3 times.

Abyss: 1, Ambiguous: 3, Cacophony: 3, Crescendo: 3, Ephemeral: 3, Ethereal: 3, Euphoria: 3, Labyrinth: 3, Maverick: 3, Melancholy: 3, Mellifluous: 3, Nostalgia: 3, Oblivion: 3, Paradox: 3, Quixotic: 1, Serendipity: 3, Sublime: 3, Zenith: 3

Randomness is difficult. I wouldn't expect any LLM to be able to reliably produce random anything, except in the cases where they have access to tools (ChatGPT Code Interpreter could use Python's random.random() for example).
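
For comparison, a sketch of what the tool route looks like in plain Python. The helper names are mine; `random.sample` draws without replacement, so repeats are impossible by construction, and `Counter` is the easy way to tally repeats in a list like the one above.

```python
import random
from collections import Counter


def duplicate_counts(words):
    """Return {word: count} for every word that appears more than once."""
    return {w: c for w, c in Counter(words).items() if c > 1}


def pick_unique(vocabulary, k, seed=None):
    """Pick k distinct words; sampling without replacement cannot repeat a word."""
    rng = random.Random(seed)
    return rng.sample(sorted(set(vocabulary)), k)
```

`pick_unique` raises `ValueError` if you ask for more words than the vocabulary contains, which is the honest answer in that case.
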
Okay, but repeating words?
Makes sense to me: It's not understanding and complying with your request, it's doing statistics to guess what kind of follow-up would be most common.
Are you using regular or pro? As pro has no issues with this task.
Regular user. For useful tasks, how much of a difference does the Pro plan make?
Nowhere near as good as ChatGPT 4o or Claude (in not one case have I had it outperform those other two), but at least it can do math and data science correctly most of the time compared to the regular model.

I use it as a secondary when the other two are chewing on other tasks already.

I only own it as I am an outrageously heavy consumer of LLMs for all sorts of little projects at once and they all seem to pause one window if you use another.