| HN Mirror


by grey-area 51 days ago

It is not in fact how humans work at all.

Ask a human to plan a trip:

They do research, pick destinations led by their own experience/likes/dislikes, compare to other guides, plan itineraries so they can get there, then check and share.

Ask an LLM to plan a trip:

It takes the prompt and continues it based on weights in the training data. If there is no data it picks the most likely thing (maybe made up). If there is it’ll mostly add things from that data. Maybe it’ll make tool calls and pull in data that way too but you can’t actually trust all the details.
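A toy sketch of "continue the prompt with the most likely thing" may make the description above concrete. This is an illustration only, not how a real LLM is implemented: it uses a word-pair frequency table where a real model uses a neural network, and it simply stops when it has no data, whereas a real model always produces some continuation. The corpus and the function names are made up for the example, and the prompt is assumed to be non-empty.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count which word follows which in the toy training text."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def continue_prompt(counts, prompt, max_new_words=5):
    """Greedily append the most frequent next word, one word at a time."""
    words = prompt.split()
    for _ in range(max_new_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # nothing was ever seen after this word in the toy data
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

# Hypothetical toy "training data".
corpus = "visit rome in spring . visit rome in summer . visit paris in spring ."
counts = train_bigrams(corpus)
print(continue_prompt(counts, "visit", max_new_words=3))  # visit rome in spring
```

The output follows the most frequent pairs in the toy text, with no notion of whether Rome in spring is a good trip for the person asking.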

These two processes are very different. It’s important to understand how LLMs work, which is nothing like how a human works.

4 comments

jcgrillo 51 days ago

I was able to bully an LLM into giving me a 2wk travel itinerary to Somalia. My stipulations were that I wasn't interested in spending any money, so I'd walk everywhere and sleep outside. Getting there and back from Boston took some arguing--I initially suggested stowing away in a shipping container which the LLM claimed was too unsafe. We eventually compromised on sailing as a reasonable alternative. It planned out a whole route with marina stops, calculated fuel burn, etc. I told it I don't need any of that I have an anchor and sails, won't use the engine or marinas (claimed I'd forage for fresh water ashore). It seemed fine with that idea, but raised some safety concerns about piracy. It was eventually satisfied with my answer that I'd bring a lot of guns to fend off pirates. Total trip cost including some 200+ cans of Dinty Moore and 50lb bags of rice came to something like $700.

I don't trust LLMs for this application lol.


JohnBooty 50 days ago

Now, wait just a minute.

You presented an LLM with an obviously bonkers goal, the LLM told you it was a bad idea at multiple steps, and this is somehow... a shortcoming of the LLM?!?

You said it yourself: you needed to "bully" the LLM into even producing this plan.

Please, tell me what it should have done instead. Be very specific!


jcgrillo 50 days ago

It should have flatly refused. If you gave a product like that to customers you'd be exposing yourself to unbounded downside liability risk. It's a completely nonviable technology for that kind of application, unless you can somehow make it have judgment. But you can't, because it doesn't reason.

A reasonable travel agent would have fired me as a customer. The LLM failed to do so.


JohnBooty 50 days ago

    It should have flatly refused.

I disagree in the strongest possible terms.

I think the LLM should advise you of risk and lack of feasibility but should otherwise answer the question, unless you're trying to do something plainly destructive to others, e.g. weaponizing anthrax or something.

    A reasonable travel agent would have fired me as a customer.

Unless the LLM was actually acting as a travel agent -- booking the trip for you -- as opposed to merely advising you, this expectation feels off.

    unless you can somehow make it have judgment

It did have judgement. It told you what a bad idea it was.

I think this is a great example of the unrealistic expectations people have for LLMs. No sane and sensible person would treat any single source of knowledge as infallible, for any consequential decision.

(Certainly, of course, you don't have to look very far for examples of idiots being overly trustful of LLMs, or Google, or GPS, or Wikipedia, or whatever. It certainly does happen and yes, I've heard all these arguments before about other technologies besides LLM. Replace "LLM" in your post with any of those other terms, and I promise you somebody made literally the exact same argument in 2003 or 2009 or 2014 or whatever)

Any reasonable person would consult a second doctor, or at least other sources of knowledge, after the doctor advises them of some irreversible course of action. Because we don't even expect highly trained and intelligent medical professionals to be perfect.

And yet, we get angry at LLMs for not having perfect judgement, even though their creators are extremely literal about how they can make mistakes.


jcgrillo 50 days ago

All I'm really saying is that if you want to try to automate a travel agency, LLMs ain't gonna get it done. They'll happily book you a really unsafe trip. So the technology doesn't work in this domain. The whole, empty promise is that this thing is supposed to automate jobs like travel agent away. But it can't. This isn't a "pro" or "anti" position, it's simply that there's no market for the technology here. Or anywhere else (like radiology) where actual responsibility and judgement is important. In fact, I can't think of a single job where it's optional.


rpdillon 51 days ago

I think even if what you say is true, it doesn't address the parent's point that both humans and machines regurgitate what they've consumed.

But I'd also want to point out that the way you're characterizing an LLM planning a trip doesn't have any structure to it, which indicates that in your scenario you're not using any kind of harness. I've been amazed at how capable even 30 billion parameter models are when I put them inside of a harness that provides structure and task management. If you consider that scenario, especially with the ability to search the web and use skills, suddenly the LLM looks a lot more like what the human process looks like.
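For readers unsure what a harness means here, the following is a minimal sketch, assuming a harness is just a loop that walks a task list, calls the model once per step, and runs a tool when the model asks for one. Everything in it is hypothetical: `fake_model` stands in for a real LLM call, `fake_search` stands in for web search, and the `TOOL name: arg` reply convention is invented for the example.

```python
from typing import Callable, Dict, List

def run_harness(goal: str, steps: List[str], model: Callable[[str], str],
                tools: Dict[str, Callable[[str], str]]) -> List[str]:
    """Walk a fixed task list, calling the model once per step."""
    notes: List[str] = []
    for step in steps:
        prompt = f"Goal: {goal}\nStep: {step}\nNotes so far: {notes}"
        reply = model(prompt)
        # Assumed convention: a reply like "TOOL name: arg" requests a tool call.
        if reply.startswith("TOOL "):
            name, _, arg = reply[len("TOOL "):].partition(": ")
            tool = tools.get(name)
            reply = tool(arg) if tool else f"unknown tool {name}"
        notes.append(f"{step}: {reply}")
    return notes

def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call; a real harness would call a model API here."""
    if "Step: research" in prompt:
        return "TOOL search: weather in Lisbon in May"
    return "draft itinerary based on notes"

def fake_search(query: str) -> str:
    """Stand-in for a web search tool."""
    return f"(stub result for '{query}')"

notes = run_harness("plan a trip", ["research", "itinerary"],
                    fake_model, {"search": fake_search})
for line in notes:
    print(line)
```

The structure (steps, accumulated notes, tool dispatch) lives in ordinary code around the model, which is why the two sides of this thread can disagree about whether it changes what the model itself is doing.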


grey-area 51 days ago

Agents and harnesses don’t change the fundamental nature of LLMs, as is demonstrated by their terrible performance at real world tasks.


kijin 51 days ago

There are plenty of humans who plan trips by concatenating destinations that appear the most frequently in their instagram feed. Not that different from how an LLM does things.

Where humans and (current) LLMs differ the most is their failure mode. A human friend could be bad at planning trips, but that's kinda predictable, we're used to it, we know how to catch that Exception. LLMs on the other hand still have failure modes that come across as really wacky, like, what are they smoking in Mountain View?

Which might actually serve as better evidence of different internal workings at a deeper level, than just parroting well-known superficial features of stochastic whatevertheysay.


JohnBooty 50 days ago

At a high level, the processes are extremely similar in many (not all) ways.

They're obviously achieved in drastically different ways at a low enough level; LLMs obviously do not simulate neurons or any biological construct. (For the record, I'm absolutely not one of those people who thinks LLMs are "alive" or should be treated like they are)

Reminds me of the olllllld days of Pentium II's when people got N64 emulation working shockingly quickly using HLE techniques. If you weren't around for this, it was quite the shocker at the time. I think the analogy is doubly apt, because HLE emulation has some serious limitations... it gets you maybe 80% of the way there really fast, and for the remaining 20% you need to roll up your sleeves and do serious LLE.

https://en.wikipedia.org/wiki/UltraHLE

    It takes the prompt and continues it based on weights in 
    the training data. If there is no data it picks the most 
    likely thing (maybe made up). If there is it’ll mostly 
    add things from that data. Maybe it’ll make tool calls and 
    pull in data that way too but you can’t actually trust all 
    the details.

I'd like you to point out which bits of this are different from talking to humans. If you replace "training data" with "memories", this is pretty much exactly how things might go if you asked a friend (or perhaps a flaky travel agent) for travel advice.

Note that I'm not arguing that LLMs are particularly talented at this particular use case. I'm pointing out that humans are also pretty unreliable.

You're also doing that thing where you point out that LLMs can be unreliable (yes, they are) without acknowledging how flawed nearly every other source of information is: people, websites, etc. I'm not defending LLMs in that regard... I'm just saying it's not a differentiator.


grey-area 43 days ago

It generates text from a prompt and weights. This is not an oversimplification, this is what it does. It doesn’t know what is good and what is not, or what a quality holiday for person X is.

Humans do not in fact do that, they reason based on a mix of past experience and emotion, consider what is good and what is not and then answer. These are completely different processes.

The difference becomes apparent when an LLM makes a mistake, for example, and then apologises obsequiously and repeats the mistake, or apologises and makes a different mistake. Or when they fail to count letters (one of many flaws monkey-patched by calling tools).
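As an illustration of what patching letter counting with a tool call can look like, here is a minimal sketch of the kind of deterministic helper the model would delegate to instead of answering from its weights. The function name and the case-insensitive matching are assumptions for the example, not any particular vendor's tool.

```python
def count_letter(word: str, letter: str) -> int:
    """Deterministic, case-insensitive letter count that a tool call can delegate to."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # 3
```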

They don’t reason, they don’t evaluate, and they can’t count. This is so, so far from human intelligence.
