by wg0 740 days ago
Actually, in just three to five years, lots of "AI boxes" and those magical sparkling icons next to input fields that summon AI will be silently removed.

LLMs are not accurate; they aren't subject matter experts that'll maybe be within a 5% error margin.

People will gradually learn and discover this, and the cost of keeping a model updated and running won't drop drastically, so we'll most likely see the dust settle.

7 comments

I truly hope the reckless enthusiasm for LLMs will cool down, but it seems plausible that discretized, compressed versions of today's cutting-edge models will eventually be able to run entirely locally, even on mobile devices; there are no guarantees that they'll get better, but many promising opportunities to get the same unreliable results faster and with less power consumption. Once the models run on-device, there's less of a financial motivation to pull the plug, so we could be stuck with them in one form or another for the long haul.
I don't believe this scenario to be very likely because a lot of the 'magic' in current LLMs (emphasis on 'large') is derived from the size of the training datasets and amount of compute they can throw at training and inference.
Llama 3 8B captures that 'magic' fairly well and runs on a modest gaming PC. You can even run it on an iPhone 15 if you're willing to sacrifice floating-point precision. Three years from now I fully expect GPT-4-quality models running locally on an iPhone.
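For a rough sense of why giving up precision matters here, a back-of-the-envelope sketch of the weight storage for an 8-billion-parameter model at a few bit widths. The round 8e9 parameter count and the helper name weight_memory_gib are assumptions for illustration, and the figure covers weights only, not the KV cache or activations.

```python
def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate storage for the model weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumed round figure of 8 billion parameters (hypothetical, for illustration).
N_PARAMS = 8e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Halving the bits per weight halves the footprint, which is the whole appeal of quantization on memory-constrained devices; what it costs in output quality is a separate question this arithmetic says nothing about.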
Three years is more than twice the time from GPT-4's release to now, and almost twice as long as ChatGPT has existed. At this rate, even if we end up with GPT-4 equivalents runnable on consumer hardware, the top models made available by big players via API will make local LLMs feel useless. For the time being, the incentive to use a service will continue.

It's like a graphic designer being limited to choosing between local MS Paint and Adobe Creative Cloud. Okay, so Llama 3 8B, if it's really as good as you say, graduates to local Paint.NET. Not useless per se, but still not even in the same class.

No one knows how it will all shake out. I'm personally skeptical scaling laws will hold beyond GPT-4-sized models. GPT-4 is likely severely undertrained given how much data Facebook is using to train their 8B parameter models. Unless OpenAI has a dramatic new algorithmic discovery or a vast trove of previously unused data, I think GPT-5 and beyond will be modest improvements.

Alternatively synthetic data might drive the next generation of models, but that's largely untested at this point.

The one thing people overlook is the user data on ChatGPT. That's OpenAI's real moat. That data is "free" RLHF data and possibly, training data.
I know this isn’t really the point, but Adobe CC hasn’t really improved all that much from Adobe CS, which was purely local and perfectly capable. A better analogy might be found in comparing Encyclopaedia Britannica to Wikipedia. The latter is far from perfect, but an astounding expansion of accessible human knowledge that represents a full, worldwide paradigm shift in how such information is maintained, distributed, and accessed.

By the same token, those of us who are sufficiently motivated can maintain and utilize a local copy of Wikipedia…frequently for training LLMs at this point, so I guess the snake has come around, and we’ve settled into a full-on ouroboros of digital media hype. ;-)

They're extremely pessimistic, 3 years is 200% of how long it took ChatGPT 3.5.

Llama 8B is ChatGPT 3.5 (18 months before Llama 3), running on all new iPhones released since October 2022 (19 months before Llama 3). That includes multimodal variants (built outside Facebook).

Just imagine if you had an accurately curated dataset.
I just want to sit down in front of my TV, put on my Bluetooth headphones and have the headphones and TV connect automatically.

Then, when I’m downstairs in my office and want to listen to music on my iPhone, I want my headphones to connect to my iPhone and not my TV upstairs!

I don’t need Skynet, I just need my devices to be a little less stupid.

I would consider that akin to magic at this point. Let’s start there and work our way up to handing over control of our nuclear arsenal.

The University of Washington is studying an AI application where a pair of headphones will isolate a single voice in a crowd when one simply looks at them. Amazing stuff…until you try it anywhere near your car, and then it starts playing the voice over your car stereo (presumably).
> People will gradually learn and discover this, and the cost of keeping a model updated and running won't drop drastically, so we'll most likely see the dust settle.

As mentioned elsewhere, 3 to 5 years is some 3x to 5x as long as GPT-4 has existed, and some 2-3x as long as it has been since ChatGPT launched and LLMs suddenly graduated from being obscure research projects to general-purpose tools. Do you really believe the capability limit has already been hit?

Not to mention, there's lots of money and reputation invested in searching for alternatives to the current transformer architecture. Are you certain that within the next year or two, one or more of the alternatives won't pan out, bringing e.g. linear scaling in place of quadratic, without loss of capabilities?
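To make the linear-versus-quadratic point concrete, a toy sketch that counts token-pair interactions for full self-attention (which grows with the square of the sequence length) against a hypothetical architecture doing a constant amount of work per token. Both function names are made up for illustration, and no real alternative architecture is being modelled here.

```python
def quadratic_cost(seq_len: int) -> int:
    """Full self-attention: every token attends to every token."""
    return seq_len * seq_len


def linear_cost(seq_len: int) -> int:
    """Hypothetical alternative: constant work per token (an assumption)."""
    return seq_len


for n in (1_000, 10_000, 100_000):
    ratio = quadratic_cost(n) // linear_cost(n)
    print(f"seq_len={n:>7}: quadratic is {ratio:,}x the linear cost")
```

The gap between the two is the sequence length itself, so the potential payoff of a linear-scaling architecture grows with context size. Whether any such architecture keeps the capabilities is exactly the open question.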

I'm pretty sure that the statistical foundations of AI, where a thing that is just 0.004 shy of a threshold value in a million-dimensional space can get miscategorized as something else, will not deliver AGI, or any usable and reliable AI for that matter, beyond sequence-to-sequence mapping (voice to text, text to voice, etc.) applications.

As for money and reputation, there was a lot of both behind gold-making in medieval times too, and look where that led.

Scientific optimism is a thinking distortion and a fallacy too.

Tool seems like a strong term for whatever ChatGPT is right now. Absurdly overhyped curiosity? Insanely overengineered autocorrect? Dystopian MadLibs? Wall Street Wank Sock?

I’m not trying to downplay its potential, but I don’t know of anyone who trusts it enough for what I’d consider “tooling”.

> LLMs are not accurate; they aren't subject matter experts that'll maybe be within a 5% error margin.

You're asserting that the AI features will be removed in 3 to 5 years because they're not accurate enough today, but you actually need them to remain inaccurate in 3 years' time for your prediction to be correct.

That seems unlikely. I agree that people will start to realize the cost, but the accuracy will improve, so people might be willing to pay.

The same argument can be used for Tesla full self driving: basically it has to be (nearly) perfect, and after years of development, it's not there yet. What's different about LLMs?
They don't have to be perfect to be useful, and death isn't the price of being wrong.
Death actually can be the price of being wrong. Just wait for someone to do the wrong thing with an AI tool they weren't supposed to use for what they were doing, and the AI to spit out the worst possible "hallucination" (in terms of outcome).
What you say is true, however with self-driving cars death, personal injury, and property damage are much more immediate, much more visible, and many of the errors are of a kind where most people are qualified to immediately understand what the machine did wrong.

An LLM that gives you a detailed plan for removing a stubborn stain in your toilet that involves mixing the wrong combination of drain cleaners and accidentally releasing chlorine, is going to happen if it hasn't already, but a lot of people will read about this and go "oh, I didn't know you could gas yourself like that" and then continue to ask the same model for recipes or Norwegian wedding poetry because "what could possibly go wrong?"

And if you wonder how anyone can possibly read about such a story and react that way, remember that Yann LeCun says this kind of thing despite (a) working for Facebook and (b) Facebook's algorithm getting flak not only for the current teen depression epidemic, but also from the UN for not doing enough to stop the (ongoing) genocide in Myanmar.

It's a cognitive blind spot of some kind. Plenty smart, still can't recognise the connection.

Google’s recent AI assistant has already been documented recommending people mix bleach and white vinegar for cleaning purposes.

Someone’s going to accidentally kill themselves based on an AI hallucination soon if no one has already.

There are hundreds of companies making LLMs we can choose from, and the switching cost is low. There's only one company that can make self-driving software for Tesla. Basically, competition should lead to improvements.
Tesla aren't the only people trying to make self-driving cars; famously Uber tried, and Waymo looks like they're slowly succeeding. Competition can be useful, but it's not a panacea.
Mercedes seems to be eating Tesla’s breakfast on FSD, in particular where safety and real-world implementation are concerned. Their self-driving vehicles are equipped with aqua-colored lights to alert other drivers that the car is being controlled by computer, and Mercedes has chosen to honor its liability for incidents/accidents.
In Europe, yes; with Level 3 in particular, Mercedes takes on the liability.

In the US it's different, because FSD in the US has nothing to do with the capabilities of FSD in Europe (which is some sort of glorified driver assist), and it can clearly navigate many streets.

Mercedes in the US is very limited.

GPT-4 is 1 year old; 3.5 is 1 and a half. Before 3.5, this wasn't really a useful technology. 7 years ago it was a research project that Google saw no value in pursuing.
Anyone claiming that accuracy of AI models WILL improve is either unaware of how they really work or is a snake oil salesman.

Forget about a model that knows EVERYTHING. Let's just train a model that is an expert not in all the law of the United States, but in just one state, and not even that: one that just understands FULLY the tax law of one state, to the extent that whatever documents you throw at it, it beats a tax consultancy firm every single time.

If even that were possible, OpenAI et al. would be playing this game differently.

Why does a mobile app need to beat a highly trained professional every single time in order to be useful?

Is this standard applied to any other app?

Those use cases are never sold as "Mobile apps", but rather as "enterprise solutions", that cost the equivalent of several employees.

An employee can be held accountable, and fired easily. An AI? You'll have to talk to the Account Manager, and sit through their attempts to 'retain' you.

Because it's taxation. Financial well-being is at stake. We're even looking at potential jail time for tax fraud, tax evasion, and whatnot.

My app is powered by GTPChatChat, the model beating all artificially curated benchmarks.

Still wanna buy?

This is one of those "perfect is the enemy of good" situations. Sure, for things where you have a legal responsibility to get things perfectly right, using an LLM as the full solution is probably a bad idea (although lots of accountants are using them to speed up processes already; they just check outputs). That isn't the case for 99% of tasks, though. Something that's mostly accurate is good. People are happy with that, and they will buy it.
My experience suggests that LLMs become not less accurate, but less helpful.

Two years ago they output a solution for my query [1] right away; now they try to engage the user in implementing that thing. This is across the board, as far as I can see.

These LLMs are not about helping anyone, their goals are engagement and mining data for that engagement.

[1] The query is "implement blocked clause decomposition in haskell." There are papers (circa 2010-2012), there are implementations, but not in Haskell. BCD itself is easy, and can be expressed in a dozen or two lines of Haskell code.
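For readers who haven't met the term, a minimal sketch of what such a decomposition involves, written in Python rather than the Haskell the query asks for. It uses a simple variable-by-variable polarity split (sometimes called pure decomposition), which is not necessarily the variant the commenter had in mind. Clauses are lists of non-zero integers, DIMACS-style, and are assumed to be non-empty and non-tautological; all function names are made up for illustration.

```python
def is_blocked(clause, formula):
    """A clause is blocked if, for some literal l in it, every resolvent
    on l with a clause containing -l is a tautology."""
    for lit in clause:
        partners = [d for d in formula if -lit in d]
        if all(any(-k in d for k in clause if k != lit) for d in partners):
            return True
    return False


def is_blocked_set(formula):
    """True if repeatedly deleting blocked clauses empties the formula."""
    remaining = list(formula)
    while remaining:
        blocked = next((c for c in remaining if is_blocked(c, remaining)), None)
        if blocked is None:
            return False
        remaining.remove(blocked)
    return True


def pure_decomposition(formula):
    """Split the formula into two blocked sets, one variable at a time.

    Assumes non-empty, non-tautological clauses."""
    remaining = [frozenset(c) for c in formula]
    left, right = [], []
    while remaining:
        var = abs(next(iter(remaining[0])))
        pos = [c for c in remaining if var in c]
        neg = [c for c in remaining if -var in c]
        # Each side only ever sees one polarity of var, so that literal
        # is pure there and the clauses carrying it are trivially blocked.
        bigger, smaller = (pos, neg) if len(pos) >= len(neg) else (neg, pos)
        left.extend(bigger)
        right.extend(smaller)
        remaining = [c for c in remaining if var not in c and -var not in c]
    return left, right


formula = [[1, 2], [-1, 2], [-2, 3], [-3, -1], [1, -3]]
left, right = pure_decomposition(formula)
print(is_blocked_set(left), is_blocked_set(right))
```

The split is cheap but tends to be unbalanced; the harder part of the problem is making one of the two sets as large as possible, which this sketch does not attempt.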

> These LLMs are not about helping anyone, their goals are engagement and mining data for that engagement.

Wow, this is a really interesting idea! A sneaky play for LLM providers is to be helpful enough to still be used, but also sufficiently unhelpful that your users give you additional training data.

This is obvious in retrospect - instead of making LLMs work better, LLMs' handlers invented various techniques to make LLMs look like they work better; one such example is summarization. Next-gen LLMs then get trained on that data.

Now instead of having some answer right away, the user has to engage in discussion, which increases the cost that is sunk into the work with LLMs.

I don't think that's really what Apple is going to do with it, though; it's not going to be for factual question-and-answer stuff. It will be used more like a personal assistant: what's on my calendar this week, who is the last person who called me, etc. I think it will more likely be an LLM in the background that uses tools to query iCloud and such, i.e., making Siri actually useful.
How do you define a percent error margin on the typical output of something like ChatGPT? IIRC the image generation folks have started using metrics like subjective user ratings because this stuff is really difficult to quantify objectively.
IMHO the terribly overlooked issue with generative AI is that the end users' views of the response generated by the LLM often differ greatly from the opinion of the person actually interacting with the model.

this is particularly evident with image generation, but I think it's true across the board. for example, you may think something I created on midjourney "looks amazing", whereas I may dislike it because it's so far from what I had in mind and was actually trying to accomplish when I was sending in my prompt

Your last paragraph is true regardless of how the image was generated.

Others can find anything YOU produce to have different qualities than you see in it.

True, but generally what art I produce IRL is objectively terrible, whereas I can come up with some pretty nice looking images on Midjourney.... which are still terrible to me when I wanted them to look like something else, but others may find them appealing because they don't know how I've failed at my objective

In other words, there are two different objectives in a "drawing": (1) portraying that which I meant to portray and (2) making it aesthetically appealing

People who only see the finished product may be impressed by #2 and never consider how bad I was at #1

Right now they're basically an improved search engine, but they aren't solving the hard problem of making money.

Had Google become a utility and frozen its search engine half a decade or more in the past, we would actually have something you could add AI on top of and come out with an improved product.

As it stands, capitalism isn't going to fix GIGO with AI

> LLMs are not accurate; they aren't subject matter experts that'll maybe be within a 5% error margin.

The Gell-Mann amnesia effect suggests people will have a very hard time noticing the difference. Even if the models never improve, they're more accurate than a lot of newspaper reporting.

> People will gradually learn and discover this, and the cost of keeping a model updated and running won't drop drastically, so we'll most likely see the dust settle.

So, you're betting on no significant cost reduction of compute hardware? Seems implausible to me.

> …they’re more accurate than a lot of newspaper reporting.

Is that when they’re cribbing straight out of the newspaper pages, or is this just a cynical snipe at the poor state of media that, not for nothing, tech companies have had a fair hand in kneecapping?

The criticism of the performance of newspapers goes back well before Lovelace and Babbage:

"""I will add, that the man who never looks into a newspaper is better informed than he who reads them; inasmuch as he who knows nothing is nearer to truth than he whose mind is filled with falsehoods & errors. He who reads nothing will still learn the great facts, and the details are all false."""

- Thomas Jefferson (not Mark Twain), 1807, https://www.snopes.com/fact-check/mark-twain-read-newspaper-...

> So, you're betting on no significant cost reduction of compute hardware? Seems implausible to me.

This is not about compute, but about data.

https://arxiv.org/abs/2404.04125

"...our study reveals an exponential need for training data which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found."