by cs702 1124 days ago
Quoting Rich Sutton, who wrote the perfect response some years ago:

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation."[a]

The very smart folks at the Alan Turing Institute are learning firsthand how bitter the lesson can be.

---

[a] http://incompleteideas.net/IncIdeas/BitterLesson.html


I think the problem is actually worse than the article implies, because there are two things being leveraged here: compute and data.

Short-term improvements made by domain-specific AI result in better outputs than more general AIs, ceteris paribus. But these better outputs can then later be fed back into more powerful general purpose AIs, and consuming the data*compute product from the domain-specific models is a very effective way to train domain-specific behavior.

Today, we see this in reverse – people are training smaller models based on outputs from GPT-4. However, I expect that we'll start to see more and more training going the opposite direction in the future: Domain-specific generative models will be used to build scenarios for large general-purpose AIs to train against.

Here's a concrete example – image diffusion models are really bad at physics, so you can't tell one to draw a person upside down, because it's not well-represented in the dataset, and if you force it to with something like ControlNet you typically get a disfigured and horrific image. So obviously diffusion models are not the best long-term solution for image generation. But how do you get this concept of "upside down" into an AI model? Well, maybe you add some kind of neat segmentation technique that involves using several diffusion models and rotating and stitching together their outputs. Great, you made an upside-down generator.

Now, you generate 100,000 images of "upside down" people, and the next advance in image generation AI can come along and learn that concept with ease thanks to the larger data set that it has.

So it's not just that "more compute wins", it's more like: not only does more compute win, it wins even more because short-term improvements feed directly into the data pipeline that enables it to win.

Along that line re Moore's Law, the biggest clear advantage the US has in AI is Nvidia, AMD, and Intel (obviously Nvidia in particular up to this point, although AMD and Intel are producing some potent GPUs).

The reason the US was able to pull off a leap forward via OpenAI or LLaMA is that it has Nvidia as basically a national treasure. But it's an integrated whole: the US has all the components necessary and the ecosystem that produces all the components (including talent, thought process, money, start-ups, pay scale (to lure talent)).

The Europeans have never lacked for the brains side of being able to do it, certainly. Until they fill out the rest of the ecosystem they won't be able to really compete in AI (they'll lag far behind, with lots of blaming and empty promises of big government projects). And China is its own worst enemy these days; all we need is for Xi to remain in power indefinitely and he'll throttle their potential as a global competitor.

The US is close to locking up another round in the tech wars, riding on the same approach that has served it so well since WW2. Hopefully our do-something legislators stay hands-off long enough (i.e., don't snatch defeat from the jaws of victory).

And data!! Don't forget that all the large accumulators of raw data available for commercially supported research are in the US. A brutal combo of all three: hardware, data, and competence available in one country. China has two of the three, and until recently had free access to hardware, too. Europeans lack data and hardware.
DeepMind was literally founded in the UK and is headquartered in London.

Commercially supported research is (in my opinion) relatively OK in Europe, it's just that the big US tech companies are so big they'll just buy you if you do something sufficiently interesting.

> including talent, thought process, money, start-ups, pay scale (to lure talent)

...and a culture of risk-taking and ambition to change the world.

Yeah I tried to cover that with the start-ups implication. The US has that in spades in regards to aggressively funding start-ups by the zillions when a new tech inflection hits (whether software, Web, mobile, cloud and now AI). The VCs always go overboard, which is ideal (billions in destroyed capital is meaningless compared to producing the next Nvidia, Google, Amazon, Microsoft, etc).
Couldn't agree more!
But Nvidia’s GPUs are sold all over the world … how is that an advantage to the Americans?
Going forward the top AI chips will not be available in every country — particularly China.
At least in the field of computer vision, there seems to be lots of algorithmic progress too. The algorithms improve every 9 months by an amount equivalent to a doubling of compute budget.

https://epochai.org/blog/revisiting-algorithmic-progress
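
To make the 9-month figure concrete, here is a rough back-of-the-envelope sketch. It assumes the doubling compounds smoothly over time, and the 24-month hardware doubling period is a placeholder picked for illustration, not a number from the linked post.

```python
ALGO_DOUBLING_MONTHS = 9.0       # figure quoted above for computer vision
HARDWARE_DOUBLING_MONTHS = 24.0  # assumption, illustrative only


def effective_compute_multiplier(months: float, doubling_period_months: float) -> float:
    """Multiplier after `months` if a quantity doubles every `doubling_period_months`."""
    return 2.0 ** (months / doubling_period_months)


def combined_multiplier(months: float) -> float:
    """Algorithmic and hardware gains multiply when both compound independently."""
    algo = effective_compute_multiplier(months, ALGO_DOUBLING_MONTHS)
    hardware = effective_compute_multiplier(months, HARDWARE_DOUBLING_MONTHS)
    return algo * hardware


for years in (1, 3, 6):
    months = 12 * years
    print(
        f"{years} yr: algorithms x{effective_compute_multiplier(months, ALGO_DOUBLING_MONTHS):.1f}, "
        f"combined x{combined_multiplier(months):.1f}"
    )
```

Under these assumptions the algorithmic term dominates, which is the point of the comment above: a 9-month doubling compounds faster than the assumed hardware trend.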

The bitter lesson isn't really "algorithms bad", "don't try different approaches", "don't innovate" or "only work on models with massive compute".

The heart of the bitter lesson is "don't try to codify "insight" into the process". It's basically the age old "you don't know what you don't know".

The Transformer is kind of a perfect example. It boasts algorithmic improvements over RNNs, and LLMs are by far the best-performing take on language modelling ever. And yet the architecture itself has basically no breakthrough from understanding language itself. It's an improvement over standard RNNs but not really because of any newfound insight or implementation on language itself.

Basically trying to cram human high level instincts/insights into the process of solving a problem doesn't work better than giving a general architecture tons of data and letting it figure that all out by itself.

> The heart of the bitter lesson is "don't try to codify "insight" into the process".

This is exactly right and what a lot of people get wrong. Sutton isn't saying that you can't have constraints in your network either. He also isn't saying "no need to learn math", which is a far too common interpretation I've seen. It isn't just data and scale, algorithms are critical too. Just don't force aspects like Gabor filters, symmetry, etc. This doesn't mean works like geometric deep learning are dead (AlphaFold even uses it!). The reason not to force insights is that they sometimes don't hold in high dimensions and sometimes our assumptions are wrong. It can also limit the path to reach the optimal/desired solution even if the optimal solution has those constraints. But I am specifically saying "force" because we can hint and we are always using some human insight.

I'd argue it's even "you don't know what you do know." We cannot codify what we don't understand, and while we understand and can verbalize some parts of our thinking, others, maybe even the great majority, are hidden from us. We just get a feeling.
LLMs do use human "insight" into language in how they require tokenized inputs and outputs.

It’s one of those insights that seems obvious after the fact but really wasn’t.

That could count I suppose but I don't think that's really the kind of insight Sutton is alluding to in his original writing. Insight in this case would be more like shoehorning one of the processes humans would use to solve the problem. There are no innate grammar rules the architecture looks to before each attempt, no tree or word search. Things like that.

Polishing the input in that way is neat but it's not like you can't go character or word level for a transformer. The current way is just far more compute efficient but the Transformer will figure out the seq to seq all the same.
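
A rough sketch of why the granularity matters for compute: self-attention compares every position with every other position, so the cost grows roughly with the square of the sequence length. The whitespace split below is only a crude stand-in for a real subword tokenizer, and the function names are made up for illustration.

```python
def char_tokens(text: str) -> list:
    """Character-level 'tokenization': one token per character."""
    return list(text)


def word_tokens(text: str) -> list:
    """Crude word-level split on whitespace (stand-in for a subword tokenizer)."""
    return text.split()


def attention_pairs(n_tokens: int) -> int:
    """Number of position pairs full self-attention scores for one sequence."""
    return n_tokens * n_tokens


sentence = "the transformer will figure out the sequence all the same"
n_chars = len(char_tokens(sentence))
n_words = len(word_tokens(sentence))

print("char-level tokens:", n_chars, "attention pairs:", attention_pairs(n_chars))
print("word-level tokens:", n_words, "attention pairs:", attention_pairs(n_words))
```

The same text is representable either way; the coarser units just give the model far fewer positions to attend over.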

It doesn’t just polish the input. Tokenizing the output also significantly reduces the risk of gibberish, especially if you do a grammar pass to ensure tense matches etc. It means a model with a much worse understanding of the language can perform better than something operating on raw characters.
Fair, I didn't mean to dismiss the impact of tokenization as such.

But tokenization is still a process that's figured out by another DL model. Human "insight" doesn't produce the tokenization as it is. Another model trained on [insert language(s)] text figures out how best to break sentences into token parts.

That said, these things are a spectrum. I don't think, "no tips from biology whatsoever" or "no constraints at all" is really what Sutton had in mind. The less of it the better is the general idea.

> Basically trying to cram human high level instincts/insights into the process of solving a problem doesn't work better than giving a general architecture tons of data and letting it figure that all out by itself.

Hi, programmer from outside ML here. You might be able to answer something I've been wondering about.

I do remember things like NLTK and logical inference many years ago. I understand the current tech is all large language models and (as you put it) the model figures out the rules.

Sometimes I get responses from ChatGPT that seem like they wouldn't pass logical inference. I will think "all the foos aren't capable of X, bar is an instance of foo, stop suggesting bar to do X". Is there room for old-school logical inference as a kind of sanity-check layer on top of LLMs?
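
To make the question concrete, here is a minimal sketch of what such a sanity-check layer might look like. Everything in it is hypothetical: the `KnowledgeBase` class, the rule format, and the list standing in for model output are invented for illustration and are not how any real LLM product works.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    # class name -> capabilities that no member of the class has
    incapable: dict = field(default_factory=dict)
    # instance name -> class name
    instance_of: dict = field(default_factory=dict)

    def can_do(self, instance: str, capability: str) -> bool:
        """Return False only when a rule explicitly forbids it; unknowns pass."""
        cls = self.instance_of.get(instance)
        return capability not in self.incapable.get(cls, set())


def filter_suggestions(kb: KnowledgeBase, suggestions: list, capability: str) -> list:
    """Drop suggestions that contradict the hand-written rules."""
    return [s for s in suggestions if kb.can_do(s, capability)]


# "All the foos aren't capable of X, bar is an instance of foo."
kb = KnowledgeBase(
    incapable={"foo": {"X"}},
    instance_of={"bar": "foo", "baz": "qux"},
)

llm_suggestions = ["bar", "baz"]  # stand-in for whatever the model proposed
print(filter_suggestions(kb, llm_suggestions, "X"))
```

The check only rejects what the rules explicitly rule out, so it would catch the "stop suggesting bar" case but would say nothing about facts nobody wrote down.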

I wonder if they'll end up with specialized subunits for different processing tasks, like the old "lizard brain" model with the neocortex on top of other layers:

https://en.wikipedia.org/wiki/Triune_brain

Nothing wrong with that at all. Could be a viable solution for specific use-cases. But for now, most researchers will focus on innately improving those abilities. Right now that would mostly be by increasing scale (data or parameter size), highly curated data for the specific deficiency, or work on making transformers scale more efficiently. After all, GPT-4 is much better at logical reasoning than 3.5 and we still haven't hit a functional limit on scaling transformers.
But "don't try to codify 'insight' into the process" seems to suggest "don't try different approaches". I'm not sure how people can at once trot out the "Bitter Lesson" and interpret it as it is written, but still say "We're not saying not to think about new approaches".

Is the idea then to work only on methods that allow for faster compute of more data?

FWIW, the Transformer works faster on current methods of parallelisation, allowing for dramatic scaling that RNNs will find hard to compete on. But we do pay for that in terms of what can be computed (https://arxiv.org/pdf/2207.00729.pdf - TL;DR: Transformers are limited in the types of programs/functions they can compute because of parallelism).

Scaling, ironically, does seem to be the 'direction of steepest descent' in terms of what will bring the best performance (for now). Gradient descent does find pleasant local optima that may keep us happy for a while.

As far as approach is concerned, all the bitter lesson advises against is trying to shoehorn human high level processes into the architecture. There's still plenty of room for different approaches outside of just faster compute.

CNNs and Transformers are very different. Both can be used for computer vision. The bitter lesson wouldn't stop you from switching from one to the other.

The scope of "what to try" is large, so we (as a community) should prioritise things that we think would work. If the criterion is not only "faster compute", it would seem "things that mimic human high level processes" would be a good candidate.

We started with MLPs then CNNs were invented, and that brought on pretty large gains. Arguably CNNs are architectures inspired by "human high level processes".

Edit: I will say though, this is a new take on the nuance of "Bitter Lesson" that I've never heard, though even this interpretation I find to be strangely contradictory for the reasons above.

>it would seem "things that mimic human high level processes" would be a good candidate.

That's the natural intuition yes. But I believe Sutton's point is that this very intuition seems to prove itself wrong in the long term.

The way I see it, the problem with the high level is that we don't actually know shit. If we knew so completely what it took to model language or vision in the first place, we wouldn't need deep learning at all.

It seems intuitive that trying to bake in some basic grammar rules might speed things up along.

Problem with that is that we often end up overfitting the models to those specific rules and constraints, limiting its ability to generalize and learn more complex and underlying patterns and structures in language. Patterns that we don't actually know of.

The low level processes result in the high level performance but not vice versa.

It's said that one human neuron is equivalent to a CNN. I wouldn't really call the operations of neurons high level though.

Bingo!

That is the bitter lesson.

Thank you for posting this here!

Also, while it gets lost in the foundation model stuff, a major trend in computer vision is toward smaller, high quality datasets. Arguably CV had its V1 LLM moment years ago with models trained on ImageNet, which produced amazing general results but weren't good enough for much specific stuff.

If you look at what, e.g. Andrew Ng was talking about last year, there was a big emphasis on "small data" and getting good datasets.

Funny plot twist: the pioneer of leveraging computation for neural networks is actually British: Geoffrey Hinton, living in Canada.

Btw, Rich Sutton was born in the U.S. but renounced his American citizenship, becoming Canadian.

And a MORE funny story is (according to a coworker of mine, whose PhD supervisor was a friend of Hinton) that when Hinton was looking for a university to conduct his work in, he was rejected by the UK universities that were his first choices. So he ended up in Canada.

So the plot twist comes with a bit of irony!

Yes, Hinton was on a temporary position at the University of Sussex (IIRC, the Centre for Cognitive Science) for a while, but was not offered a permanent academic position there when he applied.
Also, the history of the underlying advances is a lot more international than the current popular telling of the history lets on. See e.g. https://people.idsia.ch/~juergen/scientific-integrity-turing...
As someone with no knowledge of the fields of machine learning and artificial intelligence, I wonder how much of recent AI stuff is due to "true" Moore's Law (GPUs and such getting much faster/cheaper), and how much is due to the data version of Moore's Law (web-scale data farming/storage to "teach" LLMs and such).
Making use of more data requires more compute (e.g., longer training, more powerful hardware, or both).
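
As a rough illustration: a widely used rule of thumb from the scaling-law literature (not something stated in this thread) puts training cost at about 6 FLOPs per parameter per training token. Treat the constant and the example model and dataset sizes below as assumptions.

```python
def training_flops(n_params: float, n_tokens: float, flops_per_param_token: float = 6.0) -> float:
    """Approximate training compute; the factor of 6 is a rule of thumb, not exact."""
    return flops_per_param_token * n_params * n_tokens


# Hypothetical sizes: same model, ten times more training data.
base = training_flops(n_params=1e9, n_tokens=20e9)
more_data = training_flops(n_params=1e9, n_tokens=200e9)

print(f"base run:      {base:.2e} FLOPs")
print(f"10x more data: {more_data:.2e} FLOPs ({more_data / base:.0f}x the compute)")
```

Under this approximation, compute scales linearly with the number of tokens, so the "data version" of the trend cannot be exploited without the compute version.
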
At this point I'm not sure progress is due to Moore's Law as originally stated (cheaper compute) so much as to companies just spending more on compute. The effect is the same for now, but with a clear limit.
Let’s be honest, though—very, very few people expected large language models to be so ungodly effective.
Perhaps many many many more people would have made the bet on LLM versus NFTs or virtue signalling with “Data as an instrument of coloniality: A panel discussion on digital and data colonialism”, right?
Is this really true?
The EU+UK need a project like the Large Hadron Collider but for AI: develop a really, really large computational infrastructure that allows researchers to run AI experiments with technology that may be 20 or 30 years away from being commoditized.
They are doing that for quantum computing. But 20 years is about right.

It seems like the Netherlands completely missed the boat on LLMs, too… but I don’t blame them. I just hope they pivot quickly.