by luplex 1132 days ago
At least in the field of computer vision, there seems to be lots of algorithmic progress too. The algorithms improve every 9 months by an amount equivalent to a doubling of compute budget.

https://epochai.org/blog/revisiting-algorithmic-progress
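To get a feel for what a 9-month doubling means, here is a rough back-of-the-envelope sketch. It assumes the doubling time holds steadily as a clean exponential, which is an idealization, and the function name is made up for illustration.

```python
def effective_compute_multiplier(months: float, doubling_months: float = 9.0) -> float:
    """Compute-equivalent gain from algorithmic progress alone.

    Assumes progress is a steady exponential with the given doubling time.
    """
    return 2.0 ** (months / doubling_months)


for years in (1, 3, 5):
    gain = effective_compute_multiplier(12 * years)
    print(f"{years} year(s): ~{gain:.1f}x")
```

Under that assumption, three years of algorithmic progress alone would be worth about as much as a 16x larger compute budget.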

The bitter lesson isn't really "algorithms bad", "don't try different approaches", "don't innovate" or "only work on models with massive compute".

The heart of the bitter lesson is "don't try to codify "insight" into the process". It's basically the age old "you don't know what you don't know".

The Transformer is kind of a perfect example. It boasts algorithmic improvements over RNNs, and LLMs are by far the best performing take on language modelling ever. And yet the architecture itself contains basically no breakthrough that came from understanding language. It's an improvement over standard RNNs, but not really because of any newfound insight into language itself.

Basically trying to cram human high level instincts/insights into the process of solving a problem doesn't work better than giving a general architecture tons of data and letting it figure that all out by itself.

> The heart of the bitter lesson is "don't try to codify "insight" into the process".

This is exactly right and what a lot of people get wrong. Sutton isn't saying that you can't have constraints in your network either. He also isn't saying "no need to learn math", which is a far too common interpretation I've seen. It isn't just data and scale, algorithms are critical too. Just don't force aspects like Gabor filters, symmetry, etc. This doesn't mean works like geometric deep learning are dead (AlphaFold even uses it!). The reason not to force insights is that they sometimes don't hold in high dimensions and sometimes our assumptions are wrong. Forcing them can also limit the path to the optimal/desired solution even if the optimal solution has those constraints. But I am specifically saying "force" because we can hint and we are always using some human insight.

I'd argue it's even "you don't know what you do know." We cannot codify what we don't understand, and while we understand and can verbalize some parts of our thinking, others, maybe even the great majority, are hidden from us. We just get a feeling.

LLMs do use human "insight" into language in how they require tokenized inputs and outputs.

It’s one of those insights that seems obvious after the fact but really wasn’t.

That could count I suppose but I don't think that's really the kind of insight Sutton is alluding to in his original writing. Insight in this case would be more like shoehorning one of the processes humans would use to solve the problem. There are no innate grammar rules the architecture looks to before each attempt, no tree or word search. Things like that.

Polishing the input in that way is neat but it's not like you can't go character or word level for a transformer. The current way is just far more compute efficient but the Transformer will figure out the seq to seq all the same.
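A quick sketch of why the granularity matters for compute. Plain whitespace splitting stands in for a real subword tokenizer here, which is an assumption made only to keep the example self-contained, and the sample sentence is made up.

```python
text = "the transformer will figure out the sequence all the same"

char_tokens = list(text)     # one token per character
word_tokens = text.split()   # one token per whitespace-separated word

print(len(char_tokens), len(word_tokens))

# Self-attention compares every position with every other position, so
# its cost grows roughly with the square of the sequence length.
ratio = (len(char_tokens) / len(word_tokens)) ** 2
print(f"~{ratio:.0f}x more pairwise comparisons at character level")
```

The same text becomes a much longer sequence at character level, which is where the extra compute goes.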

It doesn’t just polish the input. Tokenizing the output also significantly reduces the risk of gibberish, especially if you do a grammar pass to ensure tense matches etc. It means a model with a much worse understanding of the language can perform better than something operating on raw characters.

Fair, I didn't mean to dismiss the impact of tokenization as such.

But tokenization is still a process that's figured out by another DL model. Human "insight" doesn't produce the tokenization as it stands. Another model trained on [insert language(s)] text figures out how best to break sentences into token parts.

That said, these things are a spectrum. I don't think, "no tips from biology whatsoever" or "no constraints at all" is really what Sutton had in mind. The less of it the better is the general idea.

Good point. I find it really reminiscent of how AlphaZero ignored essentially all human knowledge about chess play, but still depended on insights into chess AI / search algorithms.

I think of deep neural networks as replicating long term memory/reflex rather than thought. I don’t know if that’s quite it, but they excel at a lot of very difficult AI problems when paired with just a tiny bit of handholding. Some of that might go away with even more compute, but I think approaching AGI is going to take more than just even more compute.

> Basically trying to cram human high level instincts/insights into the process of solving a problem doesn't work better than giving a general architecture tons of data and letting it figure that all out by itself.

Hi, programmer from outside ML here. You might be able to answer something I've been wondering about.

I do remember things like NLTK and logical inference many years ago. I understand the current tech is all large language models and (as you put it) the model figures out the rules.

Sometimes I get responses from ChatGPT that seem like they wouldn't pass logical inference. I will think "all the foos aren't capable of X, bar is an instance of foo, stop suggesting bar to do X". Is there room for old-school logical inference as a kind of sanity-check layer on top of LLMs?
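For concreteness, the kind of sanity-check layer being asked about could look something like the sketch below. Everything in it is hypothetical: the rule tables, the names, and the assumption that the LLM's suggestions have already been parsed into a plain list.

```python
# Hypothetical knowledge base: which class each thing belongs to, and
# which capabilities a whole class is known to lack.
INSTANCE_OF = {"bar": "foo", "baz": "qux"}
CANNOT = {"foo": {"X"}}


def violates_rules(candidate: str, task: str) -> bool:
    """True if suggesting `candidate` for `task` contradicts a known rule."""
    kind = INSTANCE_OF.get(candidate)
    return task in CANNOT.get(kind, set())


def filter_suggestions(suggestions: list[str], task: str) -> list[str]:
    """Drop suggestions (e.g. parsed from an LLM reply) that break a rule."""
    return [s for s in suggestions if not violates_rules(s, task)]


print(filter_suggestions(["bar", "baz"], "X"))  # ['baz']
```

The hard part in practice is not the inference step but getting reliable facts and rules into the tables in the first place.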

I wonder if they'll end up with specialized subunits for different processing tasks, like the old "lizard brain" model with the neocortex on top of other layers:

https://en.wikipedia.org/wiki/Triune_brain

Nothing wrong with that at all. Could be a viable solution for specific use-cases. But for now, most researchers will focus on innately improving those abilities. Right now that would mostly be by increasing scale (data or parameter count), curating data for the specific deficiency, or making transformers scale more efficiently. After all, GPT-4 is much better at logical reasoning than 3.5 and we still haven't hit a functional limit on scaling transformers.

But "don't try to codify 'insight' into the process" seems to suggest "don't try different approaches". I'm not sure how people can at once trot out the "Bitter Lesson" and interpret it as it is written, but still say "We're not saying not to think about new approaches".

Is the idea then to work only on methods that allow for faster compute of more data?

FWIW, the Transformer works faster on current methods of parallelisation, allowing for dramatic scaling that RNNs will find hard to compete with. But we do pay for that in terms of what can be computed (https://arxiv.org/pdf/2207.00729.pdf - TL;DR: Transformers are limited in the types of programs/functions they can compute because of parallelism).

Scaling, ironically, does seem to be the 'direction of steepest descent' in terms of what will bring the best performance (for now). Gradient descent does find pleasant local optima that may keep us happy for a while.

As far as approach is concerned, all the bitter lesson advises against is trying to shoehorn human high level processes into the architecture. There's still plenty of room for different approaches outside of just faster compute.

CNNs and Transformers are very different. Both can be used for computer vision. The bitter lesson wouldn't stop you from switching from one to the other.

The scope of "what to try" is large, so we (as a community) should prioritise things that we think would work. If the criterion is not only "faster compute", it would seem "things that mimic human high level processes" would be a good candidate.

We started with MLPs then CNNs were invented, and that brought on pretty large gains. Arguably CNNs are architectures inspired by "human high level processes".
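To make the size of that architectural prior concrete, here is a rough parameter count for one fully connected layer versus one convolutional layer over the same image. The sizes are illustrative and not taken from the thread. What the convolution builds in is locality and weight sharing.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv2d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights plus biases of a 2D convolution; independent of image size."""
    return c_in * c_out * kernel * kernel + c_out


h, w, c_in, c_out = 32, 32, 3, 16   # illustrative sizes, not from the thread

mlp = dense_params(h * w * c_in, h * w * c_out)
cnn = conv2d_params(c_in, c_out, kernel=3)
print(f"dense: {mlp:,} parameters")
print(f"conv:  {cnn:,} parameters")
```

Whether that counts as a "human high level process" or just a cheap structural hint is pretty much the disagreement in this thread.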

Edit: I will say though, this is a new take on the nuance of "Bitter Lesson" that I've never heard, though even this interpretation I find to be strangely contradictory for the reasons above.

>it would seem "things that mimic human high level processes" would be a good candidate.

That's the natural intuition yes. But I believe Sutton's point is that this very intuition seems to prove itself wrong in the long term.

The way I see it, the problem with the high level is that we don't actually know shit. If we knew so completely what it took to model language or vision in the first place, we wouldn't need deep learning at all.

It seems intuitive that trying to bake in some basic grammar rules might speed things along.

Problem with that is that we often end up overfitting the models to those specific rules and constraints, limiting their ability to generalize and learn the more complex underlying patterns and structures in language. Patterns that we don't actually know of.

The low level processes result in the high level performance but not vice versa.

It's said that one human neuron is equivalent to a CNN. I wouldn't really call the operations of neurons high level though.

Right. So where I end up on this, given the examples of intuitions that DO work, is that it's always the _right_ level of prior knowledge that's needed. The intuitions on language (encoding basic grammar) didn't pan out, but the one for vision did (CNNs). What further levels of intuition could we use to improve even the large language models?

That, of course, requires experimentation. If it's not speeding up scaling (of course this should be done), and it's not mimicking human cognition (Bitter Lesson says no), what do you decide to try? I guess I'm missing what other heuristics there are to use here.

Just looking at the current state of where NLP is going: Prompt engineering and its various 'step-by-step' siblings are all pretty high-level human cognition motivated to me. Shouldn't that go against the bitter lesson as well?

"The Bitter Lesson" feels like an article that was written at a time when the intuitions that went into deep learning had become commonplace, and scaling things up got a lot of leverage out of the 'insights' that came before. Once the returns have diminished to a point of saturation, the 'insights' will likely once again be useful, until methods to scale catch up once again, and "The Bitter Lesson 2.0" will be making the rounds.

Bingo!

That is the bitter lesson.

Thank you for posting this here!

Also, while it gets lost in the foundation model stuff, a major trend in computer vision is toward smaller, high quality datasets. Arguably CV had its V1 LLM moment years ago with models trained on ImageNet, which produced amazing general results but weren't good enough for much specific stuff.

If you look at what, e.g. Andrew Ng was talking about last year, there was a big emphasis on "small data" and getting good datasets.