| HN Mirror



	by adw 827 days ago
	Yeah, all of those architectures are _themselves_ hacks to get around having insufficient compute! They absolutely were encoding inductive biases into the network to get around not being able to train enough, and transformers (handwaving hard enough to levitate, the currently-trainable model family with the least inductive bias) have eaten the world in all domains. This is evidence _for_ the Bitter Lesson, not against it.


YeGoblynQueenne 827 days ago

They haven't (eaten the world etc). They just happen to be the models that trend hard right now. I bet if you could compare like for like you'd be able to see some improvement in performance from Transformers, but that'd be extremely hard to separate from the expected improvement from the constantly increasing amounts of data and compute. For example, you could, today, train a much bigger and deeper Multi-Layered Perceptron than you could thirty years ago, but nobody is trying because that's so 1990s, and in any case they have the data and compute to train much bigger, much more inefficient (contrary to what you say if I got that right) architectures.

Wait a few years and the Next Big Thing in AI will come along, hot on the heels of the next generation of GPUs, or tensor units or whatever the hardware industry can cook up to sell shovels for the gold rush. By then, Transformers will have hit the plateau of diminishing returns, there'll be gold in them there other hills and nobody will talk of LLMs anymore because that's so 2020s. We've been there so many times before.

adw 827 days ago

> much more inefficient

The tricky part here is that "efficiency" is not a single dimension! Transformers are much more "efficient" in one sense, in that they appear to be able to absorb much more data before they saturate; they're in general less computationally efficient in that you can't exploit symmetries as hard, for example, at implementation time.

Let's talk about that in terms of a concrete example: the big inductive bias of CNNs for vision problems is that CNNs essentially presuppose that the model should be translation-invariant. This works great — speeds up training and makes it more stable – until it doesn't and that inductive bias starts limiting your performance, which is in the large-data limit.
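A minimal sketch of that inductive bias, assuming a 1D signal with wrap-around edges (the signal, kernel, and helper names are made up for illustration). Strictly speaking, the convolution itself is translation-equivariant (shift the input and the output shifts with it), and invariance only shows up once you pool over positions:

```python
def circular_conv1d(signal, kernel):
    """Slide `kernel` over `signal`, wrapping around at the edges."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, k):
    """Circularly shift `signal` to the right by k positions."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


# Hypothetical toy data, integers so the comparison is exact.
signal = [0, 1, 3, 2, 0, 0, 0, 0]
kernel = [1, -1, 2]

conv_then_shift = shift(circular_conv1d(signal, kernel), 3)
shift_then_conv = circular_conv1d(shift(signal, 3), kernel)

# Equivariance: convolving a shifted input gives the shifted output.
equivariant = conv_then_shift == shift_then_conv

# Invariance after global max-pooling: the pooled value ignores the shift.
invariant = max(circular_conv1d(signal, kernel)) == max(shift_then_conv)

print(equivariant, invariant)
```

The same kernel weights are reused at every position, which is exactly the assumption being baked in: the network never has to learn from data that a feature looks the same wherever it appears.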

Fully-connected NNs are more general than transformers, but they have _so many_ degrees of freedom that the numerical optimization problem is impractical. If someone figures out how to stabilize that training and make these implementable on current or future hardware, you're absolutely right that you'll see people use them. I don't think transformers are magic; you're entirely correct in saying that they're the current knee on the implementability/trainability curve, and that can easily shift given different unit economics.
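To put a rough number on "so many degrees of freedom", here is a back-of-the-envelope count for a single layer. The layer sizes are hypothetical, and the constrained comparison is a 3x3 convolution rather than a transformer block, only because its parameter count is easy to write down:

```python
# Hypothetical layer sizes, chosen only for illustration.
height, width = 224, 224
in_channels, out_channels = 3, 64
kernel_size = 3

inputs = height * width * in_channels
outputs = height * width * out_channels

# Fully connected: every output unit has its own weight for every input,
# plus one bias per output unit.
dense_params = inputs * outputs + outputs

# Convolution: one small kernel per (input channel, output channel) pair,
# shared across all positions, plus one bias per output channel.
conv_params = kernel_size * kernel_size * in_channels * out_channels + out_channels

print(f"dense: {dense_params:,} parameters")
print(f"conv:  {conv_params:,} parameters")
print(f"ratio: {dense_params / conv_params:,.0f}x")
```

The gap of many orders of magnitude is the point: the unconstrained layer can represent everything the constrained one can, but the optimizer has vastly more directions to get lost in.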

I think one of the fundamental disconnects here is that people who come at AI from the perspective of logic down think of things very differently to people like me who come at it from thermodynamics _up_.

Modern machine learning is just "applications of maximum entropy", and to someone with a thermodynamics background, that's intuitively obvious (not necessarily correct! just obvious) – in a meaningful sense the _universe_ is a process of gradient descent, so "of course" the answer for some local domain models is maximum-entropy too. In that world view, the higher-order structure is _entirely emergent_. I'm, by training, a crystallographer, so the idea that you can get highly regular structure emerging from merciless application of a single principle is just baked into my worldview very deeply.
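One small, checkable instance of the maximum-entropy idea, as a sketch under toy assumptions (three made-up energy levels, beta = 1): the Boltzmann distribution, which is the same functional form as a softmax over negative energies, has higher entropy than nearby distributions with the same normalization and the same mean energy.

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def boltzmann(energies, beta):
    """Distribution with p_i proportional to exp(-beta * E_i)."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


def mean_energy(p, energies):
    return sum(x * e for x, e in zip(p, energies))


# Hypothetical three-level system.
energies = [0.0, 1.0, 2.0]
p = boltzmann(energies, beta=1.0)

# This direction sums to zero and is orthogonal to `energies`, so moving
# along it keeps both the normalization and the mean energy fixed.
direction = [1.0, -2.0, 1.0]


def perturb(p, eps):
    return [x + eps * d for x, d in zip(p, direction)]


candidates = [perturb(p, eps) for eps in (-0.01, 0.01)]

for q in candidates:
    print(f"entropy {entropy(q):.6f} vs Boltzmann {entropy(p):.6f}")
```

Both perturbed distributions satisfy the same constraints yet come out with lower entropy, which is the sense in which the Boltzmann form is "the" maximum-entropy answer for a fixed mean energy.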

Someone who comes at things from the perspective of mathematical logic is going to find that worldview very weird, I suspect.

YeGoblynQueenne 827 days ago

>> Let's talk about that in terms of a concrete example: the big inductive bias of CNNs for vision problems is that CNNs essentially presuppose that the model should be translation-invariant. This works great — speeds up training and makes it more stable – until it doesn't and that inductive bias starts limiting your performance, which is in the large-data limit.

I don't know about that, I'll be honest. Do you have a reference? I suspect it won't disagree with what I'm saying, that neural nets just can't use strong enough bias to avoid overfitting. I didn't say that in so many words, above, but that's the point of having a good inductive bias, that you're not left, as a learner, to the mercy of the data.

>> Someone who comes at things from the perspective of mathematical logic is going to find that worldview very weird, I suspect.

No that's absolutely a standard assumption in logic :) Think of grammars; like Chomsky likes to say, human language "makes infinite use of finite means" (quoting Wilhelm von Humboldt). Chomsky of course believes that human language is the result of a simple set of rules, very much like logical theories. Personally, I have no idea, but Chomsky consistently and even today pisses off all the linguists and all the machine learning people, so he must be doing something right.

Btw, I'm not coming from the perspective of mathematical logic, only. It's complicated, but, e.g. my MSc was in data science and my PhD in a symbolic form of machine learning. See, learning and logic, or learning and reasoning, are not incompatible, they're fundamentally the same.

adw 827 days ago

> They haven't (eaten the world etc).

To clarify what I mean on this specific bit: the SOTA results in 2D and 3D vision, audio, translation, NLP, etc. are all transformers. Past results do not necessarily predict future performance, and it would be absurd to claim that this is an immutable state of affairs, but it's certainly interesting that all of the domain-specific architectures have been flattened in a very short period of time.

YeGoblynQueenne 827 days ago

Thanks for clarifying. Well, my argument is that the state of the art is more the result of trends in research than of the true capabilities of different approaches.

Take my little rant about Rich Sutton's (a god, btw) Bitter Lesson with respect to RL. So, there's AlphaGo, AlphaZero and μZero, yes? AlphaGo knows the rules of Go and starts with some expert knowledge, and beats every human Go player. AlphaZero knows the rules of Go but has no expert knowledge and it beats AlphaGo. And μZero neither knows the rules of Go, nor has expert knowledge, and it beats AlphaZero, and can also play chess, shogi and Atari games, with one hand while eating a banana. Do you know how hard it is to eat a banana with one hand? Unpeeled!

Easy to draw a conclusion from that. Except all those systems were developed and used by DeepMind, and there are very few entities besides DeepMind that can even train them, so all we know is what DeepMind claims and we have no way to check their claims. For example, can I test different configurations of μZero, with and without knowledge of the rules of the game and expert knowledge? Not really. And it's clear to me that DeepMind are pushing very, very hard a form of AI that relies on having gigantic resources, like the ones they just completely coincidentally happen to be among the few entities to have access to. So I remain unconvinced.

(I need to re-read the μZero paper, it's in my pdf buffer. I didn't get it the first time I read it, and it might well be that they did make sufficient ablation studies to convince even me and I just don't remember it).
