

	by RyanShook 96 days ago
	Here's where I'm missing understanding: for decades the idea of neural networks had existed with minimal attention. Then in 2017 Attention Is All You Need gets released and since then there is an exponential explosion in deep learning. I understand that deep learning is accelerated by GPUs but the concept of a transformer could have been used on much slower hardware much earlier.

13 comments

pash 96 days ago

The inflection point was 2012, when AlexNet [0], a deep convolutional neural net, achieved a step-change improvement in the ImageNet classification competition.

After seeing AlexNet’s results, all of the major ML imaging labs switched to deep CNNs, and other approaches almost completely disappeared from SOTA imaging competitions. Over the next few years, deep neural networks took over in other ML domains as well.

The conventional wisdom is that it was the combination of (1) exponentially more compute than in earlier eras with (2) exponentially larger, high-quality datasets (e.g., the curated and hand-labeled ImageNet set) that finally allowed deep neural networks to shine.

The development of “attention” was particularly valuable in learning complex relationships among somewhat freely ordered sequential data like text, but I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning. The “bitter lesson” [1] is that more compute and more data eventually beats better models that don’t scale.

Consider this: humans have on the order of 10^11 neurons in their body, dogs have 10^9, and mice have 10^7. What jumps out at me about those numbers is that they’re all big. Even a mouse needs hundreds of millions of neurons to do what a mouse does.

Intelligence, even of a limited sort, seems to emerge only after crossing a high threshold of compute capacity. Probably this has to do with the need for a lot of parameters to deal with the intrinsic complexity of a complex learning environment. (Mice and men both exist in the same physical reality.)

On the other hand, we know many simple techniques with low parameter counts that work well (or are even proved to be optimal) on simple or stylized problems. “Learning” and “intelligence”, in the way we use the words, tends to imply a complex environment, and complexity by its nature requires a large number of parameters to model.

0. https://en.wikipedia.org/wiki/AlexNet

1. https://en.wikipedia.org/wiki/Bitter_lesson

musebox35 96 days ago

Thanks for posting a thorough and accurate summary of the historical picture. I think it is important to know the past trajectory to extrapolate to the future correctly.

For a bit more context: Before 2012 most approaches were based on hand-crafted features + SVMs that achieved state of the art performance on academic competitions such as Pascal VOC, and neural nets were not competitive on the surface. Around 2010 Fei-Fei Li of Stanford University collected a comparatively large dataset and launched the ImageNet competition. AlexNet cut the error rate by half in 2012, leading major labs to switch to deeper neural nets. The success seems to be a combination of a large enough dataset + GPUs to make training time reasonable. The architecture is a scaled version of Yann LeCun's ConvNets, tying into the bitter lesson that scaling is more important than complexity.

coppsilgold 96 days ago

Comparing Deep Learning with neuroscience may turn out to be erroneous. They may be orthogonal.

The brain likely has more in common with Reservoir Computing (sans the actual learning algorithm) than Deep Learning.

Deep Learning relies on end to end loss optimization, something which is much more powerful than anything the brain can be doing. But the end-to-end limitation is restricting, credit assignment is a big problem.

Consider how crazy the generative diffusion models are, we generate the output in its entirety with a fixed number of steps - the complexity of the output is irrelevant. If only we could train a model to just use Photoshop directly, but we can't.

Interestingly, there are some attempts at a middle ground where a variable number of continuous variables describe an image: <https://visual-gen.github.io/semanticist/>

jvanderbot 96 days ago

If you think a 2 year old is doing deep learning, you're probably wrong. But if you think natural selection was providing end to end loss optimization, you might be closer to right. An _awful lot_ of our brain structure and connectivity is born, vs learned, and that goes for Mice and Men.

ACCount37 95 days ago

Why not both? A pre-trained LLM has an awful lot of structure, and during SFT, we're still doing deep learning to teach it further. Innate structure doesn't preclude deep learning at all.

There's an entire line of work that goes "brain is trying to approximate backprop with local rules, poorly", with some interesting findings to back it.

Now, it seems unlikely that the brain has a single neat "loss function" that could account for all of learning behaviors across it. But that doesn't preclude deep learning either. If the brain's "loss" is an interplay of many local and global objectives of varying complexity, it can be still a deep learning system at its core. Still doing a form of gradient descent, with non-backpropagation credit assignment and all. Just not the kind of deep learning system any sane engineer would design.

imtringued 95 days ago

I don't know what you mean by end to end loss optimization in particular, but if you mean something that involves global propagation of errors e.g. backpropagation you are dead wrong.

Predictive coding is more biologically plausible because it uses local information from neighbouring neurons only.

sdenton4 95 days ago

By end to end loss optimization, they mean evolution: Try a thing, and see if it dies or reproduces more. Repeat until moon landing.

ACCount37 95 days ago

Modern systems like Nano Banana 2 and ChatGPT Images 2.0 are very close to "just use Photoshop directly" in concept, if not in execution.

They seem to use an agentic LLM with image inputs and outputs to produce, verify, refine and compose visual artifacts. Those operations appear to be learned functions, however, not an external tool like Photoshop.

This allows for "variable depth" in practice. Composition uses previous images, which may have been generated from scratch, or from previous images.

roenxi 95 days ago

> If only we could train a model to just use Photoshop directly, but we can't.

It is probably coming, I get the impression - just from following the trend of the progress - that internal world models are the hardest part. I was playing with Gemma 4 and it seemed to have a remarkable amount of trouble with the idea of going from its house to another house, collecting something and returning; starting part-way through where it was already at house #2. It figured it out but it seemed to be working very hard with the concept to a degree that was really a bit comical.

It looks like that issue is solving itself as text & image models start to unify and they get more video-based data that makes the object-oriented nature of physical reality obvious. Understanding spatial layouts seems like it might be a prerequisite to being able to consistently set up a scene in Photoshop. It is a bit weird that it seems pulling an image fully formed from the aether is statistically easier than putting it together piece by piece.

vunderba 95 days ago

> If only we could train a model to just use Photoshop directly, but we can't.

They're obviously more general purpose but LLMs can also be used to drive external graphics programs. A relatively popular one is Blender MCP [1], which lets an LLM control Blender to build and scaffold out 3D models.

[1] - https://github.com/ahujasid/blender-mcp

antonvs 95 days ago

> If only we could train a model to just use Photoshop directly, but we can't.

What kind of sadist would wish this on an intelligent entity?

sdenton4 95 days ago

Yeah, that's how you get skynet.

cdavid 95 days ago

Indeed. I would add a third factor to compute and datasets: the lego-like aspect of NN that enabled scalable OSS DL frameworks.

I did some ML in the mid 2000s, and it was a PITA to reuse other people's code (when available at all). You had some well known libraries for SVM, for HMM you had to use HTK that had a weird license, and otherwise looking at experiments required you to reimplement stuff yourself.

Late 2000s had a lot of practical innovation that democratized ML: Theano and then tf/keras/pytorch for DL, scikit-learn for ML, etc. That ended up being important because you need a lot of tricks to make this work on top of a "textbook" implementation. E.g. if you implement the EM algo for GMM, you need to do it in log space to avoid underflow, and the same goes for DL (Glorot and co. initialization, etc.).
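To make the log-space trick concrete, here is a minimal sketch of the E-step for a one-dimensional Gaussian mixture. The function names and the example numbers are illustrative assumptions, not taken from any particular library; real implementations handle many points and dimensions at once.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian, computed without ever forming the density."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def logsumexp(values):
    """log(sum(exp(v))) with the maximum factored out, so the largest term is exp(0)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def responsibilities(x, weights, means, variances):
    """E-step for one point: posterior probability of each mixture component."""
    log_joint = [math.log(w) + log_gaussian(x, mu, var)
                 for w, mu, var in zip(weights, means, variances)]
    log_norm = logsumexp(log_joint)
    return [math.exp(lj - log_norm) for lj in log_joint]

# A point far from both components (hypothetical numbers): the naive
# densities both underflow to 0.0, so normalising them would be 0/0.
x = 100.0
weights, means, variances = [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]
naive = [w * math.exp(log_gaussian(x, mu, var))
         for w, mu, var in zip(weights, means, variances)]
resp = responsibilities(x, weights, means, variances)
```

In the naive version both weighted densities are exactly zero in double precision, while the log-space version still assigns almost all of the responsibility to the nearer component.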

jesseab 95 days ago

Remember watching Alec Radford's Theano tutorial and feeling like I had found literal gold.

alasdair_ 95 days ago

I think your post may have more acronyms than any other post I have ever read on hn. Do you have a guide to which specific things you are talking about with each acronym? Deep Learning and Machine Learning are obvious but some of the others I can’t follow at all - they could be so many different things.

AgentMatt 95 days ago

NN - neural networks
OSS DL frameworks - open source deep learning frameworks

PITA - pain in the ass

SVM - support vector machines
HMM - hidden Markov model
EM - expectation maximization
GMM - Gaussian mixture model
HTK - hidden Markov model tool kit

ButlerianJihad 95 days ago

I think he maintains pinball machines and jukeboxes for a chain of Greek restaurants

cdavid 94 days ago

fair, somebody else clarified already !

Sohakes 96 days ago

> but I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning.

I feel like you are downplaying the importance of architecture. I never read the bitter lesson, but I have always heard it described more as a comment on embedding knowledge into models instead of making them just scale with data. We know algorithmic improvement is very important to scale NNs (see https://www.semanticscholar.org/paper/Measuring-the-Algorith...). You can't scale an architecture that has catastrophic forgetting embedded in it. It is not really a matter of tradeoffs, some are really worse in all aspects. What I agree with is just that architectures that scale better with data and compute do better. And sure, you can say that smaller architectures are better for smaller problems, but then the framing with the bitter lesson makes less sense.

hodgehog11 96 days ago

> Intelligence, even of a limited sort, seems to emerge only after crossing a high threshold of compute capacity. Probably this has to do with the need for a lot of parameters to deal with the intrinsic complexity of a complex learning environment.

Real intelligence deals with information over a ludicrous number of size scales. Simple models effectively blur over these scales and fail to pull them apart. However, extra compute is not enough to do this effectively, as nonparametric models have demonstrated.

The key is injecting a sensible inductive bias into the model. Nonparametric models require this to be done explicitly, but this is almost impossible unless you're God. A better way is to express the bias as a "post-hoc query" in terms of the trained model and its interaction with the data. The only way to train such a model is iteratively, as it needs to update its bias retroactively. This can only be accomplished by a nonlinear (in parameters) parametric model that is dense in function space and possesses parameter counts proportional to the data size. Every model we know of that does this is called "a neural network".

mystraline 96 days ago

I've yet to see a model that trains AND applies the trained data real-time. That's basically every living being, from bacteria to plants to mammals.

Even PID loops have a training phase separate from recitation phase.

seanhunter 95 days ago

That’s not a meaningful technical obstacle. If you wanted to, you could just take the output of the model and use it at each iteration of the training phase to perform (badly) whatever task the model is intended to do.

The reason no one does this is you don’t have to and you’ll get much better results if you first fully train and then apply the best model you have to whatever problem. Biological systems don’t have that luxury.
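As a rough illustration of using the model at each iteration of training, here is a sketch of a predict-then-update loop. The task (fitting y = 2x + 1), the learning rate, and the names are all assumptions chosen for the example.

```python
import random

def online_learning(stream, lr=0.05):
    """Predict with the current weights, record the error, then update: no separate phases."""
    w, b = 0.0, 0.0
    errors = []
    for x, y in stream:
        prediction = w * x + b          # the model is "in use" right now
        error = prediction - y
        errors.append(abs(error))
        w -= lr * error * x             # one SGD step on squared error
        b -= lr * error
    return w, b, errors

rng = random.Random(0)
# Hypothetical task: learn y = 2x + 1 from a stream of noiseless examples.
stream = [(x, 2.0 * x + 1.0) for x in (rng.uniform(-1.0, 1.0) for _ in range(2000))]
w, b, errors = online_learning(stream)
early = sum(errors[:50]) / 50           # average error while still mostly untrained
late = sum(errors[-50:]) / 50           # average error near the end of the stream
```

The early predictions are poor and the late ones are good, which is the point being made: nothing prevents interleaving use and training, it just means living with the bad early outputs.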

robotresearcher 95 days ago

Reinforcement learning on real robots in real time has been done lots of times, since back in the 90s at least. It’s painfully slow.

mystraline 95 days ago

Why is it slow?

We know a human uses roughly 100 watts. And teaching a new specific task takes only showing maybe 10 times to get to 80%.

The learning function in humans is definitely connected with both training/recitation.

I'm seeing that as the big roadblock between thinking machines and a really big autocomplete we have now.

getnormality 96 days ago

> I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning.

Is this a practical viewpoint? Can you remove any of the specific architectural tricks used in Transformers and expect them to work about equally well?

musebox35 96 days ago

I think this question is one of the more concrete and practical ways to attack the problem of understanding transformers. Empirically the current architecture is the best to converge training by gradient descent dynamics. Potentially, a different form might be possible and even beneficial once the core learning task is completed. Also the requirements of iterated and continuous learning might lead to a completely different approach.

etiam 96 days ago

Did you see this one?

https://news.ycombinator.com/item?id=41732853

Someone 95 days ago

> Even a mouse needs hundreds of millions of neurons to do what a mouse does.

Under the very light assumption that a mouse doesn’t have neurons it doesn’t need, a mouse needs whatever number of neurons it has to do what a mouse does, so that’s not saying much.

Reading https://en.wikipedia.org/wiki/List_of_animals_by_number_of_n..., an ant has only 250k neurons and many reptiles can do with around 10 million.

That page also says 71 million for the house mouse. So what is it that a mouse does that reptiles do not do that requires them to have that much larger a brain? Caring for their children?

tim333 95 days ago

Mice seem to have quite a good representation of the 3d environment around them and motor skills. I had one in my flat run off and jump through an approx 1 x 2 inch hole 6 inches off the ground and about 10 inches from where it jumped from. Humans would probably have a job with that, and I've not seen a lizard, say, seem to have a similar ability to know its way around.

I daresay I don't think animals actually need some number of neurons. There's probably just a trade off between more giving better results versus being heavier and more energy consuming.

HappMacDonald 94 days ago

Mice do a hell of a lot more socialization than lizards, and mammalian socialization is more complex per individual (more competition, feinting, theory-of-mind-like strategies) than the eusocial insect strategies of "my body is the swarm, I just happen to be the limb I have direct control over".

sdenton4 95 days ago

Speed may be a factor - reptiles and mice live their lives at very different paces.

tbrownaw 96 days ago

> The conventional wisdom is that it was the combination of (1) exponentially more compute than in earlier eras with (2) exponentially larger, high-quality datasets (e.g., the curated and hand-labeled ImageNet set) that finally allowed deep neural networks to shine.

I'd thought it was some issue with training where older math didn't play nice with having too many layers.

etiam 96 days ago

Sigmoid-type activation functions were popular, probably for the bounded activity and some measure of analogy to biological neuron responses. They work, but get problematic scaling of gradient feedback outside their most dynamic span.

My understanding of the development is that persistent layer-wise pretraining with RBM or autoencoder created an initialization state where the optimization could cope even for more layers, and then when it was proven that it could work, analysis of why led to some changes such as new initialization heuristics, rectified linear activation, eventually normalizations ... so that the pretraining was usually not needed any more.

One finding was that the supervised training with the old arrangement often does work on its own, if you let it run much longer than people reasonably could afford to wait around for, just on speculation that ran contrary to observations in CPU computations in the 80s--00s. It has to work its way to a reasonably optimizable state using a chain of poorly scaled gradients first though.
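A small sketch of why sigmoid gradients scale poorly with depth, under the simplifying assumption that all weights are 1 so that only the activation derivatives matter. The depth of 20 and the pre-activation value of 4.0 are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)      # peaks at 0.25 when z == 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def chained_gradient(grad_fn, pre_activations):
    """Product of local derivatives along one path through the layers (weights taken as 1)."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product

depth = 20
best_case_sigmoid = chained_gradient(sigmoid_grad, [0.0] * depth)   # 0.25 ** 20
saturated_sigmoid = chained_gradient(sigmoid_grad, [4.0] * depth)   # outside the dynamic span
active_relu = chained_gradient(relu_grad, [4.0] * depth)            # stays at 1.0
```

Even in the sigmoid's best case the signal shrinks by a factor of four per layer, and it shrinks much faster once units saturate, whereas an active rectified linear unit passes the gradient through unchanged.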

cgearhart 96 days ago

A much earlier major win for deep learning was AlexNet for image recognition in 2012. It dominated the competition and within a couple years it was effectively the only way to do image tasks. I think it was Jeremy Howard who wrote a paper around 2017 wondering when we’d get a transfer learning approach that worked as well for NLP as convnets did for images. The attention paper that year didn’t immediately dominate. The hardware wasn’t good enough and there wasn’t consensus on belief that scale would solve everything. It took like five more years before GPT-3 took off and started this current wave.

I also think you might be discounting exactly how much compute is used to train these monsters. A single 1 GHz processor would take about 100,000,000 years to train something in this class. Even with on the order of 25k GPUs, training GPT-3 size models takes a couple months. Given the anemic RAM on GPUs a decade ago (I think we had K80 GPUs with 12GB vs 100’s of GBs on H100/H200 today), it was actually completely impossible to train a large transformer model prior to the early 2020s.

I’m even reminded how much gamers complained in the late 2010s about GPU prices skyrocketing because of ML use.
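For a sense of where a figure like that comes from, here is a back-of-the-envelope sketch. Every number in it is an assumption rather than something stated in the thread: the common rule of thumb of about 6 floating-point operations per parameter per training token, a model with 175e9 parameters trained on 300e9 tokens, and a 1 GHz core that completes one floating-point operation per cycle.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def training_flops(params, tokens):
    """Rule-of-thumb training cost: about 6 floating-point operations per parameter per token."""
    return 6.0 * params * tokens

def years_to_train(total_flops, flops_per_second):
    return total_flops / flops_per_second / SECONDS_PER_YEAR

# Assumptions (not from the thread): 175e9 parameters, 300e9 training tokens,
# and a 1 GHz core retiring one floating-point operation per cycle.
flops = training_flops(175e9, 300e9)
single_core_years = years_to_train(flops, 1e9)
print(f"{flops:.2e} FLOPs, about {single_core_years:.1e} years on one core")
```

Under these assumptions the estimate lands around ten million years. A less generous assumption about the core, such as ten cycles per useful operation, pushes it to roughly a hundred million, so the exact figure depends heavily on the inputs while the order of magnitude of the problem does not.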

porcoda 96 days ago

As others pointed out, the explosion of interest started with the deep convolutional networks that were applied in image problems. What I always thought was interesting was that prior to that, NNs were largely dismissed as uninteresting. When I took a course on them around the year 2000 that was the attitude most people took. It seems like what it took to spark renewed interest was ImageNet and seeing what you get when you have a ton of training data to throw at the problem and fast processors to help. After that the ball kept rolling with the subsequent developments around specific network architectures. In the broader community AlexNet is viewed as the big inflection point, but in the academic community you saw interest simmering a couple years earlier - I began to see more talks at workshops about NNs that weren’t being dismissed anymore, probably starting around 2008/09.

bobbruno 95 days ago

I played with NNs in the late 80's/early 90s, with little more than a copy of Hinton's paper, a PC and a C compiler. Obviously, I got no practical results. But I got the intuition of how they worked and what they could potentially do.

Cut to 2008-9, and I started to see smartphones, grid (then cloud) computing and social networks emerging. My MBA dissertation, finished in 2011, was about how that would change the world, because the requirements for meaningful AI were coming along - data and compute. The theory was already there, Hinton, LeCun, Schmidhuber, etc.

That got me back into the Data Science field, after years working in Data Engineering. Too bad I lived in Brazil back then and couldn't find a way to join the emerging scene in California and other top places. I'd be rich now...

srean 96 days ago

> NNs were largely dismissed

I agree with your larger point but dismissed is rather too strong. They were considered fiddly to train, prone to local minima, long training time, no clear guidelines about what the number of hidden layers and number of nodes ought to be. But for homework (toy) exercises they were still ok.

In comparison, kernel methods gave a better experience over all for large but not super large data sets. Most models had easily obtainable global minimum. Fewer moving parts and very good performance.

It turns out, however, that if you have several orders of magnitude more data, the usual kernels are too simple -- (i) they cannot take advantage of more data after a point and start twiddling the 10th place of decimal of some parameters and (ii) are expensive to train for very large data sets. So bit of a double whammy. Well, there was a third, no hardware acceleration that can compare with GPUs.

Kernels may make a comeback though, you never know. We need to find a way to compose kernels in a user friendly way to increase their modeling capacity. We had a few ways of doing just that but they weren't great. We need a breakthrough to scale them to GPT sized data sets.

In a way DNNs are "design your own kernels using data" whereas kernels came in any color you liked provided it was black (yes there were many types, but it was still a fairly limited catalogue. The killer was that there was no good way of composing them to increase modeling capacity that yielded efficiently trainable kernel machines)

energy123 95 days ago

Deepmind solving Atari games was another big milestone around that time.

whateverboat 96 days ago

The same thing happened with matrices. We had matrices for 400 years, but the field of linear algebra and especially numerical linear algebra exploded only with advent of computers.

In olden days, the correct way to solve a linear system of equations was to use theory of minors. With advent of computers, you suddenly had a huge theory of gaussian elimination, or Krylov spaces and what not.
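To illustrate the contrast, here is a sketch that solves the same small system two ways. It assumes "theory of minors" refers to Cramer's rule with determinants expanded by minors, which is an interpretation of the comment; the 3x3 system is an arbitrary example, and exact fractions are used so the two answers can be compared directly.

```python
from fractions import Fraction

def determinant(m):
    """Determinant by cofactor (minor) expansion along the first row: O(n!) work."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for col, entry in enumerate(m[0]):
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * entry * determinant(minor)
    return total

def solve_cramer(a, b):
    """Cramer's rule: x_i = det(A with column i replaced by b) / det(A)."""
    det_a = determinant(a)
    solution = []
    for i in range(len(a)):
        replaced = [row[:i] + [b[r]] + row[i + 1:] for r, row in enumerate(a)]
        solution.append(determinant(replaced) / det_a)
    return solution

def solve_gauss(a, b):
    """Gaussian elimination with row swaps, then back substitution: O(n^3) work."""
    n = len(a)
    aug = [row[:] + [b[r]] for r, row in enumerate(a)]
    for col in range(n):
        # Assumes a non-singular matrix, so a non-zero pivot always exists.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    x = [Fraction(0)] * n
    for r in reversed(range(n)):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x

A = [[Fraction(v) for v in row] for row in [[2, 1, -1], [-3, -1, 2], [-2, 1, 2]]]
b = [Fraction(v) for v in [8, -11, -3]]
```

Both routines agree on a system this small. The difference is cost: minor expansion grows factorially with the matrix size, while elimination grows cubically, which is why the latter is the one that became practical on computers.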

embedding-shape 96 days ago

> I understand that deep learning is accelerated by GPUs but the concept of a transformer could have been used on much slower hardware much earlier

But they don't give the same results at those smaller scales. People imagined, but no one could have put into practice because the hardware wasn't there yet. Simplified, LLMs are basically Transformers with the additional idea of "and a shitton of data to learn from", and for making training feasible with that amount of data, you do need some capable hardware.

BigTTYGothGF 96 days ago

The modern neural net revival got kicked off long before 2017.

noosphr 96 days ago

AlexNet in 2012 is only 5 years earlier.

j_bum 95 days ago

This video gives a great overview of the history of the acceleration:

https://youtu.be/glWvwvhZkQ8?si=-HGtfd_KHYfatEQ

Although it’s focused on Ilya, some great history is covered.

tim333 95 days ago

That was interesting.

HarHarVeryFunny 96 days ago

Without fast parallel hardware there would neither have been the incentive to design the Transformer, nor much benefit even if someone had come up with the design all the same!

The incentive to design something new - which became the Transformer - came from language model researchers who had been working with recurrent models such as LSTMs, whose recurrent nature made them inefficient to train (needing BPTT), and wanted to come up with a new seq-2-seq/language model that could take advantage of the parallel hardware that now existed and (since AlexNet) was now being used to good effect for other types of model.

As I understand it, the inspiration for the concept of what would become the Transformer came from Attention paper co-author Jakob Uszkoreit who realized that language, while superficially appearing sequential (hence a good match for RNNs) was in fact really parallel + hierarchical as can be seen by linguists' sentence parse trees where different branches of the tree reflect parallel analysis of different parts of the sentence, which are then combined at higher levels of the hierarchical parse tree. This insight gave rise to the idea of a language model that mirrored this analytical structure with hierarchical layers of parallel processing, with the parallel processing being the whole point since this could be accelerated by GPUs. While the concept was Uszkoreit's, it took another researcher, Noam Shazeer, to take the concept and realize it as a performant architecture - the Transformer.

Without the fast parallel hardware already pre-existing, there would not have been any incentive to design a new type of language model to take advantage of it!

The other point is that while the Transformer is a very powerful general purpose and scalable type of model, it only really comes into its own at scale. If a Transformer had somehow been designed in the pre-GPU-compute era, before the compute power to scale it up to massive size existed, then it would likely not have appeared so promising/interesting.

The other aspect to the history is that neural networks, of various types, have evolved in complexity and sophistication over time. RNNs and LSTMs came first, then Bahdanau attention as a way to improve their context focus and performance. Attention was now seen to be a valuable part of language and seq-2-seq modelling, so when GPUs motivated the Transformer, attention was retained, recurrence ditched, and hence "Attention is all you need".

The time was right for the Transformer to appear when it did, designed to take advantage of recent GPU advances, building on top of this new attention architecture, and now with the compute power and dataset size available that it started to really shine when scaled from GPT-1 to GPT-2 size, and beyond.
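The contrast between recurrence and attention described above can be sketched in a few lines. This is a deliberately stripped-down illustration, not the Transformer itself: it assumes no learned projections (queries, keys and values are all the raw inputs), a single head, and a toy recurrence in place of an LSTM.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query position."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]

def self_attention(xs):
    """Each output depends only on the inputs, so positions can be computed in any order."""
    return [attend(x, xs, xs) for x in xs]

def recurrent(xs, decay=0.5):
    """Toy recurrence: step t needs the state from step t-1, forcing a sequential loop."""
    state = [0.0] * len(xs[0])
    outputs = []
    for x in xs:
        state = [decay * s + xi for s, xi in zip(state, x)]
        outputs.append(state)
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical 2-d token embeddings
in_order = self_attention(tokens)
# Computing the positions in reverse gives identical per-position results.
out_of_order = [attend(tokens[i], tokens, tokens) for i in reversed(range(len(tokens)))]
```

Because no attention output waits on another, all positions can be handed to parallel hardware at once, whereas the recurrent loop has to walk the sequence one step at a time.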

CamperBob2 96 days ago

> the concept of a transformer could have been used on much slower hardware much earlier.

It could have been done in the early 1970s -- see "Paper tape is all you need" at https://github.com/dbrll/ATTN-11 and the various C-64 projects that have been posted on HN -- but the problem was that Marvin Minsky "proved" that there was no way a perceptron-based network could do anything interesting. Funding dried up in a hurry after that.

LPisGood 96 days ago

> Marvin Minsky "proved" that there was no way a perceptron-based network could do anything interesting

What result are you referring to?

CamperBob2 96 days ago

Haven't read the page but a promising-looking search result is here: https://seantrott.substack.com/p/perceptrons-xor-and-the-fir...

I'm sure it's an oversimplification to blame the entire 1970s AI winter on Minsky, considering they couldn't have gotten much further than the proof-of-concept stage due to lack of hardware. But his voice was a loud, widely-respected one in academia, and it did have a negative effect on the field.
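For readers unfamiliar with the limitation usually associated with this debate, here is a small sketch of the XOR problem: a single threshold unit cannot compute XOR, while two layers of the same units can. The grid search below is an illustration over a coarse set of candidate weights, not a proof, and the two-layer weights are hand-picked for the example.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(z):
    return 1 if z > 0 else 0

def unit(w1, w2, bias, x1, x2):
    """A single threshold unit: the building block of a perceptron."""
    return step(w1 * x1 + w2 * x2 + bias)

def single_unit_solves_xor(w1, w2, bias):
    return all(unit(w1, w2, bias, x1, x2) == y for (x1, x2), y in XOR.items())

# Search a coarse grid of weights; this illustrates the limitation, it is not a proof.
grid = [i / 2 for i in range(-8, 9)]
found_single = any(single_unit_solves_xor(w1, w2, b)
                   for w1, w2, b in product(grid, repeat=3))

def two_layer(x1, x2):
    """Hand-picked weights (an illustration): XOR = OR and not AND."""
    h_or = unit(1, 1, -0.5, x1, x2)
    h_and = unit(1, 1, -1.5, x1, x2)
    return unit(1, -1, -0.5, h_or, h_and)

two_layer_ok = all(two_layer(x1, x2) == y for (x1, x2), y in XOR.items())
```

No setting on the grid makes the single unit work, because XOR is not linearly separable, while the three-unit network gets all four cases right. The limitation applies to single-layer networks, which is part of why its practical significance was debated.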

antonvs 95 days ago

I suspect all Minsky did was reinforce what many people were already thinking. I experimented with neural nets in the late 80s and they seemed super interesting, but also very limited. My sense at the time was that the general thinking was, they might be useful if you could approach the number of neurons and connections in the human brain, but that seemed like a very far off, effectively impossible goal at the time.

quicklywilliam 96 days ago

Agreed, there is probably a theoretical world where we got enough money/compute together and had this explosion happen earlier.

Or perhaps a world where it happened later. I think a big part of what enabled the AI boom was the concentration of money and compute around the crypto boom.

xyhopguy 95 days ago

not really. early deep learning models were run on single consumer-grade GPUs. the inflection occurred _right_ when parallel computing became fast enough to do backprop in a reasonable amount of time with performance better than tree methods.

at that time all the compute resources in the world would not have been enough to train the models from even the last ~6 years or so, probably more.

slashdave 96 days ago

Deep-learning hinges on highly redundant solution space (highly redundant weights), along with normalized weights (optimization methodology is commoditized). The original neural network work had no such concepts.

wslh 96 days ago

Don't underestimate the massive data you need to make those networks tick. Also, it was impracticable with slow training algorithms, whether they ran on GPUs or CPUs.

teekert 96 days ago

If you are in the radiology field it started “exploding” much earlier, with CNNs.