
But we got here step by step, as other interesting use cases came up that needed somewhat less compute: image recognition, early forms of image generation, AlphaGo, AlphaZero for chess. These were all earlier forms of deep neural networks that were much more reasonable to train than a top-of-the-line LLM today, but seemed expensive at the time. And ultimately a lot of this also comes from the hardware advancements and the math advancements. If you took classes on neural networks in the 1990s, you'd have noticed that they mostly covered 1 or 2 hidden layers, with not much focus on the math to train large networks, precisely because of how daunting the compute costs were for anything that wasn't a toy. But then came video card hardware, and improvements in using it for gradient descent, which made going past silly 3-layer networks somewhat reasonable.

Every bet makes perfect sense once you consider how promising the previous one looked, and how much cheaper compute was getting. Imagine being tasked with training an LLM in 1995: all the architectural knowledge we have today plus a state-level mandate would not have gotten you very far. Just the amount of fast memory we bring to bear wouldn't have been viable until relatively recently.