
Narrow to a tiny training set? What are you talking about now? That has nothing to do with deep learning.

GPT-3.5 was trained on at least 300 billion tokens. Its neural network has 96 layers and 175 billion parameters. Each of those 96 stacked layers has an attention mechanism that recomputes an attention score for every token in the context window, for each new token generated in sequence. GPT-4 is much bigger than that. The scale and complexity of these models are beyond comprehension. We're talking about LLMs, not SLMs.
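
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the configuration published for the largest GPT-3 model (hidden size 12288, 96 attention heads, 2048-token context), since the comment's 96-layer, 175-billion-parameter figures match it; whether GPT-3.5 uses exactly those values is an assumption. The 12 * layers * d_model^2 rule is a common approximation that ignores embeddings, biases, and layer norms, and the function names are made up for illustration.

```python
# Assumed configuration: the largest model in the GPT-3 paper.
# GPT-3.5 is assumed (not confirmed) to share it.
N_LAYERS = 96
D_MODEL = 12288
N_HEADS = 96
CONTEXT_LEN = 2048


def approx_params(n_layers: int, d_model: int) -> int:
    """Approximate transformer weight count.

    Per layer: about 4 * d^2 weights for attention (Q, K, V, output
    projections) plus about 8 * d^2 for the MLP (two d x 4d matrices).
    Embeddings, biases, and layer norms are ignored.
    """
    return 12 * n_layers * d_model ** 2


def attention_scores_per_new_token(n_layers: int, n_heads: int, tokens_in_context: int) -> int:
    """Attention scores computed to generate one new token.

    Every head in every layer scores the new token against each token
    already in the context window.
    """
    return n_layers * n_heads * tokens_in_context


params = approx_params(N_LAYERS, D_MODEL)
scores = attention_scores_per_new_token(N_LAYERS, N_HEADS, CONTEXT_LEN)

print(f"approximate parameters: {params:,}")
print(f"attention scores per new token at full context: {scores:,}")
```

The estimate comes out to roughly 174 billion weights, close to the quoted 175 billion, and about 19 million attention scores for a single generated token once the context window is full.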