by heymijo
433 days ago
GPUs are very good for pretraining but inefficient for inference. Why? For each new word a transformer generates, it has to move the entire set of model weights from memory to the compute units. For a 70 billion parameter model with 16-bit weights, that means moving approximately 140 gigabytes of data to generate just a single word.

GPUs have off-chip memory, so a GPU has to push data across a chip-to-memory bridge for every single word it creates. This architectural choice is an advantage for graphics processing, where large amounts of data need to be stored but not necessarily accessed as rapidly for every single computation. It's a liability in inference, where quick and frequent data access is critical.

Listening to Andrew Feldman of Cerebras [0] is what helped me grok the differences. Caveat: he is a founder/CEO of a company that sells hardware for AI inference, so the guy is talking his book.

[0] https://www.youtube.com/watch?v=MW9vwF7TUI8&list=PLnJFlI3aIN...
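To make the 140 GB figure concrete, here is a rough back-of-the-envelope sketch in Python. The function names and the 3 TB/s memory bandwidth are my own assumptions for illustration, not numbers from the talk or from any specific GPU. It also assumes a dense model generating one word at a time with no batching, so every weight is read once per word.

```python
def weight_bytes(num_params: float, bits_per_weight: int) -> float:
    """Total bytes needed to hold the model weights."""
    return num_params * bits_per_weight / 8


def max_words_per_second(num_params: float, bits_per_weight: int,
                         mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on generation speed if every word requires streaming
    all weights from memory once (ignores compute, caches, and batching)."""
    return mem_bandwidth_bytes_per_s / weight_bytes(num_params, bits_per_weight)


if __name__ == "__main__":
    params = 70e9              # 70 billion parameters, as in the comment
    bits = 16                  # 16-bit weights, as in the comment
    assumed_bandwidth = 3e12   # ASSUMPTION: 3 TB/s off-chip memory bandwidth

    print(f"weights moved per word: {weight_bytes(params, bits) / 1e9:.0f} GB")
    print(f"bandwidth-bound ceiling: "
          f"{max_words_per_second(params, bits, assumed_bandwidth):.1f} words/s")
```

Under those assumptions the ceiling comes out to a couple dozen words per second for a single stream, no matter how much compute sits on the chip. Batching many requests amortizes the weight traffic, which is why real serving setups look better than this single-stream bound.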
|
I wish I could say more about what AMD is doing in this space, but keep an eye on their MI4xx line.