Hacker News
by Der_Einzige 2184 days ago
Yet another paper with results that basically look like this: https://d3b8hk1o42ev08.cloudfront.net/wp-content/uploads/201...

Still impressive, don't get me wrong, but I'm starting to believe that NLP will be increasingly dominated by the big players, since they are the only ones who can train a 1 TRILLION parameter model (they show that in the paper). I can't even do inference with a 36 layer, 2048 neuron per layer network with my GTX 2080ti. Sad...

2 comments

"I can't even do inference with a 36 layer, 2048 neuron per layer network with my GTX 2080ti."

Not even for a single instance? Your GPU has 11GB of RAM. With 36 * 2048 neurons, that's roughly 149 KB per neuron. Why isn't that enough? Is the input really large, or does each neuron have very high precision?

There's an extremely large number of parameters per "neuron". The 600B parameters will take up more than 1TB of memory, far too much for the 2080 Ti, or even for main memory on most systems.

I'm not talking about inference on a 600B parameter model. GP said they can't do inference on a 36-layer, 2048-neurons-per-layer network. Let's assume every layer is fully connected, so each neuron has 2048 parameters. That's 36 * 2048 * 2048, or about 151M parameters, in 11GB of RAM: roughly 73 bytes per parameter. If each parameter is 4 bytes (that seems like a lot of precision), plus 4 bytes per calculated value, you're still only using about 10% of the GPU's RAM. You should be able to do inference on a batch of 16-20 examples at a time.

What have I missed?
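
For anyone who wants to check the back-of-the-envelope numbers in this exchange, here is a minimal sketch. It assumes the 36 layers quoted from the original comment, fully connected 2048 x 2048 layers with no biases, and 11GB taken as 11e9 bytes. The helper names are made up for illustration, and none of this reflects the paper's actual architecture.

```python
GPU_BYTES = 11e9  # assumption: 11 GB counted in decimal bytes


def dense_params(n_layers: int, width: int) -> int:
    """Weights in a stack of fully connected width x width layers (biases ignored)."""
    return n_layers * width * width


def memory_bytes(n_params: int, bytes_per_param: int) -> int:
    """Memory needed if every parameter costs bytes_per_param bytes."""
    return n_params * bytes_per_param


small = dense_params(36, 2048)
print(f"{small:,} parameters in the small network")
print(f"{GPU_BYTES / small:.0f} bytes of GPU memory available per parameter")

# 4 bytes per weight plus 4 bytes per calculated value, as in the comment above
used = memory_bytes(small, 4 + 4)
print(f"{used / GPU_BYTES:.1%} of GPU memory used")

# The 600B-parameter model, for contrast, at 2 and 4 bytes per parameter
big = 600_000_000_000
for width_bytes in (2, 4):
    print(f"{memory_bytes(big, width_bytes) / 1e12:.1f} TB at {width_bytes} bytes/param")
```

The small network comes out at about 151M parameters, while the 600B model exceeds 1TB even at 2 bytes per parameter.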

"2048 neurons per layer" isn't really an accurate description. What he means is 2048-dimensional embeddings at each layer. The actual multi-head attention layers in a transformer are not just a 2048*2048 feed-forward matrix; they have many more parameters. That's why there are 600B in total.
Stay tuned for algorithmic advancements.
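
To put a rough number on the point about transformer layers having more than 2048*2048 parameters, here is a sketch under textbook assumptions: a standard transformer block with four d_model x d_model attention projections (Q, K, V, output) and a two-matrix feed-forward sublayer with hidden size 4 * d_model. Biases, layer norms, and embeddings are ignored. The function name and the `ffn_mult` parameter are hypothetical, and the paper's real layer layout is not described in this thread.

```python
def transformer_block_params(d_model: int, ffn_mult: int = 4) -> int:
    """Approximate weight count of one standard transformer block."""
    attention = 4 * d_model * d_model                  # Q, K, V, and output projections
    feed_forward = 2 * d_model * (ffn_mult * d_model)  # up- and down-projection
    return attention + feed_forward


d_model = 2048
naive = d_model * d_model                  # the "2048*2048 per layer" guess
per_block = transformer_block_params(d_model)

print(f"naive per layer:  {naive:,}")
print(f"per block:        {per_block:,}")
print(f"ratio:            {per_block // naive}x")
print(f"36 blocks total:  {36 * per_block:,}")
```

Under these assumptions a block holds 12 times the naive 2048*2048 count. Embeddings and any architecture-specific components come on top of that.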