
>How is the actual VRAM requirement calculated (says 175B on mid-range GPU)

A back-of-the-envelope calculation for GPT-like models is simple: divide the number of parameters by the number of layers and multiply by the size of your datatype. This ignores the embeddings and LM head size, so you overestimate the layer size and get some safety margin that way. You also need to leave some memory for activations and maybe the KV cache, but we will ignore that here. Say, for OPT-175B, which has 96 layers, at fp16 precision we get 2 * 175e9 / 96 ~= 3.65GB of VRAM per layer. It should fit into an 8GB GPU, I think.
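The arithmetic above can be written as a tiny helper. This is only a sketch of the estimate described in the comment; the function name `per_layer_bytes` is made up for illustration.

```python
def per_layer_bytes(total_params: float, num_layers: int, bytes_per_param: int = 2) -> float:
    """Rough upper bound on the size of one layer, in bytes.

    Embeddings and the LM head are deliberately left in total_params,
    so the result overestimates the real layer size a little.
    Activations and KV cache are not included.
    """
    return total_params / num_layers * bytes_per_param


# OPT-175B numbers from the comment: 175e9 parameters, 96 layers, fp16 (2 bytes).
est_bytes = per_layer_bytes(175e9, 96, bytes_per_param=2)
print(f"{est_bytes / 1e9:.2f} GB per layer")  # 3.65 GB per layer
```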

If you want a more precise estimate of the memory consumed by one layer of your GPT-like model, subtract the embedding and LM head sizes from the total parameter count and divide the remainder by the number of layers. Or you can instantiate one layer in a Python REPL and measure its parameter count with this function: https://stackoverflow.com/a/62508086
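A minimal sketch of the "subtract and divide" variant follows. The vocabulary size, hidden size, and position count used in the example are my assumptions about OPT-175B, not figures from this comment, so check them against the model config before relying on the result. The helper name and the `tied_lm_head` flag are hypothetical too.

```python
def per_layer_params(total_params: float, vocab_size: int, d_model: int,
                     max_positions: int, num_layers: int,
                     tied_lm_head: bool = True) -> float:
    """Parameters per layer after removing embeddings and the LM head."""
    # Token embeddings plus learned position embeddings.
    embedding_params = (vocab_size + max_positions) * d_model
    # A tied LM head reuses the token embedding matrix, so it adds nothing extra.
    lm_head_params = 0 if tied_lm_head else vocab_size * d_model
    return (total_params - embedding_params - lm_head_params) / num_layers


# ASSUMED values for OPT-175B (not from the comment): vocab 50272,
# hidden size 12288, 2048 positions, tied LM head. 175e9 is the nominal count.
layer_params = per_layer_params(175e9, vocab_size=50272, d_model=12288,
                                max_positions=2048, num_layers=96)
layer_bytes = layer_params * 2  # fp16: 2 bytes per parameter
print(f"{layer_bytes / 1e9:.2f} GB per layer")
```

With these assumed values the result lands slightly below the rough 3.65GB estimate, which matches the point that the simple formula overestimates. If you go the REPL route with PyTorch instead, `sum(p.numel() for p in layer.parameters())` gives the parameter count of an instantiated layer.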

About performance. From my early benchmarks I see that:

1. You obviously need to store at least one layer on the GPU or CPU+GPU at the precision of your model (most commonly bf16 or fp16, so 2 bytes per parameter). The unit could even be a dense sublayer, basically one matrix, and we could imagine a scheme to extend this library to load these dense layers as shards, but I don't think it's necessary right now.

2. In the small-batch regime, your inference performance is bottlenecked by disk read bandwidth. (I could imagine it starts being bottlenecked by the tensor materialization code on very fast SSDs; we might need a native extension there, but only after a good benchmark.) You can mask some, but not all, of this by cleverly interleaving tensor materialization and layer computation. As you grow the batch size, you start being bottlenecked by the compute throughput of your main computation engine (could be a CPU, GPU, or some exotic accelerator). You can also become bottlenecked by memory bandwidth if your batch size is too small, but that is more about the case of running from RAM. Personally, I don't yet see performance wins from keeping more than one but fewer than all layers in RAM, but being able to fit all layers in RAM obviously makes a great difference.
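The interleaving from point 2 can be sketched roughly like this: load layer i+1 in a background thread while layer i is computing. This is not the library's actual code; `load_layer` and `apply_layer` are hypothetical callables standing in for "read and materialize weights from disk" and "run one layer".

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_layers_prefetched(num_layers: int,
                          load_layer: Callable[[int], Any],
                          apply_layer: Callable[[Any, Any], Any],
                          x: Any) -> Any:
    """Run layers in order, prefetching the next layer during computation.

    At most two layers are held at once: the one being computed
    and the one being loaded.
    """
    if num_layers == 0:
        return x
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(num_layers):
            weights = pending.result()  # blocks until layer i is loaded
            if i + 1 < num_layers:
                # Start reading the next layer before computing this one.
                pending = pool.submit(load_layer, i + 1)
            x = apply_layer(weights, x)
    return x


# Toy stand-ins: "weights" are just integers, a "layer" multiplies by them.
out = run_layers_prefetched(4, lambda i: i + 1, lambda w, x: x * w, 1)
print(out)  # 24
```

This only hides the load time behind the compute time. When loading a layer takes longer than computing it, as in the small-batch case, the loop still waits on `pending.result()`, which is why only part of the disk cost can be masked.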