| Thanks for the helpful reply! As I wasn't able to fully understand it still, I pasted your reply in chatgpt and asked it some follow up questions and here is what i understand from my interaction: - Big models like GPT-4 are split across many GPUs (sharding). - Each GPU holds some layers in VRAM. - To process a request, weights for a layer must be loaded from VRAM into the GPU's tiny on-chip cache before doing the math. - Loading into cache is slow, the ops are fast though. - Without batching: load layer > compute user1 > load again > compute user2. - With batching: load layer once > compute for all users > send to gpu 2 etc - This makes cost per user drop massively if you have enough simultaneous users. - But bigger batches need more GPU memory for activations, so there's a max size. This does makes sense to me but does this sound accurate to you? Would love to know if I'm still missing something important. |
The limiting factor compared to local is dedicated VRAM - if you dedicate 80GB of VRAM locally 24 hours/day so response times are fast, you're wasting most of the time when you're not querying.