My issue is figuring out how many concurrent users a given GPU can support on average.
Working out the VRAM needed just to load the weights is easy enough. But when you allow for something like content generation, with varying lengths of input/output tokens, how do you even begin to identify the GPUs you need?
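The only back-of-envelope approach I know of is to treat the KV cache as the per-user cost: beyond the weights, each token held in context (input plus generated output) takes roughly 2 x layers x KV heads x head dim x bytes per element. Below is a rough sketch of that arithmetic, not a real sizing tool. The function names, the 0.9 usable-VRAM fraction, and the example model shape (32 layers, 32 KV heads, head dim 128, fp16, about 7B parameters) are all my own assumptions for illustration. It also ignores activation memory, fragmentation, and latency targets.

```python
import math


def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies (keys + values across all layers)."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(vram_bytes, weight_bytes, kv_bytes_per_token,
                            avg_tokens_per_request, usable_fraction=0.9):
    """Rough upper bound on requests whose KV cache fits alongside the weights.

    usable_fraction is an assumed safety margin for activations and overhead.
    """
    budget = vram_bytes * usable_fraction - weight_bytes
    if budget <= 0:
        return 0  # the weights alone do not fit within the usable VRAM
    per_request = kv_bytes_per_token * avg_tokens_per_request
    return math.floor(budget / per_request)


# Hypothetical example: a 7B-parameter model in fp16 on a 24 GiB GPU.
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
users = max_concurrent_requests(
    vram_bytes=24 * 1024**3,
    weight_bytes=7_000_000_000 * 2,   # 2 bytes per parameter in fp16
    kv_bytes_per_token=per_token,
    avg_tokens_per_request=2048,      # assumed average input + output length
)
print(per_token, users)
```

With those assumed numbers it comes out to about 0.5 MiB per token and 8 concurrent requests at 2048 tokens each, so the answer swings heavily with whatever average length you plug in, which is exactly the part I don't know how to pin down.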