by adam_arthur
1249 days ago
So it sounds like this is a question of loading the model into VRAM, not of the cost of a single query. I assume that once a model is loaded, it can service many queries quickly. There's nothing incorrect about my assertion. If it actually took many GPUs to service one query, there would be no cost-viable consumer product at mass scale. That's just a clear economic fact, regardless of whether a model could theoretically be spun up in a cost-inefficient manner. And even hundreds of GB of VRAM is not far off from consumer hardware. Look at how quickly graphics RAM has expanded over time: roughly 10x in about 10 years for high-end cards, from a cursory glance at various Nvidia cards. On the same trajectory we could see a 400GB VRAM card within the next decade (though that rests on a lot of assumptions).
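To make the extrapolation concrete, here is a rough back-of-the-envelope sketch. It assumes steady compound growth at the "10x in 10 years" rate, and the 40 GB starting point is only the baseline implied by the 400GB figure, not a measured number. The function names are made up for illustration.

```python
import math


def annual_growth_rate(growth_factor: float, period_years: float) -> float:
    """Per-year multiplier implied by `growth_factor` over `period_years`."""
    return growth_factor ** (1.0 / period_years)


def projected_vram_gb(current_gb: float, growth_factor: float,
                      period_years: float, years_ahead: float) -> float:
    """Extrapolate VRAM capacity assuming the same compound growth continues."""
    rate = annual_growth_rate(growth_factor, period_years)
    return current_gb * rate ** years_ahead


def years_to_reach(current_gb: float, target_gb: float,
                   growth_factor: float, period_years: float) -> float:
    """Years until `target_gb` is reached at the assumed growth rate."""
    rate = annual_growth_rate(growth_factor, period_years)
    return math.log(target_gb / current_gb) / math.log(rate)


if __name__ == "__main__":
    # Assumed baseline: a 40 GB high-end card today (illustrative only).
    print(annual_growth_rate(10, 10))
    print(projected_vram_gb(40, 10, 10, 10))
    print(years_to_reach(40, 400, 10, 10))
```

10x per decade works out to roughly 26% growth per year, so a 40 GB baseline lands at about 400 GB after ten years. Whether the trend actually holds is the big assumption.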
|
Depends. If you have room to load the whole model, yes. If you need to swap parts of the model in and out, then it matters whether you have enough RAM.
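A quick way to sanity-check the "room to load the whole model" case is to estimate the weights-only footprint as parameter count times bytes per parameter. This is only a sketch: the 175-billion-parameter size and 2 bytes per parameter (fp16) are illustrative assumptions, the function names are hypothetical, and the estimate ignores activations and any other runtime memory.

```python
def weights_footprint_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits_in_vram(num_params: float, bytes_per_param: float, vram_gb: float) -> bool:
    """True if the weights alone fit; real usage needs extra headroom."""
    return weights_footprint_gb(num_params, bytes_per_param) <= vram_gb


if __name__ == "__main__":
    # Assumed example: 175e9 parameters stored as fp16 (2 bytes each).
    print(weights_footprint_gb(175e9, 2))
    print(fits_in_vram(175e9, 2, 40))    # single 40 GB card (assumed size)
    print(fits_in_vram(175e9, 2, 400))   # hypothetical 400 GB card
```

Under those assumptions the weights alone come to about 350 GB, so they would not fit on one 40 GB card without splitting or swapping, but would fit on a hypothetical 400 GB one.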