by adam_arthur 1249 days ago
So it sounds like this is a question of loading the model into VRAM, and not a question of the cost of a single query. I assume once a model is loaded, many queries can be serviced by that model quickly.

There's nothing incorrect about my assertion. If it actually took many GPUs to service a single query, there would be no cost-viable mass-scale consumer product. That's just a clear economic fact, regardless of whether a model could theoretically be spun up in a cost-inefficient manner.

And even hundreds of GB of VRAM is not far off from consumer hardware. Look at how quickly graphics RAM has expanded over time: roughly 10x in about 10 years for high-end cards, from a cursory glance at various Nvidia cards. On the same trajectory we could see a 400GB VRAM card within the next decade (though that rests on a lot of assumptions).
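To make the extrapolation concrete, here is a rough sketch of the arithmetic, assuming growth compounds smoothly. The 40 GB baseline is not stated in the comment; it is only what the "10x" and "400GB" figures imply when taken together, and `project_vram` is a hypothetical helper name.

```python
def annual_growth(growth_factor: float, period_years: float) -> float:
    """Per-year multiplier implied by `growth_factor` over `period_years`."""
    return growth_factor ** (1.0 / period_years)


def project_vram(baseline_gb: float, growth_factor: float,
                 period_years: float, horizon_years: float) -> float:
    """Extrapolate VRAM capacity, assuming the historical rate simply continues."""
    return baseline_gb * annual_growth(growth_factor, period_years) ** horizon_years


if __name__ == "__main__":
    # Assumption: 40 GB baseline, inferred from the comment's 10x and 400GB figures.
    rate = annual_growth(10, 10)
    print(f"implied yearly growth: {(rate - 1) * 100:.1f}%")
    print(f"projected after 10 years: {project_vram(40, 10, 10, 10):.0f} GB")
    print(f"projected after 5 years: {project_vram(40, 10, 10, 5):.0f} GB")
```

10x per decade works out to roughly 26% per year, so the projection is very sensitive to whether that rate actually holds.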

2 comments

> I assume once a model is loaded, many queries can be serviced by that model quickly.

Depends. If you have room to load the whole model, yes. If you need to swap parts of the model in and out, then it matters whether you have enough RAM.

You really are like a chatbot... Look at the last three node sizes and the density of RAM on them. It's not going to happen as fast as you dream, especially not given the discounts of the last gens. The hope is to go to fp4 if you want to run it on consumer hardware, and even then we are still talking about 2-3 cards. Why not at least try to Google before hammering down on stupid and uninformed hot takes?
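For a sense of scale on the precision point, a back-of-the-envelope sketch of weight memory at different precisions. The thread gives neither a parameter count nor a card size, so the 175e9 parameters and 24 GB / 48 GB cards below are purely illustrative assumptions. The estimate covers weights only and ignores activations, KV cache and framework overhead, so real requirements would be higher.

```python
import math


def weights_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def cards_needed(n_params: float, bits_per_param: int, card_vram_gb: float) -> int:
    """Minimum number of cards whose combined VRAM holds the weights."""
    return math.ceil(weights_gb(n_params, bits_per_param) / card_vram_gb)


if __name__ == "__main__":
    n_params = 175e9  # assumption: hypothetical model size, not from the thread
    for bits in (16, 8, 4):
        size = weights_gb(n_params, bits)
        # assumption: 24 GB and 48 GB are illustrative per-card capacities
        print(f"{bits:>2}-bit: {size:6.1f} GB, "
              f"{cards_needed(n_params, bits, 24)} x 24 GB or "
              f"{cards_needed(n_params, bits, 48)} x 48 GB cards")
```

Halving the bits per parameter halves the footprint, which is why 4-bit formats matter for consumer cards, but the card count still depends entirely on the assumed model size and per-card VRAM.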