Does it even make sense calling them 'GPUs' (I just checked NVIDIA product page for the H100 and it is indeed so)?
There should be a quicker way to differentiate between 'consumer-grade hardware that is mainly meant to be used for gaming and can also run LLMs inference in a limited way' and 'business-grade hardware whose main purpose is AI training or running inference for LLMs".
I remember the building hugh end workstations for a summer job in the 2000s, where I had to fit Tesla cards in the machines. I don't remember what their device names were, we just called them tesla cards.
I think apple calls them NPUs and Broadcom calls them XPUs. Given they’re basically the number 2 and 3 accelerator manufacturers one of those probably works.
Consumer GPUs in theory and by a large margin (10 5090s will eat an H100 lunch with 6 times the bandwidth, 3x VRAM and a relatively similar compute ratio), but your bottleneck is the interconnect and that is intentionally crippled to avoid beowulf GPU clusters eating into their datacenter market.
Last consumer GPU with NVLink was the RTX 3090. Even the workstation-grade GPUs lost it.
H100s also has custom async WGMMA instructions among other things. From what I understand, at least the async instructions formalize the notion of pipelining, which engineers were already implicitly using because to optimize memory accesses you're effectively trying to overlap them in that kind of optimal parallel manner.
Although the RTX Pro 6000 is not consumer-grade, it does come with graphics ports (four Displayports) and does render graphics like a consumer card :) So seems the difference between the segments is becoming smaller, not bigger.
Sure, but it still sits in the 'business-grade hardware whose main purpose is AI training or running inference for LLMs" segment parent mentioned, yet have graphics connectors so the only thing I'm saying is that just looking at that won't help you understand what segment the GPU goes into.
With Ollama i got the 20B model running on 8 TitanX cards (2015). Ollama distributed the model so that the 15GB of vram required was split evenly accross the 8 cards. The tok/s were faster than reading speed.
Unless you’re running it 24/7 for multiple years, it’s not going to be cost effective to buy the GPU instead of renting a hosted one.
For personal use you wouldn’t get a recent generation data center card anyway. You’d get something like a Mac Studio or Strix Halo and deal with the slower speed.
I rented H100 for training a couple of times and I found that they couldn't do training at all. Same code worked fine on Mac M1 or RTX 5080, but on H100 I was getting completely different results.
So I wonder what I could be doing wrong. In the end I just use RTX 5080 as my models fit neatly in the available RAM.
* by not working at all, I mean the scripts worked, but results were wrong. As if H100 couldn't do maths properly.
This comment made my day ty! Yeah definitely speaking from a datacenter perspective -- fastest piece of hardware I have in the parts drawer is probably my old iPhone 8.
There should be a quicker way to differentiate between 'consumer-grade hardware that is mainly meant to be used for gaming and can also run LLMs inference in a limited way' and 'business-grade hardware whose main purpose is AI training or running inference for LLMs".