by suprjami 239 days ago
And a 14B model generating at 22 tokens/s means you won't be using that 128 GB of RAM for inference either.
2 comments

Yeah I’m honestly unclear on Nvidia’s thinking here - inference speed is unbelievably slow for the price.

Given the extreme advantage they have with CUDA and the whole AI/ML ecosystem, barely matching Apple’s M-ultra speeds is a choice…

Definitely a choice to give it low memory bandwidth. Probably to avoid customers thinking it can replace any data center GPU for inference use-cases.
You can use all 128 GB if you run an MoE model, since only the active experts are read for each token.
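
To put rough numbers on the bandwidth argument, here is a back-of-the-envelope sketch. It assumes decoding is purely memory-bandwidth bound, so each generated token streams every active weight once. The 250 GB/s bandwidth, the 8-bit weights, and the 100B-total / 5B-active MoE are made-up illustrative values, not specs of any real device or model, and the function names are mine.

```python
def decode_tokens_per_second(bandwidth_gb_s: float,
                             active_params_billion: float,
                             bytes_per_param: float) -> float:
    """Rough upper bound on decode speed when memory bandwidth is the limit.

    Each generated token has to read every *active* weight once, so the
    ceiling is bandwidth divided by the bytes read per token.
    """
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


def weights_footprint_gb(total_params_billion: float,
                         bytes_per_param: float) -> float:
    """Memory needed to hold all weights (ignores KV cache and activations)."""
    return total_params_billion * bytes_per_param


# Hypothetical values, chosen only for illustration.
BANDWIDTH_GB_S = 250.0   # assumed memory bandwidth
BYTES_PER_PARAM = 1.0    # assumed 8-bit quantized weights

# Dense 14B model: all 14B parameters are read for every token.
dense_tps = decode_tokens_per_second(BANDWIDTH_GB_S, 14.0, BYTES_PER_PARAM)

# Hypothetical MoE: 100B parameters in memory, 5B active per token.
moe_tps = decode_tokens_per_second(BANDWIDTH_GB_S, 5.0, BYTES_PER_PARAM)
moe_gb = weights_footprint_gb(100.0, BYTES_PER_PARAM)

print(f"dense 14B: ~{dense_tps:.1f} tok/s ceiling")
print(f"MoE 100B total / 5B active: ~{moe_tps:.1f} tok/s ceiling, {moe_gb:.0f} GB of weights")
```

Under those assumptions the MoE fills most of the RAM yet decodes faster than the dense 14B, because speed tracks active parameters while memory use tracks total parameters. Real throughput will land below these ceilings.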