| HN Mirror



	by suprjami 239 days ago
	And a 14B model running at 22 tok/s (token generation) means you won't be using that 128 GB of RAM for inference either.

2 comments

Yeah I’m honestly unclear on Nvidia’s thinking here - inference speed is unbelievably slow for the price.

Given the extreme advantage they have with CUDA and the whole AI/ML ecosystem, barely matching Apple’s M-ultra speeds is a choice…

Definitely a choice to give it low memory bandwidth. Probably to avoid customers thinking it can replace any data center GPU for inference use-cases.
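To put rough numbers on the bandwidth point: token generation is usually memory-bound, so a common back-of-envelope estimate is tokens/s ≈ memory bandwidth / bytes of weights read per token. The sketch below backs out the effective bandwidth implied by the "14B at 22 tok/s" figure quoted above, then asks what a dense model filling all 128 GB would get. The 1 byte per parameter (8-bit weights) is my assumption, since the thread doesn't say which quantization was used, and the function names are just illustrative. It also ignores KV cache reads and compute limits, so treat the results as ballpark only.

```python
def implied_bandwidth_gb_s(params_billion: float,
                           bytes_per_param: float,
                           tokens_per_second: float) -> float:
    """Effective bandwidth implied by an observed decode speed.

    Assumes every generated token streams all weights from memory once.
    """
    weights_gb = params_billion * bytes_per_param
    return weights_gb * tokens_per_second


def decode_tokens_per_second(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Rough upper bound on decode speed for a memory-bound model."""
    return bandwidth_gb_s / weights_read_gb


# 14B parameters at 22 tok/s (from the thread); 1 byte/param is an ASSUMPTION.
effective_bw = implied_bandwidth_gb_s(14, 1.0, 22)

# A dense model whose weights fill the whole 128 GB.
dense_128_tps = decode_tokens_per_second(effective_bw, 128)

print(f"implied effective bandwidth: {effective_bw:.0f} GB/s")
print(f"dense 128 GB model: about {dense_128_tps:.1f} tok/s")
```

Under that assumption a dense model using the full 128 GB would land at a couple of tokens per second, which is the parent comment's point. With 16-bit weights the implied bandwidth doubles, but the conclusion doesn't change much.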

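You can use all 128 GB if you run an MoE model: only the active experts' weights are read for each token, so the whole model has to fit in memory but decode speed depends on the much smaller active set.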
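A quick sketch of why MoE changes the picture, using the same memory-bound estimate (tokens/s ≈ bandwidth / bytes read per token). All the numbers here are hypothetical placeholders, not specs for any real device or model: a 300 GB/s effective bandwidth, a 120B-parameter MoE with 12B active parameters per token, and 1 byte per parameter. It ignores router overhead, KV cache traffic and compute limits.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight size in GB for a given parameter count and precision."""
    return params_billion * bytes_per_param


def memory_bound_tps(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Rough upper bound on decode speed when memory bandwidth is the limit."""
    return bandwidth_gb_s / gb_read_per_token


# HYPOTHETICAL inputs, chosen only for illustration.
bandwidth_gb_s = 300.0
total_params_b = 120.0   # must all fit in RAM
active_params_b = 12.0   # parameters actually read per token
bytes_per_param = 1.0    # assumes 8-bit weights

footprint_gb = weights_gb(total_params_b, bytes_per_param)

# A dense model of the same size reads every weight for every token.
dense_tps = memory_bound_tps(bandwidth_gb_s, footprint_gb)

# The MoE model only reads the active experts.
moe_tps = memory_bound_tps(bandwidth_gb_s, weights_gb(active_params_b, bytes_per_param))

print(f"memory footprint: {footprint_gb:.0f} GB")
print(f"dense: about {dense_tps:.1f} tok/s, MoE: about {moe_tps:.1f} tok/s")
```

So the capacity gets used by the total parameter count while the speed is set by the active parameter count, which is how a bandwidth-limited box can still make use of 128 GB.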