


	by m0th87 417 days ago
	That’s what I hope for, but everything with unified memory that isn’t bananas expensive has very low memory bandwidth. DGX (Digits), Framework Desktop, and non-Ultra Macs are all around 128 GB/s, and will produce single-digit tokens per second for larger models: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen... So there’s a fundamental tradeoff between cost, inference speed, and hostable model size for the foreseeable future.
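
	A rough back-of-envelope sketch of why that bandwidth figure caps speed, assuming decoding is memory-bandwidth-bound and every generated token reads all of the model's weights once (this ignores KV cache traffic, compute limits, and mixture-of-experts sparsity, so treat it as an optimistic ceiling rather than a benchmark). The function name and the example model sizes are illustrative, not taken from the linked repo:

```python
def estimate_decode_tokens_per_second(bandwidth_gb_s, params_billions, bits_per_weight):
    """Upper-bound decode speed for a dense model under a bandwidth-bound assumption.

    Assumes each generated token requires streaming every weight from memory once.
    """
    # Weight footprint in GB: billions of parameters * bytes per parameter.
    weights_gb = params_billions * bits_per_weight / 8
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    bandwidth = 128  # GB/s, the figure quoted in the comment above
    # Hypothetical dense model sizes and quantization levels, for illustration only.
    for params, bits in [(8, 4), (32, 4), (70, 4), (70, 8)]:
        rate = estimate_decode_tokens_per_second(bandwidth, params, bits)
        print(f"{params}B @ {bits}-bit: ~{rate:.1f} tokens/s ceiling")
```

	Under these assumptions a 70B model at 4-bit quantization (about 35 GB of weights) tops out near 3.7 tokens per second at 128 GB/s, which is in line with the single-digit figure above. Real throughput will usually land below the ceiling.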