by m0th87 370 days ago
That’s what I hope for, but every unified-memory machine that isn’t bananas expensive has very low memory bandwidth. DGX (Digits), Framework Desktop, and non-Ultra Macs are all around 128 GB/s, and will produce single-digit tokens per second for larger models: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...

So there’s a fundamental tradeoff between cost, inference speed, and hostable model size for the foreseeable future.
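
A rough back-of-envelope sketch of why bandwidth caps token rate, assuming a dense model at batch size 1 where every weight is read from memory once per generated token. The function name and the example model sizes are illustrative, not taken from the linked benchmarks, and real throughput will be lower because this ignores KV cache reads and compute:

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          params_billion: float,
                          bytes_per_param: float) -> float:
    """Upper bound on decode speed for a bandwidth-bound dense model.

    Assumption: each generated token requires streaming all weights
    from memory once, so tokens/s <= bandwidth / model size.
    """
    # billions of params * bytes per param = model size in GB
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    bandwidth = 128.0  # GB/s, the figure quoted in the comment above
    # (label, parameters in billions, bytes per parameter) -- illustrative sizes
    examples = [
        ("8B at 4-bit", 8, 0.5),
        ("70B at 4-bit", 70, 0.5),
        ("70B at 8-bit", 70, 1.0),
    ]
    for label, params_b, bpp in examples:
        rate = max_tokens_per_second(bandwidth, params_b, bpp)
        print(f"{label}: at most ~{rate:.1f} tokens/s")
```

Under those assumptions, a 70B model at 4-bit is about 35 GB of weights, so 128 GB/s can't do better than roughly 3 to 4 tokens per second no matter how fast the compute is.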