Hacker News
by msbhogavi 83 days ago
"As much memory as possible" is right for model capacity but misses bandwidth. Apple Silicon has distinct bandwidth tiers: M4 Pro at 273 GB/s, M4 Max at up to 546 GB/s (the lower-binned Max is 410 GB/s), and M3 Ultra at 819 GB/s (there is no M4 Ultra). Once the model fits in memory, decode is bandwidth-bound, so bandwidth sets tok/s. A 546 GB/s M4 Max gives you roughly 2x the decode speed of an M4 Pro on the same model.

For what Hypura does, the Max is the sweet spot. 64GB loads a 70B at Q4 (around 40GB of weights) with room left over for KV cache and the OS, and double the bandwidth of the Pro means generation is actually usable instead of just technically possible.
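
If you want to sanity-check that, here is a rough back-of-envelope sketch. It assumes decode reads every weight once per token, so bandwidth divided by model size gives a ceiling on tok/s, and it assumes about 4.5 bits per weight for a Q4-style quant (my guess at typical overhead for scales, not a measured figure). The function names are made up for illustration, and real throughput will land below these numbers.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed if each token streams all weights once."""
    return bandwidth_gb_s / model_gb


# Bandwidth figures from the comment above, in GB/s.
CHIPS = {"M4 Pro": 273, "M4 Max": 546}

# Assumption: 70B parameters at ~4.5 bits per weight.
model_gb = weights_gb(70, 4.5)
print(f"70B at Q4: ~{model_gb:.1f} GB of weights")

for name, bandwidth in CHIPS.items():
    ceiling = decode_ceiling_tok_s(bandwidth, model_gb)
    print(f"{name}: at most ~{ceiling:.1f} tok/s")
```

Under those assumptions the Pro tops out around 7 tok/s and the Max around 14 tok/s on a 70B, which is the gap between "technically possible" and "usable".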