by quietbuilder 93 days ago
A 44% cache hit rate is low. Over half the expert loads are cold reads off SSD, so at 1.4 GB/s effective bandwidth and ~1.8 GB of I/O per token, 4.74 tok/s checks out, but it'll drop with longer context or heavier reasoning.
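
For anyone who wants to sanity-check numbers like these, here's a rough sketch of the I/O-bound ceiling. It assumes decode speed is limited purely by SSD reads, with no overlap between I/O and compute. The function names and the way the hit rate is applied are my own assumptions, not anything from the project. Whether the result matches a measured number depends on how much of the per-token I/O actually comes off SSD rather than cache, which the figures above don't pin down.

```python
def cold_read_gb_per_token(expert_gb_per_token: float, cache_hit_rate: float) -> float:
    """GB that must come from SSD per token, assuming cache hits cost no SSD I/O.

    Both parameter names are hypothetical; plug in your own measurements.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return expert_gb_per_token * (1.0 - cache_hit_rate)


def io_bound_tokens_per_sec(ssd_bandwidth_gb_s: float, ssd_gb_per_token: float) -> float:
    """Upper bound on tokens/s when SSD reads are the only bottleneck."""
    if ssd_gb_per_token <= 0:
        raise ValueError("ssd_gb_per_token must be positive")
    return ssd_bandwidth_gb_s / ssd_gb_per_token


if __name__ == "__main__":
    # Illustrative inputs only, not measurements from the run discussed here.
    cold_gb = cold_read_gb_per_token(expert_gb_per_token=2.0, cache_hit_rate=0.5)
    print(io_bound_tokens_per_sec(ssd_bandwidth_gb_s=2.0, ssd_gb_per_token=cold_gb))
```

Under this model the ceiling scales inversely with cold reads per token, so anything that lowers the hit rate lowers throughput proportionally.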

Running 397B on consumer hardware is genuinely impressive for a proof of concept. A year ago this wasn't a thing. But I keep wondering whether a well-quantized 70B that fits entirely in RAM would just be faster in practice. No I/O bottleneck, consistent throughput, smaller model but actually usable.