by druide67 86 days ago
The finding about removing the 9.8 GB Metal LRU cache for a 38% speedup is the most interesting part. Same lesson as PostgreSQL's advice against application-level buffer pools that compete with the OS page cache: the hardware memory compressor doing 130K decompressions/sec was pure overhead.

Curious about the remaining gap: 5.7 tok/s vs 18.6 theoretical (from SSD bandwidth). Is the ~70% overhead mostly GPU compute on non-expert layers (attention, norm), or is there I/O scheduling room left?
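
For reference, a rough back-of-envelope on where the ~70% comes from. It uses only the two throughput figures above; the per-token time split assumes SSD reads and everything else are fully serialized with no overlap, which is my assumption and not something the post states.

```python
# Back-of-envelope for the gap between measured and SSD-bound throughput.
# Only the two tok/s figures come from the thread; the serialized-time
# split below is an assumption (no overlap between I/O and compute).

measured_tok_s = 5.7      # reported throughput
theoretical_tok_s = 18.6  # ceiling implied by SSD bandwidth

efficiency = measured_tok_s / theoretical_tok_s
shortfall = 1.0 - efficiency

ms_per_token_measured = 1000.0 / measured_tok_s
ms_per_token_io_bound = 1000.0 / theoretical_tok_s
# Assumption: whatever is not SSD time is compute, scheduling, or stalls.
non_io_ms = ms_per_token_measured - ms_per_token_io_bound

print(f"fraction of SSD-bound ceiling reached: {efficiency:.1%}")
print(f"shortfall vs ceiling: {shortfall:.1%}")
print(f"per token: {ms_per_token_measured:.1f} ms measured, "
      f"{ms_per_token_io_bound:.1f} ms if purely SSD-bound, "
      f"{non_io_ms:.1f} ms unaccounted for")
```

Under that assumption, the unaccounted time per token is what would need to be attributed to either non-expert GPU work or I/O scheduling.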