| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by druide67 86 days ago
	The finding about removing the 9.8 GB Metal LRU cache for a 38% speedup is the most interesting part. Same lesson as PostgreSQL's advice against application-level buffer pools that compete with the OS page cache : the hardware memory compressor doing 130K decompressions/sec was pure overhead. Curious about the remaining gap: 5.7 tok/s vs 18.6 theoretical (from SSD bandwidth). Is the ~70% overhead mostly GPU compute on non-expert layers (attention, norm), or is there I/O scheduling room left?