by aurareturn 332 days ago
Probably because they are loading the entire model into SRAM. That's how they can achieve 1.5k tokens/s.
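
To make the reasoning concrete, here is a rough back-of-envelope sketch. It assumes single-stream decoding is memory-bandwidth bound, i.e. every weight is read once per generated token, so tokens/s is capped at roughly bandwidth divided by model size. The parameter count, precision, and bandwidth figures below are made-up illustrative values, not specs of any particular chip or model.

```python
# Back-of-envelope bound for batch-1 decoding, assuming every weight is
# read from memory once per generated token (bandwidth-bound regime).

def max_decode_tokens_per_s(param_count, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Upper bound on tokens/s if weight reads are the only cost."""
    model_bytes = param_count * bytes_per_param
    return mem_bandwidth_bytes_per_s / model_bytes


def required_bandwidth_bytes_per_s(param_count, bytes_per_param, tokens_per_s):
    """Memory bandwidth needed to sustain a target tokens/s under the same assumption."""
    return param_count * bytes_per_param * tokens_per_s


# Hypothetical numbers, chosen only for illustration.
PARAMS = 8e9            # assumed 8B-parameter model
BYTES_PER_PARAM = 2     # assumed 16-bit weights
OFFCHIP_BW = 3e12       # assumed 3 TB/s off-chip memory bandwidth
TARGET_TOKENS_PER_S = 1500

bound = max_decode_tokens_per_s(PARAMS, BYTES_PER_PARAM, OFFCHIP_BW)
needed = required_bandwidth_bytes_per_s(PARAMS, BYTES_PER_PARAM, TARGET_TOKENS_PER_S)

print(f"off-chip bound: {bound:.1f} tokens/s")
print(f"bandwidth needed for {TARGET_TOKENS_PER_S} tokens/s: {needed / 1e12:.1f} TB/s")
```

Under those assumed numbers, off-chip memory tops out well below 1.5k tokens/s, and hitting that rate needs several times more bandwidth, which is the kind of gap on-chip SRAM could close. It ignores batching, KV cache reads, and compute limits.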