by PhilippGille 3 days ago
The interesting bits on how they achieved it:

> On the model side, we applied FP4 quantization

> introduced DFlash, an efficient speculative decoding method based on block-level masked parallel prediction

> On the system side, TileRT perfectly adapts to the dynamic characteristics of these algorithms

> 1000+ tokens/s output [...] using just a single standard 8-GPU commodity node