by PhilippGille
3 days ago
The interesting bits on how they achieved it:

> On the model side, we applied FP4 quantization

> introduced DFlash, an efficient speculative decoding method based on block-level masked parallel prediction

> On the system side, TileRT perfectly adapts to the dynamic characteristics of these algorithms

> 1000+ tokens/s output [...] using just a single standard 8-GPU commodity node
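For anyone unfamiliar with the FP4 part, here is a rough sketch of what 4-bit float quantization does to a block of weights. It assumes the common E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) with one shared scale per block. The quoted post does not say which FP4 variant or scaling scheme was used, so `FP4_E2M1`, `quantize_block`, and `dequantize_block` are illustrative names, not anything from the project.

```python
# Magnitudes representable in E2M1 FP4. Assumption: the quoted post does not
# specify which FP4 variant was used; E2M1 is just a common choice.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(values):
    """Map a block of floats to (scale, codes) with one shared scale.

    Each code is (is_negative, index into FP4_E2M1), i.e. 1 sign bit plus
    3 bits selecting the magnitude: 4 bits per value.
    """
    amax = max((abs(v) for v in values), default=0.0)
    # Scale so the largest magnitude in the block lands on 6.0, the FP4 maximum.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        mag = abs(v) / scale
        # Nearest representable magnitude; ties go to the smaller value here,
        # which is simpler than the round-to-nearest-even real hardware uses.
        idx = min(range(len(FP4_E2M1)), key=lambda i: abs(FP4_E2M1[i] - mag))
        codes.append((v < 0, idx))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from the shared scale and 4-bit codes."""
    return [(-1.0 if neg else 1.0) * FP4_E2M1[idx] * scale for neg, idx in codes]
```

With only eight magnitudes per sign, the per-block scale does most of the work of preserving dynamic range, which is why block size and scale format matter so much in practice.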