|
|
|
|
|
by ein0p
419 days ago
|
|
Note that this is _way_ slower at small batch sizes you'd need for interactive use. At batch size 1 this seems to run at 1/3rd the speed of bf16 (so about 1/6th the speed of fp8 you'd realistically be using) if figure 5 is to be believed. This is actually a pretty impressive feat in itself if you know anything about GPU kernel programming, but it is much slower nevertheless. For this to work at "wire speed" it'd need hardware support, which takes years. Their "baseline" elsewhere in the paper is CPU offloading, which is dog slow and can't be made fast due to PCIe bottleneck. |
|