|
|
|
|
|
by storystarling
141 days ago
|
|
The killer app here is likely LLM inference loops. Currently you pay a PCIe latency penalty for every single token generated because the CPU has to handle the sampling and control logic. Moving that logic to the GPU and keeping the whole generation loop local avoids that round trip, which turns out to be a major bottleneck for interactive latency. |
|