Hacker News
by littlestymaar 489 days ago
> Did they get the first token out? ;)

Surprisingly, it's not *that* bad, with 3t/s for the quantized models: https://www.reddit.com/r/LocalLLaMA/comments/1in9qsg/boostin...

> NVidia ported it, and they claim almost 4 tokens/sec on 8xH100 server.

What? That sounds ridiculously low. Someone just got 5.8t/s out of a single 3090 + CPU/RAM using the KTransformers inference library: https://www.reddit.com/r/LocalLLaMA/comments/1iq6ngx/ktransf...