Surprisingly, it's not *that* bad, with 3 t/s for the quantized models: https://www.reddit.com/r/LocalLLaMA/comments/1in9qsg/boostin...
> NVidia ported it, and they claim almost 4 tokens/sec on 8xH100 server.
What? That sounds ridiculously low. Someone just got 5.8 t/s out of only one 3090 + CPU/RAM using the KTransformers inference library: https://www.reddit.com/r/LocalLLaMA/comments/1iq6ngx/ktransf...