Hacker News
by Der_Einzige 1062 days ago
Any data on inference speed? I've found that the non-quantized model was much faster on GPU than the quantized versions, apparently because the quantized versions get lower GPU utilization.
1 comment

It's a memory tradeoff: quantization shrinks the weights so the model fits in less GPU RAM. If you have enough GPU RAM to load the non-quantized model, it may be faster.
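
To put rough numbers on the memory side of that tradeoff, here is a back-of-the-envelope sketch. It only estimates weight storage as parameters times bits per weight. The 7B parameter count, the 24 GiB card, and the 2 GiB overhead allowance are made-up illustration values, not figures from this thread. It ignores activations and the KV cache, and it says nothing about speed.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def fits_on_gpu(n_params: float, bits_per_weight: float,
                vram_gib: float, overhead_gib: float = 2.0) -> bool:
    """True if the weights plus an assumed fixed overhead fit in VRAM.

    overhead_gib is a hypothetical allowance for activations, KV cache,
    and framework buffers; real usage varies with batch size and context.
    """
    return weight_memory_gib(n_params, bits_per_weight) + overhead_gib <= vram_gib


n_params = 7e9    # hypothetical 7B-parameter model
vram_gib = 24.0   # hypothetical 24 GiB card

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    size = weight_memory_gib(n_params, bits)
    verdict = "fits" if fits_on_gpu(n_params, bits, vram_gib) else "does not fit"
    print(f"{label}: ~{size:.1f} GiB of weights, {verdict} in {vram_gib:.0f} GiB")
```

If the fp16 line already fits on your card, quantizing buys you headroom rather than a guaranteed speedup, so it is worth benchmarking both.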