by hnuser123456
286 days ago
With quantization-aware training techniques, q4 models are less than 1% off from bf16 models. And yes, if your use case hinges on the very latest and largest cloud-scale models, there are things they can do that the local ones just can't. But having those cloud models spit out tokens 24/7 for you would cost enough to pay off a whole enterprise-scale GPU in a few months, too. If anyone has a gaming GPU with gobs of VRAM, I highly encourage them to experiment with creating long-running local-LLM apps. We need more independent tinkering in this space.
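For anyone who wants to sanity-check the "paying off a GPU in a few months" claim against their own situation, here is a rough back-of-the-envelope sketch. The function name, its parameters, and every number in the example call are hypothetical placeholders, not real prices or measured throughput; plug in your own figures.

```python
def breakeven_days(gpu_cost_usd, tokens_per_second, cloud_usd_per_million_tokens,
                   power_watts=0.0, usd_per_kwh=0.0):
    """Days of 24/7 generation until avoided cloud spend equals the GPU price.

    All inputs are assumptions supplied by the caller. Returns infinity if
    running locally does not save money per day.
    """
    tokens_per_day = tokens_per_second * 86_400  # seconds in a day
    cloud_cost_per_day = tokens_per_day / 1_000_000 * cloud_usd_per_million_tokens
    power_cost_per_day = power_watts / 1000 * 24 * usd_per_kwh  # kWh/day * price
    daily_savings = cloud_cost_per_day - power_cost_per_day
    if daily_savings <= 0:
        return float("inf")
    return gpu_cost_usd / daily_savings


if __name__ == "__main__":
    # Placeholder numbers only, chosen for illustration.
    days = breakeven_days(
        gpu_cost_usd=30_000,
        tokens_per_second=1_000,
        cloud_usd_per_million_tokens=5.0,
        power_watts=700,
        usd_per_kwh=0.15,
    )
    print(f"break-even after roughly {days:.0f} days")
```

The result is only as good as the inputs: it assumes the GPU really is kept busy around the clock and ignores hosting, cooling, and the quality gap between local and cloud models.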
|
Again, what's the use case? What would make sense to run, at high rates, where output quality isn't much of a concern? I'm genuinely interested in this question, because it always seems to go unanswered.