|
|
|
|
|
by genewitch
480 days ago
|
|
I hate to be this person but the system prompt matters. The model size matters. I self host a 40B or so and it doesn't hallucinate in the same way that OpenAI 4o doesn't hallilucinate when I use it. Small models are incredibly impressive but require a lot more attention to how you interact with it. There are tools like aider that can take advantage of the speed of smaller models and have a larger model check for obvious BS. I think this idea got spread because at least deepseek qwen distilled and llama support this now you can use a 20GB llama and pair it with a 1.5B parameter model and it screams. The small model usually manages 30-50% of the total output tokens, with the rest corrected by the large model. This results in a ~30-50% speedup, ostensibly. I haven't literally compared but it is a lot faster than it was for barely any more memory commit. |
|