
I hate to be this person but the system prompt matters. The model size matters.

I self-host a 40B or so model, and it doesn't hallucinate, in the same way that OpenAI's 4o doesn't hallucinate when I use it.

Small models are incredibly impressive but require a lot more attention to how you interact with them. There are tools like aider that can take advantage of the speed of smaller models and have a larger model check for obvious BS.

I think this idea spread because at least the DeepSeek Qwen distills and Llama support this now: you can pair a 20GB Llama with a 1.5B-parameter model and it screams. The small model usually manages 30-50% of the total output tokens, with the rest corrected by the large model.
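
For anyone who hasn't seen how this pairing works: I believe it is what is usually called speculative decoding (that name is my reading, not something stated above). Here is a toy sketch of the greedy version, assuming both models are plain next-token functions. `toy_target`, `toy_draft`, and `WORDS` are made-up stand-ins, not real model APIs, and a real runtime would verify all drafted tokens in one batched pass of the large model instead of a Python loop.

```python
from typing import Callable, Dict, List, Tuple

Token = str
NextToken = Callable[[List[Token]], Token]


def speculative_generate(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[Token], Dict[str, int]]:
    """Greedy draft-and-verify loop (toy version).

    The draft model proposes up to k tokens; the target model keeps the
    longest prefix it agrees with and replaces the first mismatch with
    its own token. The output is identical to running the target alone.
    """
    out = list(prompt)
    stats = {"draft_accepted": 0, "target_emitted": 0, "target_passes": 0}
    produced = 0
    while produced < max_new_tokens:
        n = min(k, max_new_tokens - produced)

        # 1. The small model drafts n tokens cheaply.
        ctx = list(out)
        proposal: List[Token] = []
        for _ in range(n):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)

        # 2. The large model checks the draft. Counted as one pass here
        #    because real runtimes score all n positions in a single batch.
        stats["target_passes"] += 1
        for tok in proposal:
            expected = target_next(out)
            if tok == expected:
                out.append(tok)
                stats["draft_accepted"] += 1
                produced += 1
            else:
                out.append(expected)  # large model corrects the draft
                stats["target_emitted"] += 1
                produced += 1
                break
    return out[len(prompt):], stats


# Hypothetical stand-in "models": the target recites a fixed sentence,
# the draft gets every third position wrong.
WORDS = "the quick brown fox jumps over the lazy dog".split()


def toy_target(ctx: List[Token]) -> Token:
    return WORDS[len(ctx) % len(WORDS)]


def toy_draft(ctx: List[Token]) -> Token:
    return "???" if len(ctx) % 3 == 2 else toy_target(ctx)


tokens, stats = speculative_generate(toy_target, toy_draft, [], max_new_tokens=9)
print(tokens, stats)
```

The point of the design is that the output never gets worse than the large model on its own; the small model only changes how many large-model passes you pay for.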

This ostensibly results in a ~30-50% speedup. I haven't literally measured the two side by side, but it is a lot faster than it was, for barely any more memory commit.
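
A back-of-envelope way to sanity-check that number, under assumptions that are mine and not measured: the large model needs roughly one pass per token the small model did not supply, a verification pass costs about the same as a normal single-token pass, and the small model's own cost is folded into a hypothetical `draft_cost_ratio`.

```python
def estimated_time_fraction(draft_share: float, draft_cost_ratio: float = 0.0) -> float:
    """Rough fraction of the baseline generation time.

    draft_share: fraction of output tokens supplied by the small model.
    draft_cost_ratio: assumed small-model cost per output token, relative
        to one large-model pass (0.0 means "treat the draft as free").
    """
    if not 0.0 <= draft_share < 1.0:
        raise ValueError("draft_share must be in [0, 1)")
    return (1.0 - draft_share) + draft_cost_ratio


for share in (0.3, 0.5):
    frac = estimated_time_fraction(share)
    print(f"draft share {share:.0%}: ~{frac:.0%} of baseline time, "
          f"~{1 / frac:.2f}x throughput")
```

Under this crude model a 30-50% draft share lines up with roughly 30-50% less wall-clock time, and any nonzero draft cost eats into that.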