|
|
|
|
|
by oxcabe
486 days ago
|
|
Thanks for such a complete reply. I've locally tried ollama with the models and sizes you mention on a MacBook with M3 Pro chip. It often hallucinated, used a lot of battery and increased the hardware temperature substantially as well.
(Still, I'd argue I didn't put much time into configuring it, which could've solve the hallucinations) Ideally, we should all have accesss to local, offline, private LLM usage, but hardware contraints are the biggest limiter right now. FWIW, a controlled (running in hardware you own, local or not) agent with the aforementioned characteristics could be applied as a "proxy" that filters out or redacts specific parts of your data to avoid sharing information you don't want others to have. Having said this, you wouldn't be able to integrate such system on a product like this unless you also make some sort of proxy gmail account serving as a computed, privacy controlled version of your original account. |
|
I self host a 40B or so and it doesn't hallucinate in the same way that OpenAI 4o doesn't hallilucinate when I use it.
Small models are incredibly impressive but require a lot more attention to how you interact with it. There are tools like aider that can take advantage of the speed of smaller models and have a larger model check for obvious BS.
I think this idea got spread because at least deepseek qwen distilled and llama support this now you can use a 20GB llama and pair it with a 1.5B parameter model and it screams. The small model usually manages 30-50% of the total output tokens, with the rest corrected by the large model.
This results in a ~30-50% speedup, ostensibly. I haven't literally compared but it is a lot faster than it was for barely any more memory commit.