Hacker News new | ask | show | jobs
by roadside_picnic 2 hours ago
See my comment to parent. I've been using local LLMs for practical, personal tasks for a few months now very successfuly.

You can run fantastic local models if you have either:

- M-series Apple device with ideally >= 24GB of VRAM

- RTX [345]090 GPU

I'm fortunate enough to have both and use an M-series laptop as basically a persistent server (I don't use it much and when traveling typically just use my work laptop). My desktop doesn't act as a persitent server but I fire up llama.cpp on it all time for quick chat sessions.

If you have one of the above devices and can dedicate it as server there are additional layers of tooling you can use that dramatically improve the experience. In particular Open WebUI allows you to add tons of useful tools (image gen, web search, code eval, etc), and agent harnesses like Hermes can make the current gen small models very capable. I have an agent in chat on my phone that basically handles all the sys-admin for the server it runs on.

1 comments

What about RTX 3080? Too little VRAM?
In addition to models getting better, the quantization methods have also got much better. If you already have an RTX 3080 it's absolutely worth the time to just mess around and see how it does, experiment with different quants that fit in your VRAM. If you're purchasing I would recommend coughing up the extra cash for the 3090.

If you are experimenting it's worth mentioning that the harness/tooling is very important to getting a solid experience. Herme's agent is great for running helpful agents and OpenWeb UI can get really make the experience feel on par with paid chat interfaced.

A reasonable halfway step is to pay for an open model through the provider or open router. You'll get many of the benefits (especially around pricing) without needing to shell out on hardware before deciding if you like the way these models work.