by neutrinobro 290 days ago
I have an old system with 3 ancient Tesla K40s which can easily run inference on ~30B parameter models (e.g. qwen3-coder:30b). I mostly use it as a compute box for other workloads, but it's not completely incapable of some AI-assisted coding. It is power hungry though, and the recent spike in local electricity rates is enough of an excuse to keep it off most of the time.
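For a rough sense of why a ~30B model fits on three of these cards, here is a back-of-the-envelope sketch. It assumes 12 GB per K40, 4-bit quantized weights, and an arbitrary 20% allowance for KV cache and runtime buffers. None of those figures come from the post, and the function names are made up for illustration.

```python
# Rough estimate only. Assumptions (not from the post): 12 GB per K40,
# 4-bit quantized weights, and a hypothetical 20% overhead allowance.

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         gpus: int, gb_per_gpu: float, overhead: float = 0.20) -> bool:
    """True if weights plus the overhead allowance fit in total VRAM."""
    needed = weights_gb(params_billion, bits_per_weight) * (1 + overhead)
    return needed <= gpus * gb_per_gpu


if __name__ == "__main__":
    print(weights_gb(30, 4))    # weight size for a 30B model at 4 bits
    print(fits(30, 4, 3, 12))   # 4-bit weights across three 12 GB cards
    print(fits(30, 16, 3, 12))  # same model at 16-bit precision
```

Under these assumptions the 4-bit weights take about half of the combined VRAM, while the same model at 16-bit precision would not fit. Real usage depends on the quantization format and context length.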

I'm surprised the accelerators-of-yore trick actually worked. Is balancing a trio only trivially more difficult than a duo? I enjoy the idea of having tons of VRAM and system RAM, loading a big model, and getting responses a few times per hour, as long as they're high quality.
Yeah, I was equally surprised. I am using a patched version of ollama (https://github.com/austinksmith/ollama37) to run the models. It has a trivial change to allow it to run on GPUs with old CUDA compute capabilities (3.5, 3.7). Obviously this was before tensor cores were a thing, so you're not going to be blown away by the performance, but it was cheap. I got three K40s for $75 on eBay. They are passively cooled, so they do need to be in a server chassis.