Hacker News
by LoganDark 21 hours ago
I have 128 GB of unified memory (M4 Max) and the user experience with local inference is still pretty bad. I'm so glad something like llama.cpp exists so I don't have to wrangle Python (which I hate), but OpenCode is entirely disrespectful of the KV-cache, so I had to switch to Pi (which is actually going relatively well).
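For readers wondering why a client's handling of the KV-cache matters so much, here is a rough sketch. It assumes the inference server can only reuse the longest common token prefix between what it has cached and the new prompt, which is how prefix caching generally works. The token IDs and the `reusable_prefix` helper are made up for illustration and say nothing about how OpenCode or Pi actually build their prompts.

```python
def reusable_prefix(cached: list[int], new_prompt: list[int]) -> int:
    """Count the leading tokens shared by the cached sequence and the new prompt."""
    n = 0
    for a, b in zip(cached, new_prompt):
        if a != b:
            break
        n += 1
    return n


# Hypothetical token IDs: a 4-token system prompt followed by conversation turns.
system = [1, 2, 3, 4]
turn_1 = [10, 11, 12]
turn_2 = [20, 21]

# What the server has already processed after the first turn.
cached = system + turn_1

# A client that only appends: the new prompt extends the cached one.
append_only = system + turn_1 + turn_2

# A client that rewrites something early in the prompt (assumed example:
# one token in the system prompt changes between requests).
rewritten = [1, 2, 99, 4] + turn_1 + turn_2

for name, prompt in [("append-only", append_only), ("rewritten", rewritten)]:
    kept = reusable_prefix(cached, prompt)
    print(f"{name}: reuse {kept} tokens, re-process {len(prompt) - kept}")
```

With a real context of tens of thousands of tokens, the second case means paying for prompt processing over nearly the whole context on every request, which is especially painful on local hardware.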

Even so, I can't really run at hundreds of tokens per second, which is practically table stakes for my work. Even if I did manage to run that fast, the model would probably be completely braindead and stomp all over the task.

Wish I could afford an M5 Max but I've been between jobs for months without even a single interview. Sucks to be a developer these days.

1 comment

Try Kilocode with DeepSeek v4 (via the API directly from DeepSeek, which is much cheaper than going through Kilo).

I have had very good results and compared to others it just costs pennies.

I use a setup similar to https://github.com/ScotterMonk/AgentAutoFlow and switch between DeepSeek v4 and Flash depending on the task.

DeepSeek Flash v4 actually runs on 128 GB systems (about 14 tok/sec). Antirez created a fabulous 2-bit quant and a highly tuned LLM server:

https://github.com/antirez/ds4
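To put the 14 tok/sec figure next to the "hundreds of tokens per second" mentioned in the original post, here is a back-of-the-envelope sketch. The 2000-token response length and the 200 tok/sec comparison rate are assumptions picked for illustration, and the calculation ignores prompt processing time entirely.

```python
def generation_seconds(tokens: int, tok_per_sec: float) -> float:
    """Time to generate `tokens` output tokens at a steady decode rate."""
    return tokens / tok_per_sec


response_tokens = 2000  # assumed length of one agent response

# 14 tok/sec is the figure quoted above; 200 tok/sec is an assumed
# stand-in for "hundreds of tokens per second".
for rate in (14, 200):
    secs = generation_seconds(response_tokens, rate)
    print(f"{rate} tok/sec: {secs:.0f} s per response")
```

Under those assumptions, a single response takes a couple of minutes locally versus about ten seconds at the faster rate, and an agent loop multiplies that gap by the number of steps.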

I do use DeepSeek, and it's exceptionally cheap! Inference is slow though, and it's not particularly intelligent, but the experience is better than local inference.