

by zozbot234 6 days ago

> As far as I can tell you'll have a context limit of about 64k

Source? The author has demoed a 100k ctx already, and I can't think of a reason why more wouldn't be supported. RAM is a bit tight but that only matters with really long contexts on DeepSeek V4, and proper support for SSD streaming would address this anyway.

BTW, the official support is now merged too.

2 comments

SwellJoe 6 days ago

OK, I just tried it with the new mainline ROCm and MTP support, and it is faster, but still uncomfortably slow for interactive coding agent use. It does about 14-15 t/s, which is faster than the 10-11 t/s I was seeing before, but still a crawl. I set it loose on a small 300-line Perl file, and it's still chewing several minutes later.

So, it's super cool that such a solid model can run locally and it's probably useful for batched work overnight. But, I'm not going to sit around twiddling my thumbs while working. I think I can write code by hand faster than this. I'll gladly pay for a cloud model so I don't have to wait (especially since DeepSeek models are so cheap).


zozbot234 6 days ago

Well, that performance figure seems consistent with memory bandwidth on that machine (and its upcoming successor Gorgon Halo; Medusa Halo is projected to be faster) and even on DGX/RTX Spark. You'd get the same outcome on Apple Silicon Mn Pro (not Max or Ultra) if there were one with enough memory capacity. It's likely possible to raise aggregate tok/s on Strix Halo or DGX/RTX Spark (not realistically on Apple Silicon, at least not on a single machine) by batching multiple inference flows together, but that's admittedly a bit fiddly to implement and not what you're interested in anyway.

It seems that you'll want either top-of-the-line Apple Silicon (Max/Ultra) or cloud inference, which will always be super competitive if your focus is on low latency.
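
For anyone who wants to sanity-check the memory-bandwidth argument, here is a rough sketch of the usual estimate: single-stream decoding is assumed to be memory-bound, so each generated token costs roughly one read of the active weights. The function name and every number in the example call are placeholders, not measured specs for any machine or model in this thread, and the `efficiency` factor is an assumed fudge term for real-world overhead.

```python
def decode_tokens_per_second(bandwidth_gb_s: float,
                             active_params_billions: float,
                             bytes_per_param: float,
                             efficiency: float = 1.0) -> float:
    """Rough ceiling on single-stream decode speed when memory-bound.

    Assumes every generated token requires reading all *active* weights
    once from memory. Ignores KV-cache reads, compute limits, and any
    speculative decoding gains, so treat the result as a ballpark only.
    """
    if bandwidth_gb_s <= 0 or active_params_billions <= 0 or bytes_per_param <= 0:
        raise ValueError("bandwidth, parameter count, and bytes per parameter must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")

    # Billions of parameters times bytes per parameter gives GB read per token.
    gb_read_per_token = active_params_billions * bytes_per_param
    return efficiency * bandwidth_gb_s / gb_read_per_token


# Placeholder inputs: 250 GB/s of bandwidth, 20B active parameters,
# 4-bit weights (0.5 bytes per parameter), perfect efficiency.
estimate = decode_tokens_per_second(250, 20, 0.5)
print(f"{estimate:.1f} tok/s")
```

Plugging in the real bandwidth and the model's actual active parameter count and quantization is left to the reader; the point is only that the ratio, not raw compute, sets the ceiling for one stream.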


SwellJoe 6 days ago

No source, just back of the envelope math. 100k seems optimistic, but I guess I'll try it and see. That would be usable for at least a few use cases, including the security scanning work I'm focused on at the moment (at least, so far, the peak token usage has been 90k, which would make 100k tight but probably fine).
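
A back-of-the-envelope check along these lines might look like the sketch below. It uses the textbook KV-cache formula for standard (multi-head or grouped-query) attention, which is an assumption here: models with a compressed attention cache, as in DeepSeek's earlier published architectures, can need far less memory. The layer, head, and dimension values are hypothetical placeholders; only the 90k peak and 100k limit come from the comment above.

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """KV-cache size in GiB for standard attention (not a compressed cache).

    The leading factor of 2 accounts for storing one key and one value
    vector per token, per KV head, per layer.
    """
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 2**30


def context_headroom(limit_tokens: int, peak_tokens: int) -> float:
    """Fraction of the context window left unused at peak usage."""
    if limit_tokens <= 0:
        raise ValueError("limit_tokens must be positive")
    return (limit_tokens - peak_tokens) / limit_tokens


# Hypothetical model shape, purely for illustration.
print(f"{kv_cache_gib(100_000, layers=32, kv_heads=8, head_dim=128):.2f} GiB")

# Figures from the comment: 90k peak usage against a 100k limit.
print(f"{context_headroom(100_000, 90_000):.0%} headroom")
```

With a 90k peak against a 100k limit, the margin is a tenth of the window, which matches the "tight but probably fine" read.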
