| HN Mirror



	by Yukonv 114 days ago
	Unsloth quantizations are available at release as well. [0] The IQ4_XS quant is a massive 361 GB for the 754B parameters. This is definitely not a model your average local LLM enthusiast will be able to run, even with high-end hardware. [0] https://huggingface.co/unsloth/GLM-5.1-GGUF
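
For a sense of scale, the two figures above imply an average storage cost per weight. This is a rough sketch, assuming "GB" means 10^9 bytes and that the whole file is weights (metadata and any tensors kept at higher precision are ignored); the function name is just for illustration.

```python
def bits_per_weight(file_size_bytes: float, n_params: float) -> float:
    """Average number of bits stored per parameter."""
    return file_size_bytes * 8 / n_params

FILE_SIZE_BYTES = 361e9   # 361 GB, assuming decimal gigabytes
N_PARAMS = 754e9          # 754B parameters

bpw = bits_per_weight(FILE_SIZE_BYTES, N_PARAMS)
print(f"{bpw:.2f} bits per weight on average")
```

If the 361 figure were GiB instead, the average would come out a little above 4 bits.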


zozbot234 114 days ago

SSD offload is always a possibility with good software support. Of course you might easily object that the model would not be "running" then, more like crawling. Still, you'd be able to execute it locally and get it to respond after some time.

Meanwhile we're even seeing emerging 'engram' and 'inner-layer embedding parameters' techniques where the possibility of SSD offload is planned for in advance when developing the architecture.
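
A back-of-envelope sketch of how slow "crawling" might be, treating SSD read bandwidth as the only bottleneck. The 7 GB/s bandwidth and the fraction of weights touched per token are hypothetical numbers chosen for illustration, not measurements of this model.

```python
def seconds_per_token(bytes_read_per_token: float, ssd_bytes_per_sec: float) -> float:
    """Lower bound on per-token latency when weights are streamed from SSD."""
    return bytes_read_per_token / ssd_bytes_per_sec

MODEL_BYTES = 361e9        # IQ4_XS size quoted upthread
SSD_BYTES_PER_SEC = 7e9    # assumption: fast NVMe sequential read
ACTIVE_FRACTION = 0.05     # assumption: hypothetical share of weights touched per token

# Worst case: every weight is read from SSD for every token.
worst_case = seconds_per_token(MODEL_BYTES, SSD_BYTES_PER_SEC)
# Sparse case: only a fraction of the weights is needed per token.
sparse_case = seconds_per_token(MODEL_BYTES * ACTIVE_FRACTION, SSD_BYTES_PER_SEC)

print(f"full read per token:   {worst_case:.1f} s")
print(f"sparse read per token: {sparse_case:.1f} s")
```

Real numbers would depend on how much fits in RAM, on random-read performance, and on compute speed, none of which this estimate models.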


adrian_b 114 days ago

For conversational purposes that may be too slow, but as a coding assistant this should work, especially if many tasks are batched, so that they may progress simultaneously through a single pass over the SSD data.
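
The batching argument can be sketched as follows: if one pass over the SSD data advances every batched task by one token, aggregate throughput grows with the batch size while per-task latency stays the same. This assumes the SSD is the only bottleneck and that compute and memory keep up; the 7 GB/s bandwidth is a hypothetical figure.

```python
def aggregate_tokens_per_sec(batch_size: int, pass_seconds: float) -> float:
    """Tokens/sec across all tasks when one SSD pass yields one token per task."""
    return batch_size / pass_seconds

# Assumption: a full pass over 361 GB at a hypothetical 7 GB/s.
PASS_SECONDS = 361e9 / 7e9

for batch_size in (1, 4, 16):
    rate = aggregate_tokens_per_sec(batch_size, PASS_SECONDS)
    print(f"batch={batch_size:2d}: {rate * 3600:7.0f} tokens/hour in aggregate")
```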


QuantumNomad_ 114 days ago

Three hour coffee break while the LLM prepares scaffolding for the project.


pbhjpbhj 114 days ago

Like computing used to be. When I first compiled a Linux kernel it ran overnight on a Pentium-S. I had little idea what I was doing, probably compiled all the modules by mistake.


stingraycharles 114 days ago

I remember that time, when compiling Linux kernels was measured in hours. Then multi-core computing arrived, and after a few years it was down to 10 minutes.

With LLMs it feels more like the old punchcards, though.


drowsspa 114 days ago

At least the compiler was free


adrian_b 114 days ago

The point of doing local inference with huge models stored on an SSD is to do it free, even if slow.


tempoponet 113 days ago

Rather, imagine you have 2-3 of these working 24/7 on top of what you're doing today. What does your backlog look like a month from now?


zozbot234 114 days ago

Batching many disparate tasks together is good for compute efficiency, but makes it harder to keep the full KV-cache for each in RAM. You could handle this in an emergency by dumping some of that KV-cache to storage (this is how prompt caching works too, AIUI) and loading it back on demand, but that adds a lot more overhead than just offloading sparsely-used experts, since the KV-cache is far more heavily accessed.
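
To illustrate the RAM pressure, here is a minimal sketch using the usual KV-cache size formula for standard multi-head or grouped-query attention. Every model dimension below is hypothetical and not taken from the model discussed here, which may use a different attention scheme with a smaller cache.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumption: hypothetical dimensions, 16-bit cache entries, 32k-token context.
per_task = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, seq_len=32768)

for batch_size in (1, 4, 16):
    total_gib = batch_size * per_task / 2**30
    print(f"batch={batch_size:2d}: {total_gib:6.1f} GiB of KV-cache")
```

Since each disparate task needs its own cache, the total grows linearly with the batch size, which is the tension with batching described above.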
