Hacker News
by zozbot234 77 days ago
You can almost always use less RAM by making inference slower. Streaming the active MoE expert weights from SSD is an especially effective variety of this, but even with a large dense model that is too big for your RAM, you could run inference layer by layer (perhaps loading only a few layers at a time). You still need to store the KV-cache, but that takes only modest space and, at least for ordinary transformers (no linear attention tricks), is append-only, which fits well with writing it to SSD (AIUI, this is also how "cached" prompts/conversations work under the hood).
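To make the layer-wise idea concrete, here is a rough toy sketch in Python/NumPy, not how any real inference engine does it. Everything in it is an assumption for illustration: each "layer" is a single weight matrix in its own `.npy` file, the nonlinearity is `tanh`, and the helper names (`save_layers`, `forward_streamed`, `append_kv`, `load_kv`) are made up. It shows two things: only `group_size` layers' weights are resident at a time, and the KV-cache is written to disk purely by appending.

```python
import os
import tempfile
import numpy as np

def save_layers(dirpath, n_layers, dim, seed=0):
    """Write one toy weight matrix per layer to its own file (stand-in for a checkpoint)."""
    rng = np.random.default_rng(seed)
    paths = []
    for i in range(n_layers):
        w = (rng.standard_normal((dim, dim)) / np.sqrt(dim)).astype(np.float32)
        path = os.path.join(dirpath, f"layer_{i}.npy")
        np.save(path, w)
        paths.append(path)
    return paths

def forward_streamed(x, layer_paths, group_size=1):
    """Apply the layers in order, holding at most `group_size` weight matrices in RAM."""
    for start in range(0, len(layer_paths), group_size):
        # Load only this group of layers from disk.
        group = [np.load(p) for p in layer_paths[start:start + group_size]]
        for w in group:
            x = np.tanh(x @ w)
        del group  # drop the weights before the next group is loaded
    return x

def forward_resident(x, layer_paths):
    """Baseline: load every layer up front (needs RAM for the whole model)."""
    weights = [np.load(p) for p in layer_paths]
    for w in weights:
        x = np.tanh(x @ w)
    return x

def append_kv(path, k, v):
    """Append one token's key and value vectors to the on-disk cache (append-only)."""
    with open(path, "ab") as f:
        f.write(np.concatenate([k, v]).astype(np.float32).tobytes())

def load_kv(path, dim):
    """Read the whole cache back as (keys, values), each of shape (n_tokens, dim)."""
    data = np.fromfile(path, dtype=np.float32).reshape(-1, 2, dim)
    return data[:, 0], data[:, 1]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        paths = save_layers(d, n_layers=6, dim=8)
        x = np.ones((2, 8), dtype=np.float32)
        same = np.allclose(forward_streamed(x, paths, group_size=2),
                           forward_resident(x, paths))
        print("streamed output matches resident output:", same)

        kv_path = os.path.join(d, "kv.bin")
        for t in range(3):  # one append per generated token
            append_kv(kv_path, np.full(8, t), np.full(8, -t))
        keys, values = load_kv(kv_path, dim=8)
        print("cached tokens:", keys.shape[0])
```

The trade-off is the one described above: the streamed version re-reads every layer from disk on each forward pass, so it is slower, but peak weight memory is bounded by `group_size` layers rather than the whole model.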