


	by everythingctl 60 days ago
	Maybe we can run more powerful models locally. I thought the principal consequence of these KV cache optimisations was letting you run more simultaneous inferences on the same model with the same memory. It doesn’t let you store more model. In some sense that puts local LLM usage at a further disadvantage to inference done in a hyperscaler’s data center.

2 comments

linuxhansl 60 days ago

The size of the KV cache (the stored context) is proportional to the number of layers in the model and its hidden dimension. For a 400B model it could be 30-60GB for just an 8K context window (just a ballpark; it depends on the model, etc.).

So shrinking that by 6x (from fp16) would be a big win for larger models. True, while TurboQuant can also be applied to model weights, there it won't save size over q4 compression, but it will have better accuracy.
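A rough sketch of that arithmetic, for anyone who wants to plug in their own numbers. The layer count, head counts, and head dimension below are illustrative assumptions for a model in the 400B class, not figures from this thread; the function name `kv_cache_bytes` is made up for the example. The formula is the usual one: keys and values are each stored per layer, per KV head, per token.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, batch=1):
    """Approximate KV cache size in bytes.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to fp16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem * batch


GB = 1e9

# Assumed configuration (hypothetical, roughly 400B-class):
# 126 layers, head_dim 128, 8K context, fp16.
# Full multi-head attention: every one of 128 heads keeps its own K and V.
mha = kv_cache_bytes(n_layers=126, n_kv_heads=128, head_dim=128, context_len=8192)

# Grouped-query attention: only 8 KV heads are cached.
gqa = kv_cache_bytes(n_layers=126, n_kv_heads=8, head_dim=128, context_len=8192)

# The 6x reduction mentioned above, applied to the fp16 cache.
mha_6x = mha / 6

print(f"fp16, 128 KV heads: {mha / GB:.1f} GB")
print(f"fp16, 8 KV heads:   {gqa / GB:.1f} GB")
print(f"128 KV heads, 6x smaller: {mha_6x / GB:.1f} GB")
```

With these assumed numbers the full multi-head case lands near 68 GB, close to the ballpark above, while a model using grouped-query attention needs only about 4 GB for the same context. So how much a 6x reduction matters depends heavily on the attention layout and on how long a context you want to hold.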

Edits: Better context


SilentM68 60 days ago

That's my hope as well as I tend to use low end GPUs (e.g. NVIDIA GeForce RTX 2060 @ 6GB). Been looking for an image generation model that can fit that vid card, for use with Ollama + GUI in Linux. No luck yet, since money's tight and jobs are tighter :(


MadnessASAP 60 days ago

An Arc B580 will just about fit Flux.2 Klein (at FP8). However, you can also easily get much larger GPUs on RunPod or Vast at $0.25/hr.

I would strongly recommend exploring that option. Renting an RTX 5090 for an evening of image generation for a dollar or two is way more fun than trying to jam big models onto little cards. Just take some time to create a reasonable, scripted deployment workflow for when you create a fresh instance.


fragmede 59 days ago

hey what's your Venmo?


SilentM68 59 days ago

The B-52s, er I mean Base64s:

VGhvdWdoIEkgYXBwcmVjaWF0ZSB0aGUgZ2VzdHVyZSBhbmQga2luZCBpbnRlbnRpb25zLCBJJ2QgcmF0aGVyIGxlYXJuIHRvIGZpc2ggKG9yIGNhdGNoIHRoZSBiaWcgYmFycmFjdWRhKHMpIHRoYXQgc3RvbGUgdGhlIHNjaG9vbCBvZiBmaXNoIEkgd2FzIGdpZnRlZCwgd2hpY2ggd291bGQgaGF2ZSBrZXB0IG1lIGZlZCBmb3IgbXVsdGlwbGUgbGlmZXRpbWVzLCBhbmQgc3Bvb2tlZCBhIGZldyBteXN0ZXJ5IGZyaWVuZHMgaW4gdGhlIHByb2Nlc3MtLS1hIHRhc2sgSSBhbSBjbG9zZSB0byBjb21wbGV0aW5nKSwgYW5kIG5ldmVyIGdvIGh1bmdyeSB0aGFuIGVhdCBhIGZpc2ggZm9yIGEgZGF5IGFuZCBiZSBodW5ncnkgdGhlIG5leHQu


fragmede 57 days ago

WWVhaCBtYW4sIEkgaGVhciB5YS4gSSByZXNwZWN0IHRoYXQgeW91IHdhbnQgdG8gc29sdmUgdGhlIHJlYWwgcHJvYmxlbSwgbm90IGp1c3QgZ2V0IHRocm91Z2ggdG9kYXkuCgpCdXQgZXZlbiBzb21lb25lIHdobyBrbm93cyBob3cgdG8gZmlzaCBzdGlsbCBuZWVkcyBtb25leSB0byBidXkgYSBwb2xlIGluIG9yZGVyIHRvIGZpc2guIExldCB0aGlzIGJlIHRoYXQuIE5vdCBhIGhhbmRvdXQsIGp1c3Qgc3VwcG9ydCBmb3IgdGhlIHBhcnQgeW91IGFyZSBhbHJlYWR5IGRvaW5nLgo=
