HN Mirror



	by zozbot234 77 days ago
	The performance gain in the recent Flash-MoE implementations seems to come mostly from coalescing the data for each MoE layer-expert into a single sequential extent that can be read efficiently from SSD. If so, this will require some changes to the underlying GGUF format; however, the GGUF standard explicitly provides for specifying different data layouts, so the additions are arguably minor. As far as the TurboQuant thing goes, it seems that attn-rot has recently been merged in, which is a lightweight variety of it and written by the original llama.cpp author, so not an outside pull req.
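	To make the coalescing idea concrete, here is a toy sketch in Python. It is not the GGUF format and not the Flash-MoE code: the tensor names (`gate`, `up`, `down`), the `write_layout` / `read_plan` / `load_expert` helpers, and the file layout are all hypothetical, chosen only to show why an expert-major layout turns several scattered reads into one sequential read.

```python
import os
import tempfile

# Hypothetical per-expert tensor names; real models may differ.
TENSORS = ("gate", "up", "down")


def write_layout(path, n_experts, size, expert_major):
    """Write dummy tensors and return {(expert, name): (offset, length)}.

    expert_major=False: all gate tensors, then all up, then all down.
    expert_major=True:  each expert's tensors are adjacent on disk.
    """
    keys = [(e, t) for e in range(n_experts) for t in range(len(TENSORS))]
    if not expert_major:
        keys.sort(key=lambda k: (k[1], k[0]))  # tensor-major order
    index = {}
    with open(path, "wb") as f:
        for e, t in keys:
            index[(e, TENSORS[t])] = (f.tell(), size)
            # Fill each tensor with a distinct byte so contents can be compared.
            f.write(bytes([e * len(TENSORS) + t]) * size)
    return index


def read_plan(index, expert):
    """Merge one expert's extents into the fewest contiguous (offset, length) reads."""
    extents = sorted(index[(expert, name)] for name in TENSORS)
    plan = [list(extents[0])]
    for off, length in extents[1:]:
        last = plan[-1]
        if off == last[0] + last[1]:  # directly follows the previous extent
            last[1] += length
        else:
            plan.append([off, length])
    return [tuple(p) for p in plan]


def load_expert(path, index, expert):
    """Read one expert's data using the merged read plan."""
    chunks = []
    with open(path, "rb") as f:
        for off, length in read_plan(index, expert):
            f.seek(off)
            chunks.append(f.read(length))
    return chunks


def demo(n_experts=4, size=4096, expert=2):
    """Build both layouts in a temp dir and compare their read plans."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "scattered.bin")
        b = os.path.join(d, "coalesced.bin")
        ia = write_layout(a, n_experts, size, expert_major=False)
        ib = write_layout(b, n_experts, size, expert_major=True)
        same = b"".join(load_expert(a, ia, expert)) == b"".join(load_expert(b, ib, expert))
        return read_plan(ia, expert), read_plan(ib, expert), same
```

	With the defaults, `demo()` returns a three-read plan for the tensor-major file and a single-read plan for the expert-major file, with identical bytes loaded in both cases. On a real SSD the gain would depend on read sizes, queue depth, and the page cache, none of which this sketch measures.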

1 comment

Aurornis 77 days ago

> As far as the TurboQuant thing goes, it seems that attn-rot has recently been merged in, which is a lightweight variety of it and written by the original llama.cpp author, so not an outside pull req.

Yes, read the first sentence of the PR for it. The project is a constant target for vibecoded PRs and they're trying to stay in front of it:

> In anticipation of the incoming flood of vibe generated PRs implementing TurboQuant, I'm raising the baseline a bit
