by zozbot234 77 days ago
The performance gain in the recent Flash-MoE implementations seems to come mostly from coalescing the data for each (layer, expert) pair into a single sequential extent that can be read efficiently from SSD. If so, this would require some changes to the underlying GGUF format, though the GGUF standard explicitly provides for specifying different data layouts, so the additions are arguably minor.
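
To make the coalescing idea concrete, here is a minimal sketch of what I mean. It is not the actual Flash-MoE code and not the GGUF on-disk layout; `pack_experts`, `read_expert`, and the index structure are hypothetical names made up for illustration. The only point is that all tensors of one (layer, expert) pair end up back to back, so loading an expert is one seek plus one sequential read.

```python
import io


def pack_experts(tensors):
    """Pack tensors so each (layer, expert) pair occupies one contiguous extent.

    `tensors` maps (layer, expert, name) -> bytes (hypothetical input shape).
    Returns (blob, index), where index maps (layer, expert) to
    (offset, length, [(name, relative_offset, size), ...]).
    """
    # Group tensors by (layer, expert); sorting keeps the layout deterministic.
    groups = {}
    for (layer, expert, name), data in sorted(tensors.items()):
        groups.setdefault((layer, expert), []).append((name, data))

    blob = io.BytesIO()
    index = {}
    for key, members in groups.items():
        start = blob.tell()
        parts = []
        for name, data in members:
            # Record where each tensor sits inside its expert's extent.
            parts.append((name, blob.tell() - start, len(data)))
            blob.write(data)
        index[key] = (start, blob.tell() - start, parts)
    return blob.getvalue(), index


def read_expert(f, index, layer, expert):
    """Load every tensor of one expert with a single seek and a single read."""
    offset, length, parts = index[(layer, expert)]
    f.seek(offset)
    extent = f.read(length)
    # Slice the individual tensors out of the in-memory extent.
    return {name: extent[rel:rel + size] for name, rel, size in parts}
```

With a layout that instead groups tensors by name across experts, the same load would need one scattered read per tensor, which is the access pattern the coalesced layout avoids.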

As far as the TurboQuant thing goes, it seems that attn-rot has recently been merged in, which is a lightweight variety of it and written by the original llama.cpp author, so not an outside pull req.

1 comment

> As far as the TurboQuant thing goes, it seems that attn-rot has recently been merged in, which is a lightweight variety of it and written by the original llama.cpp author, so not an outside pull req.

Yes, read the first sentence of the PR for it. The project is a constant target for vibe-coded PRs, and they're trying to stay ahead of them:

> In anticipation of the incoming flood of vibe generated PRs implementing TurboQuant, I'm raising the baseline a bit