
Hi, I devised the 4.5-bit (NUQ) and 8-bit (SFP) compression schemes. These are prototypes that enable reasonable inference speed without any fine-tuning, with compression/quantization running in a matter of seconds on a CPU.

We do not yet have full evals because the harness was added very recently, but note that the non-uniform '4-bit' scheme (4.5 bits per weight once the tables are counted) has twice the SNR of size-matched int4 with per-block scales.
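For readers who want to try this kind of comparison themselves, here is a minimal sketch of the int4 baseline and the SNR measurement. It is not code from gemma.cpp. The function names (`Snr`, `Int4RoundTrip`, `BitsPerWeight`), the symmetric [-7, 7] int4 range, and the definition of SNR as a plain power ratio are all assumptions made for illustration. The 16-entry, 8-bit table shared by 256 weights is likewise an assumed layout, chosen only because it yields the extra 0.5 bits mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// SNR as a power ratio: sum(x^2) / sum((x - y)^2). (Assumed definition.)
double Snr(const std::vector<float>& original,
           const std::vector<float>& decoded) {
  assert(original.size() == decoded.size());
  double signal = 0.0, noise = 0.0;
  for (std::size_t i = 0; i < original.size(); ++i) {
    const double x = original[i];
    const double err = x - static_cast<double>(decoded[i]);
    signal += x * x;
    noise += err * err;
  }
  return noise == 0.0 ? std::numeric_limits<double>::infinity()
                      : signal / noise;
}

// Baseline: symmetric int4 with one scale per block. Quantizes and then
// dequantizes, so the result can be compared against the input.
std::vector<float> Int4RoundTrip(const std::vector<float>& w,
                                 std::size_t block_size) {
  assert(block_size > 0);
  std::vector<float> out(w.size());
  for (std::size_t begin = 0; begin < w.size(); begin += block_size) {
    const std::size_t end = std::min(begin + block_size, w.size());
    float max_abs = 0.0f;
    for (std::size_t i = begin; i < end; ++i) {
      max_abs = std::max(max_abs, std::fabs(w[i]));
    }
    // Map the largest magnitude in the block to +/-7; avoid dividing by zero.
    const float scale = max_abs > 0.0f ? max_abs / 7.0f : 1.0f;
    for (std::size_t i = begin; i < end; ++i) {
      const float q = std::clamp(std::round(w[i] / scale), -7.0f, 7.0f);
      out[i] = q * scale;
    }
  }
  return out;
}

// Storage cost per weight: index bits plus the per-block table, amortized
// over the weights in that block.
double BitsPerWeight(int index_bits, int table_entries, int bits_per_entry,
                     int block_size) {
  return index_bits +
         static_cast<double>(table_entries) * bits_per_entry / block_size;
}
```

To compare schemes, call `Snr(weights, Int4RoundTrip(weights, block_size))` and set it against the same measurement for the other codec at equal bits per weight. Under the assumed layout, `BitsPerWeight(4, 16, 8, 256)` evaluates to 4.5.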

One advantage of gemma.cpp is that the code is quite compact, thanks to C++ and a single portable SIMD implementation (as opposed to separate SSE4, AVX2, and NEON versions). We were able to integrate the new quantization quite easily, and further improvements are planned.