

	by nalzok 891 days ago
	Congratulations on the release! How can we download the model and run inference locally?

austinvhuang 891 days ago

You can download the model checkpoints from Kaggle (https://www.kaggle.com/models/google/gemma) and Hugging Face (https://huggingface.co/blog/gemma).

Besides the Python implementations, we also wrote a standalone C++ implementation that runs locally using only CPU SIMD: https://github.com/google/gemma.cpp

tveita 891 days ago

Are there any cool highlights you can give us about gemma.cpp? Does it have any technical advantages over llama.cpp? It looks like it introduces its own quantization format; is there a speed or accuracy gain over llama.cpp's 8-bit quantization?

janwas 890 days ago

Hi, I devised the 4.5-bit (NUQ) and 8-bit (SFP) compression schemes. These are prototypes that enable reasonable inference speed without any fine-tuning, with compression/quantization running in a matter of seconds on a CPU.

We do not yet have full evals because the harness was added very recently, but we observe that the non-uniform '4-bit' scheme (4.5 bits once the tables are counted) has twice the SNR of size-matched int4 with per-block scales.

One advantage that gemma.cpp offers is that the code is quite compact, thanks to C++ and a single portable SIMD implementation (as opposed to separate SSE4, AVX2, and NEON versions). We were able to integrate the new quantization quite easily, and further improvements are planned.
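
For readers wondering how "4 bits plus tables" adds up to 4.5 and why a non-uniform table can beat an evenly spaced grid, here is a rough Python sketch. It is not the NUQ algorithm from gemma.cpp: the group size of 256, the 16 table entries of 8 bits each, the Gaussian test data, and every function name are assumptions made for illustration, and Lloyd's algorithm (1-D k-means) stands in for whatever clustering the real code uses. Both quantizers run on the same group, so this is not the size-matched comparison mentioned above and it makes no attempt to reproduce the reported SNR figure.

```python
import math
import random


def bits_per_weight(index_bits, table_entries, table_entry_bits, group_size):
    """Average storage per weight: per-weight index plus the amortised table."""
    return index_bits + table_entries * table_entry_bits / group_size


def snr_db(original, reconstructed):
    """Signal-to-noise ratio in decibels between a signal and its reconstruction."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    return 10.0 * math.log10(signal / noise)


def nearest(levels, x):
    """Return the entry of `levels` closest to x."""
    return min(levels, key=lambda c: abs(c - x))


def uniform_int4_levels(block):
    """The 16 evenly spaced levels of a symmetric int4 grid scaled to the block."""
    scale = max(abs(x) for x in block) / 7.0
    return [q * scale for q in range(-8, 8)]


def refine_codebook(block, levels, iterations=10):
    """Lloyd's algorithm in 1-D: move each level to the mean of its assigned values."""
    levels = list(levels)
    for _ in range(iterations):
        sums = [0.0] * len(levels)
        counts = [0] * len(levels)
        for x in block:
            i = min(range(len(levels)), key=lambda j: abs(levels[j] - x))
            sums[i] += x
            counts[i] += 1
        # Levels with no assigned values keep their previous position.
        levels = [s / n if n else c for s, n, c in zip(sums, counts, levels)]
    return levels


random.seed(0)
# Assumed stand-in for one group of model weights.
block = [random.gauss(0.0, 1.0) for _ in range(256)]

uniform = uniform_int4_levels(block)
# Start from the uniform grid so refinement can only reduce the error.
codebook = refine_codebook(block, uniform)

snr_uniform = snr_db(block, [nearest(uniform, x) for x in block])
snr_codebook = snr_db(block, [nearest(codebook, x) for x in block])
print(f"uniform int4 grid: {snr_uniform:.1f} dB")
print(f"16-entry codebook: {snr_codebook:.1f} dB")
print("bits per weight:", bits_per_weight(4, 16, 8, 256))
```

Under these assumed sizes the table costs 16 × 8 / 256 = 0.5 bits per weight on top of the 4-bit index. Seeding the codebook with the uniform grid means each refinement step can only lower the squared error, so the table-based result is never worse than the grid on the same group; how large the gap is on real weights depends on their distribution.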

kathleenfromgdm 891 days ago

Thank you! You can get started downloading the model and running inference on Kaggle: https://www.kaggle.com/models/google/gemma ; for a full list of ways to interact with the model, you can check out https://ai.google.dev/gemma.

aphit 891 days ago

FYI the ; broke the link, but I found it easily anyway.

kathleenfromgdm 891 days ago

Good catch - just corrected. Thanks!
