Hacker News
by vikp 898 days ago
The size of the framework is not the most important factor - the model weights are usually 10x+ the size of the framework.

The most important factor is inference speed. For something called Nitro, I really expected speed benchmarks. I'd be interested in CPU, CUDA, and MPS at different batch sizes.
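For anyone who wants to run that kind of comparison themselves, here is a rough sketch of a tokens-per-second harness that sweeps batch sizes. It is not Nitro's or llama.cpp's API: `generate` is a hypothetical adapter you would write around whichever backend (CPU, CUDA, MPS) you are testing, and `fake_backend` is a stand-in so the script runs without a model.

```python
import time
from typing import Callable, Dict, List, Sequence


def measure_throughput(
    generate: Callable[[List[str]], List[int]],
    prompt: str,
    batch_sizes: Sequence[int] = (1, 4, 16),
    repeats: int = 3,
) -> Dict[int, float]:
    """Return the best-of-`repeats` tokens/second for each batch size.

    `generate` is a hypothetical adapter: it takes a list of prompts and
    returns the number of tokens generated for each prompt.
    """
    results: Dict[int, float] = {}
    for batch_size in batch_sizes:
        batch = [prompt] * batch_size
        generate(batch)  # warm-up run, not timed
        best = 0.0
        for _ in range(repeats):
            start = time.perf_counter()
            token_counts = generate(batch)
            # Guard against a zero reading from a very fast call.
            elapsed = max(time.perf_counter() - start, 1e-9)
            best = max(best, sum(token_counts) / elapsed)
        results[batch_size] = best
    return results


def fake_backend(prompts: List[str]) -> List[int]:
    """Stand-in backend (an assumption) so the sketch runs without a model."""
    time.sleep(0.01)  # pretend every batch takes about 10 ms
    return [32] * len(prompts)


if __name__ == "__main__":
    for batch_size, tps in measure_throughput(fake_backend, "Hello").items():
        print(f"batch={batch_size:>3}  {tps:,.0f} tokens/s")
```

Swap `fake_backend` for a function that calls the server or library under test, and keep the model, quantization, and sampling settings identical across runs so the numbers are comparable.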


AFAICT, Nitro is just a wrapper around llama.cpp. Therefore, you can simply look at llama.cpp benchmarks, of which there are plenty.
In my testing, Oobabooga and other front ends and similar projects have shown upwards of a 50% difference in inference speed on the same model and settings, so benchmarks are still useful.
Ooba is an outlier, and has tons of overhead over llama.cpp and llama-cpp-python for some reason.

Most OpenAI-compatible servers built on llama.cpp are pretty close to vanilla llama.cpp, albeit without the batching support.