by tracerbulletx 859 days ago
It's a different inference engine with different capabilities, and it should be a lot faster on Nvidia cards. I don't have comparative benchmarks against llama.cpp, but if you find some, compare them to these:

https://nvidia.github.io/TensorRT-LLM/performance.html
https://github.com/lapp0/lm-inference-engines/