by rini17 742 days ago
Note that quantized versions of llama3 70B can be run on CPU on a much cheaper server. I am personally using it via llama.cpp on a bare-metal 6-core Xeon CPU with 128 GB of RAM for ~50 euros a month.
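For a rough sense of why quantization is what makes this fit in 128 GB, here is a back-of-the-envelope sketch. It assumes a nominal 70e9 parameters and flat 16/8/4 bits per weight; real llama.cpp quantization formats use slightly different effective bit widths, and the KV cache and runtime overhead are ignored, so treat the numbers as ballpark only.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9  # assumption: nominal "70B" parameter count
RAM_GB = 128     # RAM of the server mentioned above

# Assumption: flat bit widths; actual quantization formats differ somewhat.
for bits in (16, 8, 4):
    size = weight_footprint_gb(N_PARAMS, bits)
    fits = size < RAM_GB
    print(f"{bits:>2}-bit weights: ~{size:.0f} GB, fits in {RAM_GB} GB RAM: {fits}")
```

Under these assumptions the 16-bit weights alone come to about 140 GB and do not fit, while 8-bit (about 70 GB) and 4-bit (about 35 GB) leave headroom for context and the OS.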
2 comments

Is inference speed an issue for you?
Sufficient for fluent conversation.
Performance usually takes a hit with quantization. Are you getting quality responses?
Since llama3, yes, quite satisfying.