Hacker News
by
rini17
742 days ago
Note that quantized versions of llama3 70B can be run on CPU on a much cheaper server. I am personally using it via llama.cpp on a bare-metal 6-core Xeon CPU with 128 GB of RAM for ~50 euros a month.
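For a rough sense of why this fits in 128 GB, here is a back-of-the-envelope sketch of the weight storage at different precisions. The 70e9 parameter count is the nominal model size, the bits-per-weight values are illustrative assumptions rather than the exact figures of any particular llama.cpp quantization format, and the function name is made up for this example. It ignores the KV cache and runtime overhead, so treat the results as a lower bound.

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores KV cache, activations, and runtime overhead.
    """
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9  # nominal parameter count (assumption)
RAM_GB = 128     # RAM on the server described above

# Bits per weight are illustrative assumptions, not exact format specs.
for label, bits in [("16-bit", 16.0), ("8-bit", 8.0), ("~4.5-bit", 4.5)]:
    size = quantized_size_gb(N_PARAMS, bits)
    verdict = "fits" if size < RAM_GB else "does not fit"
    print(f"{label:>9}: {size:6.1f} GB of weights, {verdict} in {RAM_GB} GB RAM")
```

Under these assumptions the unquantized 16-bit weights alone would exceed 128 GB, while 8-bit and roughly 4-bit variants leave room to spare.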
2 comments
JJneid
742 days ago
Is inference speed an issue for you?
rini17
742 days ago
Sufficient for fluent conversation.
JJneid
742 days ago
Output quality usually takes a hit with quantization. Are you getting good responses?
rini17
742 days ago
Since llama3, yes, quite satisfying.