|
|
|
|
|
by aseipp
134 days ago
|
|
llama.cpp is giving me ~35tok/sec with the unsloth quants (UD-Q4_K_XL, elsewhere in this thread) on my Spark. FWIW my understanding and experience is that llama.cpp seems to give slight better performance for "single user" workloads, but I'm not sure why. I'm asking it to do some analysis/explain some Rust code in a rather large open source project and it's working nicely. I agree this is a model I could possibly, maybe use locally... |
|
--no-mmap --fa on options seemed to help, but not dramatically.
As with everything Spark, memory bandwidth is the limitation.
I'd like to be impressed with 30tok/sec but it's sort of a "leave it overnight and come back to the results" kind of experience, wouldn't replace my normal agent use.
However I suspect in a few days/weeks DeepInfra.com and others will have this model (maybe Groq, too?), and will serve it faster and for fairly cheap.