by brucethemoose2
1044 days ago
> 159s for the 256

This is still extremely slow for that CPU compared to the quantized model. IIRC the llama.cpp f32 code is basically a placeholder. BUT the threading overhead is a known performance issue, and I'm sure Java handles that better.

I didn't know about that, and I should have... Are there any "edge" frameworks as complete as ggml/llama.cpp that you know of that are faster now? ggml is still very easy to use, which I like, but I'd always thought of it as the fastest, particularly on CPU. I hadn't noticed there were known performance issues.