| HN Mirror

by brucethemoose2 1044 days ago

> 159s for the 256

This is still extremely slow for that CPU, compared to the quantized model.

IIRC the llama.cpp f32 code is basically a placeholder.

BUT the threading overhead is a known performance issue, and I'm sure Java handles that better.

version_five 1044 days ago

> threading overhead is a known performance issue

I didn't know about it, though I should have... Are there any "edge" frameworks as complete as ggml/llama.cpp that you know of that are faster now? Ggml is still very easy to use, which I like, but I'd always thought of it as the fastest, in particular on CPU. I hadn't noticed there were known performance issues.

brucethemoose2 1044 days ago

Apache TVM, with (for instance) mlc-llm. It will compile to CPU, Vulkan, and other esoteric backends, and its autotuning is like black magic.

Llama.cpp is still SOTA on CPU, as far as I know, especially with a small discrete GPU to help with long prompt ingestion. And it has tons of features (like grammars, context extension and good quantization) that other frameworks are still missing.