Jlama (Java) outperforms llama.cpp in F32 Llama 7B Model

	Jlama (Java) outperforms llama.cpp in F32 Llama 7B Model (twitter.com)
	7 points by tjake 1045 days ago

3 comments

syllogistic 1045 days ago

Huh, yeah, it repros. Java is faster: 159s vs. 203s for the 256 tokens on my Intel i9 12th gen.
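For scale, those timings can be turned into throughput figures. This is a minimal sketch of the arithmetic in Java, using only the 256-token count and the two wall-clock times quoted above; the class and method names are hypothetical, chosen for illustration.

```java
public class Throughput {
    // Tokens generated per second of wall-clock time.
    static double tokensPerSecond(int tokens, double seconds) {
        return tokens / seconds;
    }

    // How many times faster run A is than run B, given their wall-clock times.
    static double speedup(double secondsA, double secondsB) {
        return secondsB / secondsA;
    }

    public static void main(String[] args) {
        int tokens = 256;          // token count quoted in the comment
        double jlamaSecs = 159.0;  // Jlama wall-clock time quoted in the comment
        double llamaSecs = 203.0;  // llama.cpp wall-clock time quoted in the comment

        System.out.printf("Jlama:     %.2f tok/s%n", tokensPerSecond(tokens, jlamaSecs));
        System.out.printf("llama.cpp: %.2f tok/s%n", tokensPerSecond(tokens, llamaSecs));
        System.out.printf("Speedup:   %.2fx%n", speedup(jlamaSecs, llamaSecs));
    }
}
```

Both runs come out below 2 tokens per second, with Jlama roughly 1.3x faster on this one machine.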

brucethemoose2 1045 days ago

> 159s for the 256

This is still extremely slow for that CPU, compared to the quantized model.

IIRC the llama.cpp f32 code is basically a placeholder.

BUT the threading overhead is a known performance issue, and I'm sure Java handles that better.

version_five 1045 days ago

> threading overhead is a known performance issue

I didn't know about it, though I should have... Are there any "edge" frameworks as complete as ggml/llama.cpp that you know of that are faster now? Ggml is still very easy to use, which I like, but I'd always thought of it as the fastest, in particular for CPU. I hadn't noticed there were known performance issues.

brucethemoose2 1045 days ago

Apache TVM, with (for instance) mlc-llm. It will compile to CPU, Vulkan, and other esoteric backends, and its autotuning is like black magic.

Llama.cpp is still SOTA on CPU, as far as I know, especially with a small discrete GPU to help with long prompt ingestion. And it has tons of features (like grammar, context extending and good quant) that other frameworks are still missing.

version_five 1045 days ago

Where does the performance difference come from? And on what kind of processor & GPU? I didn't even know llama.cpp had a 32-bit option. For now I'm pretty skeptical that it's a fair comparison.

tjake 1045 days ago

The default for `convert.py` is F32. This is just a SIMD CPU comparison.

Jlama uses the Vector API in Java 20, but also better thread scheduling with work stealing and zero allocation.
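To make the scheduling idea concrete, here is a rough sketch of a matrix-vector multiply split across a work-stealing `ForkJoinPool`, writing into a caller-supplied output buffer so that no result array is allocated per call. It is not Jlama's code: `MatVecSketch`, `RowRange`, and `THRESHOLD` are hypothetical names, the inner loop is plain scalar Java standing in for Vector API code (which needs the incubator module enabled), and the task objects themselves are still allocated on each call.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class MatVecSketch {
    // Assumed cutoff: row ranges at or below this size are computed directly.
    static final int THRESHOLD = 64;

    // One task covers rows [lo, hi) of a row-major matrix.
    static final class RowRange extends RecursiveAction {
        final float[] m, v, out;
        final int cols, lo, hi;

        RowRange(float[] m, float[] v, float[] out, int cols, int lo, int hi) {
            this.m = m; this.v = v; this.out = out;
            this.cols = cols; this.lo = lo; this.hi = hi;
        }

        @Override
        protected void compute() {
            if (hi - lo <= THRESHOLD) {
                for (int r = lo; r < hi; r++) {
                    float sum = 0f;
                    int base = r * cols;
                    // Scalar dot product; a SIMD version would replace this loop.
                    for (int c = 0; c < cols; c++) {
                        sum += m[base + c] * v[c];
                    }
                    out[r] = sum; // written in place, no new result array
                }
                return;
            }
            // Split in half; idle pool workers can steal either half.
            int mid = (lo + hi) >>> 1;
            invokeAll(new RowRange(m, v, out, cols, lo, mid),
                      new RowRange(m, v, out, cols, mid, hi));
        }
    }

    // Computes out = m * v, where m is rows x cols in row-major order.
    static void matVec(float[] m, float[] v, float[] out, int rows, int cols) {
        ForkJoinPool.commonPool().invoke(new RowRange(m, v, out, cols, 0, rows));
    }
}
```

The caller allocates `out` once and reuses it across tokens, which is the kind of buffer reuse that "zero allocation" usually refers to.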

belfthrow 1044 days ago

Could you link to some of the examples in your repo where you enforce the zero allocation? I don't see much reuse of the buffers, e.g. float buffers, and there is quite a lot of array-based heap allocation. Just for my own interest. Many thanks. Cool to see the use of the new Vector API also.

version_five 1045 days ago

Very interesting, I'll watch for the quantized version.

tjake 1045 days ago

GH: https://github.com/tjake/Jlama
