Llama2.java: Karpathy's llama2.c ported to Java

	Llama2.java: Karpathy's llama2.c ported to Java (github.com)
	33 points by mukel 1050 days ago

5 comments

gavinray 1050 days ago

The Java code is impressively written, using newer features like MemorySegment.

Looked at the author and realized it's Alfonso from the Graal team -- makes sense.

I wonder whether the "matmul" code could be further optimized with the Vector API and SIMD.

mukel 1050 days ago

Author here: I implemented several versions of matmul with different unrolling schemes using the Vector API and got a ~4X speedup with a single thread, but the speedup fades the more threads you add. I think performance is constrained by memory bandwidth, which is saturated by a small number of threads regardless of vectorization.
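To make "unrolling schemes" concrete, here is a minimal sketch of one such scheme. It is plain scalar Java, not the Vector API version described above and not code from the repository; the class and method names are hypothetical. Four independent accumulators break the dependency chain on a single sum, which is the same idea a vectorized kernel exploits with wider registers.

```java
// Hypothetical illustration: scalar 4-way unrolled dot product.
// This stands in for the Vector API variants mentioned in the comment;
// it is not taken from Llama2.java.
final class UnrolledDot {
    private UnrolledDot() {}

    /** Dot product of w[offset .. offset+n) with x[0 .. n). */
    static float dot(float[] w, int offset, float[] x, int n) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        int i = 0;
        int upper = n - (n % 4); // largest multiple of 4 that is <= n
        for (; i < upper; i += 4) {
            // Four independent partial sums per iteration.
            s0 += w[offset + i] * x[i];
            s1 += w[offset + i + 1] * x[i + 1];
            s2 += w[offset + i + 2] * x[i + 2];
            s3 += w[offset + i + 3] * x[i + 3];
        }
        float sum = (s0 + s1) + (s2 + s3);
        // Scalar tail for the remaining 0..3 elements.
        for (; i < n; i++) {
            sum += w[offset + i] * x[i];
        }
        return sum;
    }
}
```

Because the summation order differs from a straight loop, results on real weights can differ in the last few bits.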

kurhan 1050 days ago

Also, the new virtual threads might be beneficial. I once experimented with the Vector API for matrix multiplication and the effect was pretty good.

mike_hearn 1050 days ago

Virtual threads shouldn't help as the program isn't I/O or wait bottlenecked. It's a pure computation, so it's all about vectorization here.

atairov 1045 days ago

Thanks for sharing this! It's great to have a reference implementation written in Java. Given the simplicity of the original, it's really easy to follow the Llama architecture logic.

In case anyone is interested in a Python version, I spent some time over the weekend and ported it to pure Python -- https://github.com/tairov/llama2.py

I never knew that it would take only about 500 lines of core code to implement inference for such cutting-edge AI technology.

mukel 1050 days ago

A Java port of llama2.c that performs very close to C on large models. Llama 2 7B runs at a whopping 1.6 tokens/s.
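A rough back-of-the-envelope check of what 1.6 tokens/s implies, in light of the memory-bandwidth remark elsewhere in the thread. The inputs other than 1.6 tokens/s are assumptions, not figures from the post: roughly 6.74 billion parameters for the 7B model, 4-byte float32 weights, and every weight read once per generated token (KV cache and activations ignored). The class name is hypothetical.

```java
// Hypothetical estimate of the weight-read bandwidth implied by a token rate.
final class BandwidthEstimate {
    private BandwidthEstimate() {}

    /** Weight bytes streamed per second, in GB/s (1 GB = 1e9 bytes). */
    static double impliedGBPerSecond(double params, int bytesPerWeight, double tokensPerSecond) {
        double bytesPerToken = params * bytesPerWeight; // one full pass over the weights
        return bytesPerToken * tokensPerSecond / 1e9;
    }

    public static void main(String[] args) {
        // Assumed: 6.74e9 parameters, float32 weights. From the post: 1.6 tokens/s.
        double gbps = impliedGBPerSecond(6.74e9, 4, 1.6);
        System.out.printf("~%.0f GB/s%n", gbps); // prints "~43 GB/s"
    }
}
```

Under those assumptions the weights alone account for about 43 GB/s of reads, which is in the range of what desktop memory systems deliver, so extra threads or wider SIMD would have little room to help.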

mike_hearn 1050 days ago

Hey man, awesome stuff. Surely any JIT compiler will struggle to vectorize something using IntStream.range, though? Looking at matmul, I'd not expect that to be auto-vectorized. The Panama API can be used to do a matmul vectorization, too bad it seems to never launch.
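For readers who have not opened the repo, this is a minimal sketch of the kind of matmul being discussed: rows distributed with `IntStream.range(...).parallel()`, and a plain scalar inner loop. It follows the llama2.c convention of W (d,n) @ x (n,) -> xout (d,), but it uses plain `float[]` arrays and hypothetical names, so treat it as an illustration of the pattern and not as the project's actual code.

```java
import java.util.stream.IntStream;

// Hypothetical illustration of a row-parallel matrix-vector product.
final class MatmulSketch {
    private MatmulSketch() {}

    /**
     * Computes xout = W * x, where W is d x n in row-major order.
     * Each output row is independent, so rows are split across the common pool.
     */
    static void matmul(float[] xout, float[] x, float[] w, int n, int d) {
        IntStream.range(0, d).parallel().forEach(i -> {
            float val = 0f;
            int rowStart = i * n;
            // Scalar inner loop; whether the JIT vectorizes this is the open question above.
            for (int j = 0; j < n; j++) {
                val += w[rowStart + j] * x[j];
            }
            xout[i] = val;
        });
    }
}
```

Each lambda invocation writes a distinct `xout[i]`, so no synchronization is needed beyond the join that `forEach` performs before returning.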

mwcampbell 1050 days ago

Panama is now in its third preview in the soon-to-be-released JDK 21:

https://openjdk.org/jeps/442

Is there any indication that it won't go from there to a final release soon?

mike_hearn 1049 days ago

That's only for the FFI I think. The vector API has been incubated six times now and is waiting for Valhalla :(

shortrounddev2 1050 days ago

Have you all used these things for anything useful? I can't get them to give useful results on my 3060 8GB. If I wanted to get decent results I think I'd need to rent a GPU node somewhere, but ChatGPT is still free.

SushiHippie 1050 days ago

The 4-bit quantized 13B models give really decent answers (not as good as GPT-4, but often as good as GPT-3).

nmfisher 1050 days ago

I know it might be asking a lot, but it would be great if someone could put up a HF space so I could try all the various flavours/sizes.

lazylion2 1050 days ago

/r/LocalLLaMA/

nmfisher 1050 days ago

I'm already subscribed (and I already ran the small version locally), but I'd still like to be able to quickly evaluate the models online in a couple of minutes, rather than going through the rigmarole of downloading & running every new model/variant locally.

jiehong 1050 days ago

This makes me wonder: what’s the status of GPU programming on the JVM?

Any abstractions for GPGPU or shader programming?

pjmlp 1049 days ago

Besides TornadoVM,

http://javagl.de/jcuda.org/

https://dragan.rocks/software/

https://blogs.oracle.com/javamagazine/post/programming-the-g...

mike_hearn 1050 days ago

See here: https://www.tornadovm.org/

But it's a research project.

jfumero 1047 days ago

To quote Gary Frost (creator of Aparapi), TornadoVM is the state-of-the-art right now. He mentioned this at JVMLS 2023. Hopefully the videos will be available soon from this link: https://openjdk.org/projects/mlvm/jvmlangsummit/
