by theaiquestion 1106 days ago

Compete with llama.cpp? Like transformers llama [0], exllama [1] (really fast), or lit-llama [2]?

exllama is really memory efficient and really fast.

[0] https://huggingface.co/docs/transformers/main/model_doc/llam...

[1] https://github.com/turboderp/exllama

[2] https://github.com/Lightning-AI/lit-llama

EDIT: Or do you mean CUDA? Because yeah, it's such a shame AMD's ROCm is so bad that even geohot gave up. Its examples don't even run without crashing.

https://github.com/RadeonOpenCompute/ROCm/issues/2198#issuec...

2 comments

kayvr 1106 days ago

Also https://github.com/kayvr/TokenHawk, a WebGPU implementation of LLaMA.

edit: Note that this is my project.


dTal 1106 days ago

Thanks for the tip about exllama. I've been on the lookout for a readable Python implementation to play with that is also fast and supports quantized models.
