	by joelthelion 947 days ago
	> My prediction would be that this will develop in the way that we can soon buy $1 hardware accelerators for things like word embedding, grammar, and general language understanding. And then you need those expensive GPUs only for the last few layers of your LLM, thereby massively reducing deployment costs.

	You'd still need a lot of RAM to store these weights, wouldn't you? I mean, obviously, a $1 accelerator is a great improvement over $x,000 GPUs, but it doesn't mean we all get LLMs working on our phones just yet.


fxtentacle 947 days ago

That's the beauty of their method: if you can replace an 8192x8192 matrix multiplication with an 8192x256 decision tree followed by a 256x8192 lookup table, your memory requirements go from 67,108,864 down to about 2,162,688 parameters. (I assumed that their decision tree for encoding is perfectly balanced and only uses log2(256) = 8 parameters per row, so 8192 x 8 = 65,536 for the tree plus 256 x 8192 = 2,097,152 for the table.)
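The count above can be checked with a few lines of Python. This is only a sketch of the arithmetic under the same assumption as in the comment (a perfectly balanced tree costing log2(256) parameters per input row); the function names are made up for illustration and say nothing about how the paper actually stores its encoder.

```python
import math

def dense_params(n_in: int, n_out: int) -> int:
    """Parameters in a plain n_in x n_out weight matrix."""
    return n_in * n_out

def tree_plus_lut_params(n_in: int, n_out: int, n_codes: int) -> int:
    """Parameters for a balanced-tree encoder plus a lookup table.

    Assumption (from the comment, not from the paper): the encoder
    costs log2(n_codes) parameters per input row.
    """
    tree_depth = int(math.log2(n_codes))   # 8 when n_codes == 256
    encoder = n_in * tree_depth            # 8192 * 8
    lookup_table = n_codes * n_out         # 256 * 8192
    return encoder + lookup_table

dense = dense_params(8192, 8192)
compact = tree_plus_lut_params(8192, 8192, 256)

print(f"dense:   {dense:,}")
print(f"compact: {compact:,}")
print(f"ratio:   {dense / compact:.1f}x")
```

Under that assumption the reduction works out to roughly 31x fewer parameters per layer.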

EDIT: And given that this work is centered around energy-efficiency and was sponsored by Huawei, I would guess that LLMs on your phone are precisely the goal here.

EDIT2: The process node that they did their calculations with appears to match Google's TPUv3, which has 0.56 TOPS/W. The paper claims 161 TOPS/W, which would be roughly a 287x improvement in energy efficiency over the AI chips in Pixel phones.
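For anyone who wants to redo the division, here is the ratio of the two figures quoted above. Both numbers are taken from this comment as-is; the variable names are just for illustration.

```python
# Figures as quoted in the comment above (TOPS per watt).
tpu_v3_tops_per_watt = 0.56
paper_tops_per_watt = 161.0

ratio = paper_tops_per_watt / tpu_v3_tops_per_watt
print(f"{ratio:.1f}x")
```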


baq 947 days ago

Mind blown. Sounds almost too good to be true, except that the human brain runs on about 20 W and this brings us into the same ballpark. This was hard sci-fi a year ago!

Can an approach like this be integrated into stuff like llama.cpp so I could have a 200B model hashed down to 7B to run on civilian hardware or even a CPU?


fxtentacle 947 days ago

I'd expect that on a regular CPU, the RAM access latency will destroy any performance improvements. This work is much better suited for FPGAs or ASICs.


joennlae 947 days ago

Thank you for the feedback :-)

We have to be careful with the comparisons we make. The TPUv3 is a datacenter training chip, not an edge/inference chip. The two optimise for different tradeoffs, so while the comparison looks good, it is unfair.
