| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by anon373839 919 days ago

No, I am referring to the 7Bx8 MoE model. The MoE layers apparently can be sparsified (or equivalently, quantized down to a single bit per weight) with minimal loss of quality.

Inference on a quantized model is faster, not slower.

However, I have no idea how practical it is to run a LLM on a phone. I think it would run hot and waste the battery.

1 comments

stavros 919 days ago

Really? Well that's very exciting. I don't care about wasting my battery if it can do my menial tasks for me, battery is a currency I'd gladly pay for this use case.

link

brandall10 919 days ago

I think we're at least a couple generations away where this is feasible for these models, unless say it's for performing a limited number of background tasks at fairly slow inference speed. SOC power draw limits will probably limit inference speed to about 2-5 tok/sec (lower end for Mixtral which has the processing requirements of a 14B) and would suck an iPhone Max dry in about an hour.

link

TeMPOraL 919 days ago

Maybe this way, we'll not just get user-replaceable batteries in smartphones back - maybe we'll get hot-swappable batteries for phones, as everyone will be happy to carry a bag of extra batteries if it means using advanced AI capabilities for the whole day, instead of 15 minutes.

link

stavros 919 days ago

Or maybe we'll finally get better batteries!

Though I guess that's not for lack of trying.

link