by kossTKR 1100 days ago
Does this mean that GPT-4/65b level performance is closer to running on, say, an m1/m2 with only 24+ gigabytes of ram?
2 comments

Nope. You will still need a proper GPU. You can't yet run large language models on tiny hardware like an m1/m2. Even the llama.cpp magic is only possible with very small models at beam size 1, which really limits the "creativity" of these models.
We run into this constantly with Willow[0] and the Willow Inference Server[1]. There seems to be a large gap in understanding with many users. They seem to find it difficult to understand a fundamental reality: GPUs are so physically different and so much better suited to many/most ML tasks that all the CPU tricks in the world cannot bring CPUs even close to the performance of GPUs (while maintaining quality/functionality) for many tasks. I find this interesting because everyone seems to take it as obvious that integrated graphics and discrete graphics aren't even close for gaming. Ditto for these tasks.

With Willow Inference Server I'm constantly telling people: a six-year-old $100 Tesla P4/GTX 1070 walks all over even the best CPUs in the world for our primary task of speech to text/ASR - at dramatically lower cost and power usage. Seriously - a GTX 1070 is at least 5x faster than a Threadripper 5955WX. Our goal is to provide an open-source user experience equivalent to commercial voice assistants, and that is and will be fundamentally impossible for the foreseeable future on CPU.

Slight tangent but there are users in the space who seem to be under the impression that they can use their Raspberry Pi for voice assistant/speech recognition. It's not even close to a fair fight: with the same implementation and settings a GTX 1070 is roughly 90x (nearly two orders of magnitude) faster[2] than a Raspberry Pi... Yes, all-in a machine with a GTX 1070 uses an order of magnitude more power (30W vs 3W) than a Raspberry Pi, but even in the countries with the most expensive power in the world that works out to a $2-$3/mo cost difference - which I feel, at least, is a reasonable trade-off considering the dramatic difference in usability (a Raspberry Pi is essentially useless - waiting 10-30 seconds for a response makes pulling your phone out faster).
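For anyone who wants to check the power-cost arithmetic against their own electricity rate, here is a rough sketch. The 3W and 30W figures come from the comment above; the continuous 24/7 draw, the 30-day month, and the per-kWh prices are assumptions for illustration only, and the helper name is made up.

```python
# Rough monthly cost difference between two always-on machines.
# Assumptions (not from the thread): constant draw, 30-day month,
# and the example prices below.
HOURS_PER_MONTH = 24 * 30


def monthly_cost_difference(low_watts, high_watts, price_per_kwh):
    """Extra cost per month of running the higher-power machine 24/7."""
    extra_kwh = (high_watts - low_watts) * HOURS_PER_MONTH / 1000
    return extra_kwh * price_per_kwh


if __name__ == "__main__":
    # Hypothetical electricity prices in $/kWh; substitute your own rate.
    for price in (0.10, 0.15):
        diff = monthly_cost_difference(3, 30, price)
        print(f"${price:.2f}/kWh -> ${diff:.2f}/mo extra")
```

The extra 27W comes to about 19.4 kWh per month, so the dollar figure scales linearly with whatever rate you plug in.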

[0] - https://github.com/toverainc/willow

[1] - https://github.com/toverainc/willow-inference-server

[2] - https://github.com/toverainc/willow-inference-server/tree/wi...

I am in fact running my own instance of the Willow Inference Server (née air-infer-api) against a Tesla P4 8GB gifted to me by our mutual friend Richard. It works wonderfully, up to IIRC 3 chunks of audio. We really need to implement streaming so I can use it to close caption videos without subtitles.

For others in this thread, if you haven't tried Willow yet, check it out, as it is an amazing leap forward and can actually run on some pretty small GPUs. LLMs are hogging the AI spotlight but you will struggle to run them on consumer hardware. Image and audio processing models are generally much smaller and more approachable.

Not really. vLLM optimizes the throughput of your LLM, but does not reduce the minimum amount of resources required to run your model.
But (in theory) - llama.cpp could implement a similar approach to paging/memory and see a speedup for 4bit models on cpu?
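To put a number on the "minimum required resources" point, here is a back-of-the-envelope sketch of how much memory the weights alone take at a given precision. The 65B parameter count and the 24 GB figure come from the original question; the helper name is made up, and the estimate deliberately ignores the KV cache, activations, and per-block quantization overhead, so real usage would be higher.

```python
# Lower bound on memory for model weights only.
# Assumptions: every weight uses exactly `bits_per_weight` bits and
# 1 GB = 1e9 bytes; KV cache and activations are not counted.
def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed to hold the weights, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 65e9   # 65B model from the question
    ram_gb = 24       # RAM figure from the question
    for bits in (16, 8, 4):
        need = weight_memory_gb(n_params, bits)
        verdict = "fits" if need <= ram_gb else "does not fit"
        print(f"{bits}-bit: {need:.1f} GB of weights, {verdict} in {ram_gb} GB")
```

Under these assumptions even 4-bit weights for a 65B model come to 32.5 GB, which is why smarter paging can help throughput without lowering the floor on memory.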