Hacker News new | ask | show | jobs
by yieldcrv 1169 days ago
I think there just hasn't been a consumer application that is really resource constrained, for a long time now. Only things for enthusiasts have been. LLMs have product market fit, but running a useful one client side is resource constrained, but instead of it truly being a consumer hardware limitation, it just turns out they were never optimized to begin with - coming from the perceived "top AI/ML minds" at FAANGs, while some of the most basic optimizations are seemingly a lost art.

On the other hand, its only been a few weeks, so maybe I should ignore this absurdity and just wait.

5 comments

Probably a combination of (a) ML framework people not paying much attention to CPU inference due to already having GPUs/TPUs already lying around for training - CPU inference is just for very quick experiments (b) research code has never been the best optimized for performance (c) ML people are not generally systems programmers, and a lot of systems programmers are afraid to mess with the ML code outside of low-level computation kernels (doesn't help that ML code is notoriously unreproducible).
It's indeed a very different world. This model was trained on thousands of GPUs. The weird file format corresponds to the train time sharding of the weights. And really nobody is doing CPU inference with all the GPU we have. And also the "CLI" use case seems contrieved to me. If you plan to interact several times with the model and want to keep the weights in RAM, why don't you start a REPL or spin up a server?
> while some of the most basic optimizations are seemingly a lost art

mmap isn't relevant to anyone except CPU-using programmers because other hardware doesn't have virtual memory paging. Firmware programmers don't care, GPU programmers don't care.

AFAIK CUDA offers unified memory which basically works with virtual address space and page faulting in data from main memory. There is also IOMMU in general.
Many of us would like to get rid of the host CPU and have ML trainers that are just GPUs and drives and NICs all attached to a northbridge. The GPU has everything required to make disk requests over the bus, and ideally the drive can receive network messages that get plumbed straight to the drive (I'm only partially joking).
Word embeddings were big for their time (especially with subword embeddings like fastText). We mmaped word embeddings for similar reasons. But yeah, I was kinda surprised that one post about LLaMa.cpp mmap support talked about a 'fairly new technique'. mmap has been in a UNIX programmer's tool belt for literally decades.
> never optimized to begin with

I think the better read is that they're being adapted to new applications, constraints, and environments, all at once.

Why would Facebook care about running LLAMA on a cpu with optimizing for 1-2% more latency when it has a lot of A100s laying around?