| HN Mirror

It is very likely that you consume less power running a 1B LLM on an Nvidia supercluster than you do trying to download and run the same model on a smartphone. I don't think people understand just how fast the server hardware is compared to what is in their pocket.

We'll see companies push for tiny on-device models as a novelty, but even the best of those aren't very good. I firmly believe that GPUs are going to stay relevant even as models scale down, since they're still the fastest and most power-efficient solution.