by Evidlo 606 days ago
Why don't they actually say what the size of the model is in GB?

That and average inference times on common hardware are what I'm curious about.

1 comment

The last table shows memory usage and performance on an Android phone.

> Decode latency improved by 2.5x and prefill latency improved by 4.2x on average, while model size decreased by 56% and memory usage reduced by 41% on average. The benchmarks can be reproducible today via ExecuTorch Llama instructions. The table above shows results using an Android OnePlus 12 device—however, we’ve also verified similar relative performance on Samsung S24+ for 1B and 3B and Samsung S22 for 1B.
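For a rough sense of what those relative numbers mean in GB, here is a back-of-the-envelope sketch. The parameter count (exactly 1e9) and the 16-bit baseline are assumptions for illustration, not figures from the post, and the function names are made up. The 56% is the quoted average, so the result is only a ballpark for the weights and says nothing about runtime memory.

```python
def model_size_gb(num_params: float, bytes_per_param: float) -> float:
    """Weights-only size in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def apply_reduction(size_gb: float, reduction: float) -> float:
    """Size after shrinking by a fraction, e.g. 0.56 for a 56% decrease."""
    return size_gb * (1.0 - reduction)


# Assumption: a nominal 1e9-parameter model stored as 16-bit (2-byte) weights.
baseline_gb = model_size_gb(1e9, 2)

# The quoted *average* size reduction of 56%, applied to that assumed baseline.
quantized_gb = apply_reduction(baseline_gb, 0.56)

print(f"baseline:  {baseline_gb:.2f} GB")
print(f"quantized: {quantized_gb:.2f} GB")
```

Under those assumptions a 1B-class model goes from about 2 GB to a bit under 1 GB. The real per-model figures are in the table the quote refers to.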