by pbnjay 726 days ago
Yeah... They are using a single-core 13W measurement to project out to a 64x parallelization, with no mention of any overhead due to parallelization or the power needs of the supporting hardware. This is a key quote for me (page 12 of the PDF):

> The 1.3B parameter model, where L = 24 and d = 2048, has a projected runtime of 42ms, and a throughput of 23.8 tokens per second.

e.g. 64 x 13.67W ≈ 875 Watts to run a 1.3B model at 23.8 t/s... I'm pretty sure my phone can do way better than that! Even half that power, given the assertions in their table, is still excessive for such a small model.
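
For anyone who wants to check the arithmetic, here is a rough sketch of it in Python. The 64 cores, 13.67 W per core, and 23.8 t/s are the figures quoted above. The linear scaling with zero parallelization overhead is my assumption, not something the paper states, and the function names are made up for illustration.

```python
# Back-of-the-envelope power projection using the figures from the comment.
# Assumption: power scales linearly with core count and there is no
# overhead from parallelization or supporting hardware.

CORES = 64                 # parallelization factor
WATTS_PER_CORE = 13.67     # single-core power measurement
TOKENS_PER_SECOND = 23.8   # projected throughput for the 1.3B model


def projected_power_watts(cores=CORES, watts_per_core=WATTS_PER_CORE):
    """Total power if every core draws the single-core figure."""
    return cores * watts_per_core


def joules_per_token(power_watts, tokens_per_second=TOKENS_PER_SECOND):
    """Energy spent per generated token at the given power and throughput."""
    return power_watts / tokens_per_second


total = projected_power_watts()
print(f"{total:.2f} W total, {joules_per_token(total):.1f} J per token")
```

That works out to about 874.88 W, or roughly 36.8 J per token if the 23.8 t/s figure already reflects all 64 cores.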


When you multiply by 64 you also get 64 times more tokens per second!! Your math is wrong.
That's their math: the 23.8 t/s is already the 64x figure, but they didn't 64x the other stats.