Hacker News
by brucethemoose2 1042 days ago
For prompt ingestion... I dunno.

Unbatched token generation is basically RAM-bandwidth limited, as all of the model's weights have to be read from memory for each generated token. I bet theoretical performance is similar to the GPU, albeit with much lower power consumption.
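
To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes every weight is read exactly once per token and that nothing else (compute, KV cache reads, caching effects) matters, so it's only an upper bound. The function names and the example figures (7B parameters, 4-bit weights, 100 GB/s) are made up for illustration, not measurements of any real chip.

```python
def model_size_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in bytes (ignores any format overhead)."""
    return n_params * bits_per_weight / 8


def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on unbatched generation speed, assuming the whole model
    is streamed from memory once per generated token."""
    if model_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("model size and bandwidth must be positive")
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    # Hypothetical example: 7e9 parameters at 4 bits each, 100 GB/s of memory bandwidth.
    size = model_size_bytes(7e9, 4)            # 3.5e9 bytes
    bound = max_tokens_per_second(size, 100e9)
    print(f"weights: {size / 1e9:.1f} GB, upper bound: {bound:.1f} tokens/s")
```

With those made-up figures the bound comes out to about 28.6 tokens/s. Two devices with the same memory bandwidth get the same bound under this estimate, whatever their compute throughput.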