For prompt ingestion... I dunno.

Unbatched token generation is basically RAM bandwidth limited, since all of the model's weights have to be read from memory for each token. I bet theoretical performance is similar to the GPU, albeit with much lower power consumption.
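
Rough back-of-envelope for the bandwidth argument, as a sketch only: if every weight is read once per generated token, the ceiling on tokens per second is roughly memory bandwidth divided by model size in bytes. The function name and the numbers below (a 7B-parameter model at 4 bits per weight, 100 GB/s of memory bandwidth) are made up for illustration and are not measurements of any particular hardware. It also ignores KV cache reads, compute time, and caching effects.

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on unbatched decode speed, assuming all weights
    are read from memory exactly once per generated token."""
    if model_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("model size and bandwidth must be positive")
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to illustrate the arithmetic.
    params = 7e9              # 7B parameters (assumption)
    bytes_per_param = 0.5     # 4-bit quantization (assumption)
    bandwidth = 100e9         # 100 GB/s memory bandwidth (assumption)

    model_bytes = params * bytes_per_param
    bound = max_tokens_per_second(model_bytes, bandwidth)
    print(f"~{bound:.1f} tokens/s upper bound")
```

Plug in the real bandwidth of two devices and the same model size, and the ratio of the two bounds is just the ratio of their memory bandwidths, which is the sense in which the two should land in the same ballpark if their bandwidth is comparable.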