Unbatched token generation is basically RAM bandwidth limited, as the entire model has to be cycled through for each token. I bet theoretical performance is similar to the GPU, albeit with much lower power consumption.
Unbatched token generation is basically RAM bandwidth limited, as the entire model has to be cycled through for each token. I bet theoretical performance is similar to the GPU, albeit with much lower power consumption.