by why_only_15
919 days ago
I'd be curious to learn more about how it's compute bound, and what specifically is compute bound. On a modern H100 you need roughly 600 fp8 operations per byte loaded from memory before you are compute bound, and that assumes full 128-byte loads every time. Even integer/fp32 vector work needs quite a few operations per byte to be compute bound (~20 for vector fp32).

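For anyone who wants to check those ratios, here is a rough roofline-style sketch. The peak figures are my assumptions, taken from NVIDIA's public H100 SXM datasheet (about 1979 TFLOPS dense fp8, 67 TFLOPS fp32, 3.35 TB/s HBM3), and the function names are made up for illustration. Real kernels will land below these peaks, so treat the result as an upper-level estimate rather than a measurement.

```python
# Roofline "ridge point": the operations-per-byte ratio above which a kernel
# is limited by compute rather than by memory bandwidth.
# All hardware numbers are ASSUMED from the public H100 SXM datasheet.

H100_HBM_BYTES_PER_S = 3.35e12    # assumed HBM3 bandwidth
H100_FP8_OPS_PER_S = 1979e12      # assumed dense fp8 tensor-core peak
H100_FP32_OPS_PER_S = 67e12       # assumed fp32 vector (non-tensor) peak


def ridge_point(peak_ops_per_s: float, mem_bytes_per_s: float) -> float:
    """Ops per byte loaded needed to saturate compute before memory."""
    return peak_ops_per_s / mem_bytes_per_s


def is_compute_bound(ops_per_byte: float, ridge: float) -> bool:
    """A kernel is compute bound when its intensity meets or exceeds the ridge."""
    return ops_per_byte >= ridge


fp8_ridge = ridge_point(H100_FP8_OPS_PER_S, H100_HBM_BYTES_PER_S)
fp32_ridge = ridge_point(H100_FP32_OPS_PER_S, H100_HBM_BYTES_PER_S)

print(f"fp8 ridge:  {fp8_ridge:.0f} ops/byte")
print(f"fp32 ridge: {fp32_ridge:.0f} ops/byte")
```

With those assumed peaks the fp8 ridge comes out a little under 600 ops/byte and the fp32 vector ridge at about 20, which is where the numbers above come from. A kernel doing, say, 50 ops per byte would be compute bound on the fp32 vector units but still heavily memory bound for fp8 tensor cores.
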
Here is a relevant article: https://www.kdnuggets.com/2020/03/deep-learning-breakthrough...