You could do sparse deep learning with much, much larger models with these CPUs. As paradoxical as it might sound, sparse deep learning gets more compute bound as you add more cores.
I'd be curious to learn more about how it's compute bound and what specifically is compute bound. On modern H100s you need ~600 fp8 operations per byte loaded from memory in order to be compute bound, and that's with full 128-byte loads each time. Even integer/fp32 vector operations need quite a few operations to be compute bound (~20 for vector fp32).
I think you misunderstood what I mean. Sparse ML is inherently memory latency bound since you have a completely unpredictable access pattern prone to cache misses. The amount of compute you perform is a tiny blip compared to the hash map operations you perform. What I mean is that as you add more cores, there are sharing effects because multiple cores are accessing the same memory location at the same time. The compute bound sections of your code become a much greater percentage of the overall runtime as you add cores, which is surprising, since adding more compute is the easy part. Pay attention to my words "_more_ compute bound".