| HN Mirror

I think you misunderstood what I mean. Sparse ML is inherently memory latency bound since you have a completely unpredictable access pattern prone to cache misses. The amount of compute you perform is a tiny blip compared to the hash map operations you perform. What I mean is that as you add more cores, there are sharing effects because multiple cores are accessing the same memory location at the same time. The compute bound sections of your code become a much greater percentage of the overall runtime as you add cores, which is surprising, since adding more compute is the easy part. Pay attention to my words "_more_ compute bound".

Here is a relevant article: https://www.kdnuggets.com/2020/03/deep-learning-breakthrough...