|
|
|
|
|
by menaerus
88 days ago
|
|
I am trying to understand the reason behind why "zeroing got cheaper" circa 2012-2014. Do you have some plausible explanations that you can share? Haswell (2013) doubled the store throughput to 32 bytes/cycle per core, and Sandy Bridge (2011) doubled the load throughput to the same, but the dataset being operated at FB is most likely much larger than what L1+L2+L3 can fit so I am wondering how much effect the vectorization engine might have had since bulk-zeroing operation for large datasets is anyways going to be bottlenecked by the single core memory bandwidth, which at the time was ~20GB/s. Perhaps the operation became cheaper simply because of moving to another CPU uarch with higher clock and larger memory bandwidth rather than the vectorization. |
|