by BeeOnRope
2229 days ago
I don't have any Apple cores easily available, but the code [1] is open if anyone wants to try it (I don't know how POSIX-y the iOS compile environment is, though).

One caveat: just because you don't find a performance difference doesn't mean the optimization isn't happening. It could simply be that write throughput is not the limiter, but rather the latency * occupancy product. E.g., if it takes 50 ns to go from L2 to RAM, and there are only 10 buffers available to hold these requests, then the maximum bandwidth is 64 bytes / 50 ns * 10 buffers = 12.8 GB/s, regardless of the maximum possible bandwidth of each component.

An eliminated store may still take this full latency (since it still has to read the value from RAM), so even if all writes are eliminated, performance may remain at 12.8 GB/s. You would save power, plus memory and L3 bandwidth for other cores, but the optimization would be tough to detect by looking at performance alone. You'd need to look at performance counters (do those exist for iOS devices?).

[1] https://github.com/travisdowns/zero-fill-bench
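For anyone who wants to plug in their own numbers, the latency * occupancy bound from the example can be sketched roughly like this. The function name and parameters are made up for illustration, and the 50 ns / 10 buffers figures are the hypothetical values from the example, not measurements of any real core.

```python
def max_bandwidth_gb_per_s(line_bytes: int, latency_ns: float, buffers: int) -> float:
    """Upper bound on bandwidth when limited by in-flight requests.

    Each buffer holds one cache-line request for `latency_ns`, so at most
    `buffers * line_bytes` bytes complete every `latency_ns` nanoseconds.
    Bytes per nanosecond is numerically the same as GB/s (1e9 B / 1e9 ns).
    """
    return line_bytes * buffers / latency_ns


# Hypothetical figures from the example: 64-byte lines, 50 ns, 10 buffers.
print(max_bandwidth_gb_per_s(64, 50, 10))  # 12.8
```

Under this model the bound only moves if the latency or the number of buffers changes, which is why eliminating the write itself might not show up in a throughput measurement.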