by oconnor663 1478 days ago
The part about data dependencies across loop iterations is fascinating to me, because it's mostly invisible even when you look at the generated assembly. There's a related optimization that comes up in implementations of ChaCha/BLAKE, where we permute columns around in a kind of weird order, because it breaks a data dependency for an operation that's about to happen: https://github.com/sneves/blake2-avx2/pull/4#issuecomment-50...
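To make the loop-carried dependency idea concrete, here is a minimal sketch in Rust. It is not taken from the linked PR or from any ChaCha/BLAKE code; the function names `sum_single` and `sum_four` are made up for illustration. Both loops do the same number of floating-point adds, but in the first one every add has to wait for the previous add's result, while the second keeps four chains that don't depend on each other. Whether that actually runs faster depends on the CPU and the compiler, so treat it as an illustration of the dependency structure, not a benchmark.

```rust
/// Hypothetical example: one accumulator.
/// Each iteration reads the `acc` written by the previous iteration,
/// so the adds form a single serial chain across the whole loop.
pub fn sum_single(xs: &[f64]) -> f64 {
    let mut acc = 0.0;
    for &x in xs {
        acc += x;
    }
    acc
}

/// Hypothetical example: four accumulators.
/// acc[0..4] are four separate chains; an add on acc[1] does not need
/// the result of the add on acc[0], so they can overlap in the pipeline.
pub fn sum_four(xs: &[f64]) -> f64 {
    let mut acc = [0.0f64; 4];
    let mut chunks = xs.chunks_exact(4);
    for c in &mut chunks {
        for i in 0..4 {
            acc[i] += c[i];
        }
    }
    // Combine the four partial sums, then add the leftover elements
    // (fewer than four) that did not fill a whole chunk.
    let mut total = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for &x in chunks.remainder() {
        total += x;
    }
    total
}
```

Note that the two functions add the numbers in a different order, so for general floating-point input the results can differ in the last bits. That reordering is also the reason a compiler won't do this transformation for you on floats by default.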

The pipelining issue is interesting because my reaction becomes “shouldn’t the CPU just come with a larger vector size and then operate on chunks within the vector to optimize pipelining?” but then I realize I’m just describing a GPU.