| This "run two loops in parallel" pattern is _incredibly_ common in these "high speed benchmarks". A whole lot of different programs in the high-speed space use it (this CRC32, the lookup3 Jenkins Hash, my own AES random number generator, etc. etc.). Instead of the programmer manually thinking through the "two independent pipelines", it's quite possible to imagine a language where the two pipelines are programmed separately, and then a compiler merges them together knowing the pipeline details (ex: Skylake is 1-per-clock throughput / 3-clock latency, AMD might be different, like 1-per-clock throughput / 4-clock latency or something). The programmer's job would be to "separate out the two independent threads of execution", while the compiler's job would be to "merge them back together into one instruction stream". Much like how SIMD code is written today: the programmer is responsible for finding the parallelism, and the compiler / machine is responsible for efficient execution. -------- As it is, today's high speed programmers have to do both tasks simultaneously. We keep a mental model of the internal registers, pipelines, throughputs, and latencies of the different execution units, and manually schedule instructions to match that model. (But that mental model changes as new CPUs come out every few years.) The hard part is figuring out the parallelism. I don't think we have a language that describes this fine-grained parallelism though, in any case. Just a "what if we lived in a magically perfect world" kinda hypothetical here... EDIT: Alternatively, you can "cut dependencies" and hope that the compiler discovers your (intended) low-dependency chain, i.e. manually unroll loops and so on, which works better in practice than you might think (and such manual unrolling often seems to trigger the AVX autovectorizer in my experience, if you're lucky). |
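To make the "written separately" vs. "merged into one instruction stream" idea concrete, here is a minimal C sketch. Everything in it is made up for illustration: `step` is a hypothetical multiply-based mixing function (not CRC32, not lookup3), and `hash_separate` / `hash_interleaved` are names I invented. The first function is roughly what you'd like to write (two independent chains, one after the other); the second is what you end up scheduling by hand today so that the two multiplies in each iteration don't depend on each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mixing step: the multiply sits on the dependency chain,
   so each call has to wait for the previous value of h. */
static uint32_t step(uint32_t h, uint32_t w)
{
    return (uint32_t)((h ^ w) * 16777619u);
}

/* "Programmed separately": chain a over the first half, then chain b
   over the second half. Each loop is a single serial dependency chain. */
uint32_t hash_separate(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    size_t half = n / 2;
    uint32_t a = 1, b = 2;              /* arbitrary distinct seeds */
    for (size_t i = 0; i < half; i++)
        a = step(a, data[i]);
    for (size_t i = half; i < n; i++)
        b = step(b, data[i]);
    return step(a, b);                  /* combine the two lanes */
}

/* "Merged by hand": same two chains, interleaved in one loop body.
   The update of a and the update of b are independent of each other,
   so the CPU is free to overlap them. */
uint32_t hash_interleaved(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    size_t half = n / 2;
    uint32_t a = 1, b = 2;
    for (size_t i = 0; i < half; i++) {
        a = step(a, data[i]);           /* chain a: first half  */
        b = step(b, data[half + i]);    /* chain b: second half */
    }
    if (n % 2 != 0)                     /* odd n: second half is one longer */
        b = step(b, data[n - 1]);
    return step(a, b);
}
```

Both functions compute the same value for any input, since each chain sees its words in the same order either way. Whether the interleaved version is actually faster depends on the CPU's multiply latency/throughput and on what the compiler does with each loop, so measure before trusting it.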