Hacker News new | ask | show | jobs
by eklitzke 1424 days ago
A sufficiently smart compiler can obviously do any optimization that we can think of, but in practice the kind of optimization described in this article is very difficult to actually implement in an optimizing compiler. The CRC algorithm literally operates one bit at a time, in sequential order. As the article mentions, there are some properties of how the CRC math work that allow you to calculate the CRC from multiple chunks in a stream and then combine them in a specific way. This is the kind of thing that an optimizing compiler just isn't going to be able to figure out on its own given a naive CRC implementation because it's a math insight much moreso than it is a code generation trick.

In principle you could probably use some kind of symbolic math solving library to try to detect optimization opportunities like this in general code. In practice it just wouldn't be worth it, even at -O3, because it would add a ton of CPU and memory overhead to compilation and the optimization would be very rarely applicable in the first place.

1 comments

This "run two loops in parallel" pattern is an _incredibly_ common pattern for these "high speed benchmarks". A whole lot of different programs in the high-speed space (this CRC32, the lookup3 Jenkins Hash, My own AES random number generator, etc. etc.).

Instead of the programmer manually thinking of the "two independent pipelines", its quite possible that we can imagine a language where the two pipelines were programmed separately, and then a compiler merges them together knowing about the pipeline details (ex: Skylake is 1-per-clock throughput / 3-clock latency, AMD might be different like 1-per-clock throughput / 4-clock latency or something).

The programmer's job would be to "separate out the two independent threads of execution", while the compiler's job is to "merge them back together into one instruction stream".

Much like how SIMD-code is written today, the programmer is responsible for finding the parallelism. The compiler / machine is responsible for efficient execution.

--------

As it is, today's high speed programmers have to do both tasks simultaneously. We have a mental model of the internal registers, pipelines, throughput, latencies of different execution units, and manually schedule them to match our mental model. (But that mental model changes as new CPUs come out every few years).

The hard part is figuring out the parallelism. I don't think we have a language that describes this fine-grained parallelism though, in any case. Just a "what if we lived in a magically perfect world" kinda hypothetical here...

EDIT: Alternatively, you can "cut dependencies" and hope that the compiler discovers your (intended) low dependency chain. IE: Manually unroll loops and whatever, which works better in practice than you might think (and such manual unrolling often seems to trigger the AVX autovectorizer in my experience, if you're lucky).