i found it pretty inspiring. not only was it a neat example of coding with instruction-level parallelism in mind, it was also a concrete demonstration that doing so does indeed provide significant speedups, even in a piece of code that people have been using for decades (and which had presumably already been optimised for the single-everything cpu case).