Given that the author mentions having multiple cores available, I'd guess you could use any method, including MPI, to distribute the computation. But whether you use 1 core or 10k cores, it would be nice to get a 20x speedup on each core from this arithmetic/fixed-size optimization. Since that's the focus of the article, communication technologies feel pretty unrelated.