Hacker News new | ask | show | jobs
by gpderetta 711 days ago
You want to use SIMD and multiple accumulators. In fact not only you want to use as many accumulators as the number of SIMD ALUs, as SIMD operations are usually longer latency you usually unroll SIMD loops for software pipelining, using more accumulators to break loop carried dependencies.