by gpderetta 1075 days ago
Indeed, the blocked vectorization with 8-bit accumulators shown elsethread is going to be faster, and in that case reducing the sum to 1 bit per iteration is worth it.
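For readers who have not seen the version elsethread, here is a rough sketch of what blocking with 8-bit accumulators can look like. It assumes the task is counting the bytes equal to some target value, which may not match the original problem exactly. The function name `count_byte_blocked` and the constants `LANES` and `MAX_ITERS` are made up for illustration, and it is plain C that relies on the compiler auto-vectorizing the inner loop rather than on intrinsics. The point is that each iteration adds at most 1 to each lane, so a `uint8_t` lane cannot overflow within a block of 255 iterations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants (assumptions, not from the original thread). */
enum { LANES = 16, MAX_ITERS = 255 };

/* Hypothetical helper: counts the bytes equal to `target` in buf[0..n). */
uint64_t count_byte_blocked(const unsigned char *buf, size_t n,
                            unsigned char target)
{
    assert(buf != NULL || n == 0);

    uint64_t total = 0;
    size_t i = 0;

    while (n - i >= LANES) {
        /* One 8-bit accumulator per lane, reset at the start of each block. */
        uint8_t acc[LANES] = {0};

        /* At most 255 iterations per block, so no lane can wrap around. */
        size_t iters = (n - i) / LANES;
        if (iters > MAX_ITERS)
            iters = MAX_ITERS;

        for (size_t k = 0; k < iters; k++) {
            /* Each lane gains 0 or 1 per iteration: 1 bit of information. */
            for (size_t l = 0; l < LANES; l++)
                acc[l] += (uint8_t)(buf[i + l] == target);
            i += LANES;
        }

        /* Widen and flush the narrow accumulators once per block. */
        for (size_t l = 0; l < LANES; l++)
            total += acc[l];
    }

    /* Scalar tail for the last few bytes that do not fill all lanes. */
    for (; i < n; i++)
        total += (buf[i] == target);

    return total;
}
```

If each iteration could contribute more than 1 (say, up to 2), the block would have to shrink to 127 iterations to stay safe, and the widening flush would run about twice as often. That is presumably why getting the per-iteration contribution down to a single bit pays off here.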