| HN Mirror


	by dahart 1248 days ago
	It might be slower to use a memory access, even if you assume an average L1 cache hit. You can pipeline the bitwise version and maybe achieve an amortized divide in one cycle, whereas (depending on process and many other things) the lookup table might peak at something like 4 cycles per divide even in unrolled/parallel situations. Plus, the technique in the article is good for just about any constant divisor and, if generalized, works on much larger ranges. That's why many compilers use this trick when they can.
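
	To make the comparison concrete, here is a minimal sketch in C of the two approaches. The divisor 10, the 256-entry table size, and all function names are assumptions picked for illustration, not taken from the article. The multiply-and-shift version uses the well-known constant 0xCCCCCCCD (the ceiling of 2^35 / 10), which gives exact results for every 32-bit unsigned input.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: divisor 10 and an 8-bit input range for the table. */
enum { TABLE_SIZE = 256 };

/* n / 10 with no divide instruction and no memory access.
 * 0xCCCCCCCD == ceil(2^35 / 10); the 64-bit product shifted right by 35
 * equals floor(n / 10) for all 32-bit unsigned n. */
static inline uint32_t div10_mulshift(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* Lookup-table alternative: one load per divide, limited to TABLE_SIZE inputs. */
static uint8_t div10_table[TABLE_SIZE];

static void div10_table_init(void) {
    for (uint32_t i = 0; i < TABLE_SIZE; i++) {
        div10_table[i] = (uint8_t)(i / 10);
    }
}

static inline uint32_t div10_lookup(uint8_t n) {
    return div10_table[n];
}

/* Checks that both versions agree with plain division over the table's range. */
static void div10_self_check(void) {
    div10_table_init();
    for (uint32_t n = 0; n < TABLE_SIZE; n++) {
        assert(div10_mulshift(n) == n / 10);
        assert(div10_lookup((uint8_t)n) == n / 10);
    }
}
```

	The multiply version has no data dependency on memory, so independent divides can overlap in the pipeline, and it covers the full 32-bit range where the table only covers 0..255. Actual cycle counts depend on the CPU and surrounding code, so measure before picking one.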