Hacker News new | ask | show | jobs
by shihab 499 days ago
yeah, that's very likely the explanation. All these functions are pretty high latency instructions, vs rejection sampling which only involves a multiplication. On Nvidia GPUs, mul has latency of 1-4 cycles while others are 16-32.