by shihab
499 days ago
Yeah, that's very likely the explanation. All these functions are fairly high-latency instructions, whereas rejection sampling only involves a multiplication. On Nvidia GPUs, mul has a latency of 1-4 cycles, while the others take 16-32.
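
To make the comparison concrete, here is a rough Python sketch. It assumes the task is drawing a uniform point in the unit disk, which is my guess at the use case and not something stated in the thread; the names `disk_rejection` and `disk_polar` are made up for illustration. The rejection version needs only multiplies, adds, and a compare per attempt, while the direct version calls sqrt, cos, and sin.

```python
import math
import random


def disk_rejection(rng):
    """Uniform point in the unit disk via rejection sampling.

    Assumed example: each attempt uses only multiplies, adds, and one compare.
    """
    while True:
        x = 2.0 * rng.random() - 1.0  # uniform in [-1, 1)
        y = 2.0 * rng.random() - 1.0
        if x * x + y * y <= 1.0:      # keep the point if it lies inside the disk
            return x, y


def disk_polar(rng):
    """Same distribution, computed directly with sqrt, cos, and sin."""
    r = math.sqrt(rng.random())            # sqrt keeps the area density uniform
    theta = 2.0 * math.pi * rng.random()   # uniform angle
    return r * math.cos(theta), r * math.sin(theta)


if __name__ == "__main__":
    rng = random.Random(0)
    print(disk_rejection(rng))
    print(disk_polar(rng))
```

Each rejection attempt is accepted with probability pi/4, so the loop runs about 1.27 times on average. Python won't show the instruction-latency gap itself; the sketch is only meant to show which operations each approach needs.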