by winwang
498 days ago
Here's a rather trivial example of using PTX: https://docs.nvidia.com/cuda/parallel-thread-execution/#spec... For some micro-benchmarks I wanted a global clock rather than an SM-local one, and I believe PTX was needed to read it.
Also note that even CUDA has "lower-level" operations of its own, e.g. warp primitives. And PTX itself is super easy to embed in CUDA, much like inline asm.
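
To make both points concrete, here is a minimal sketch. It assumes the register behind the truncated link is `%globaltimer` (the comment does not say so explicitly), and the helper names `global_timer_ns`, `warp_sum`, and `demo` are made up for illustration. It reads the global timer through inline PTX and uses the `__shfl_down_sync` warp primitive to sum a value across one warp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Inline PTX: read %globaltimer, a nanosecond counter that is not tied to
// one SM (unlike clock64(), which counts cycles on the current SM).
// Assumption: this is the register the comment refers to.
__device__ __forceinline__ unsigned long long global_timer_ns() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

// Warp primitive: tree-reduce a value across the 32 lanes of a full warp.
// After the loop, lane 0 holds the sum of all lanes' inputs.
__device__ int warp_sum(int v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

__global__ void demo(unsigned long long *elapsed_ns, int *sum) {
    unsigned long long start = global_timer_ns();
    int total = warp_sum(threadIdx.x);  // lanes contribute 0, 1, ..., 31
    unsigned long long stop = global_timer_ns();
    if (threadIdx.x == 0) {             // only lane 0 has the full sum
        *elapsed_ns = stop - start;
        *sum = total;
    }
}

int main() {
    unsigned long long *d_ns = nullptr;
    int *d_sum = nullptr;
    cudaMalloc(&d_ns, sizeof(*d_ns));
    cudaMalloc(&d_sum, sizeof(*d_sum));

    demo<<<1, 32>>>(d_ns, d_sum);       // one block, exactly one warp

    unsigned long long ns = 0;
    int sum = 0;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&ns, d_ns, sizeof(ns), cudaMemcpyDeviceToHost);
    cudaMemcpy(&sum, d_sum, sizeof(sum), cudaMemcpyDeviceToHost);
    printf("warp sum = %d, elapsed = %llu ns\n", sum, ns);

    cudaFree(d_ns);
    cudaFree(d_sum);
    return 0;
}
```

The warp sum should come out as 496 (0 + 1 + ... + 31). The elapsed value depends on the GPU: the timer's update resolution is target-specific, so a measurement this short may well read 0.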