| HN Mirror



	by winwang 546 days ago
	Here's a fairly trivial example of using PTX: https://docs.nvidia.com/cuda/parallel-thread-execution/#spec... For micro-benchmarking reasons I wanted a global clock instead of an SM-local one, and I believe PTX was needed to get at it. Also note that CUDA itself has "lower-level" operations, e.g. warp primitives. And PTX is very easy to embed in CUDA code, much like inline asm.
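
	To make that concrete, here is a minimal sketch, assuming the special register behind the truncated link is `%globaltimer` (my guess from the "global clock" description, not something the link confirms). The helper `global_timer_ns` and the kernel `timed_warp_sum` are hypothetical names made up for illustration. It reads the timer through inline PTX and wraps it around a warp-shuffle reduction, which is one of the warp primitives available in plain CUDA.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: read the PTX special register %globaltimer.
// Inside inline asm the '%' of a special register is escaped as "%%".
__device__ __forceinline__ unsigned long long global_timer_ns() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

// Hypothetical kernel: sum 32 floats within one warp and time it.
__global__ void timed_warp_sum(const float* in, float* out,
                               unsigned long long* elapsed) {
    unsigned long long start = global_timer_ns();

    float v = in[threadIdx.x];
    // Warp primitive: each step adds the value held by the lane `offset` above.
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }

    unsigned long long stop = global_timer_ns();

    // After the loop, lane 0 holds the sum over the whole warp.
    if (threadIdx.x == 0) {
        *out = v;
        *elapsed = stop - start;
    }
}

int main() {
    const int n = 32;  // exactly one warp
    float host_in[n];
    for (int i = 0; i < n; ++i) host_in[i] = 1.0f;

    float* d_in = nullptr;
    float* d_out = nullptr;
    unsigned long long* d_elapsed = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMalloc(&d_elapsed, sizeof(unsigned long long));
    cudaMemcpy(d_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

    timed_warp_sum<<<1, n>>>(d_in, d_out, d_elapsed);

    float sum = 0.0f;
    unsigned long long elapsed = 0;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&sum, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&elapsed, d_elapsed, sizeof(unsigned long long),
               cudaMemcpyDeviceToHost);

    printf("warp sum = %f, elapsed timer ticks = %llu\n", sum, elapsed);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_elapsed);
    return 0;
}
```

	For comparison, `clock64()` is the per-SM counter you get without touching PTX. As far as I know the update rate of `%globaltimer` depends on the GPU, so I'd treat very short intervals measured this way with some suspicion.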