|
|
|
|
|
by bronxbomber92
2239 days ago
|
|
A couple of small comments. 1. The single pass prefix sum the author implements is not legal Vulkan code. The algorithm requires spinning on conditions to be satisfied by concurrently running threads. This is allowed to deadlock in Vulkan and is mentioned explicitly in the "Limited Forward Progress Guarantees" section of the Vulkan Memory Model blog post linked to by the author. The only GPUs on which is it technically safe to run this algorithm are the Volta and newer NVIDIA architectures. 2. The mask argument of the CUDA shuffle intrinsics is there, not because it is a more powerful abstraction, but because compiler technology of today is unable to prevent illegal code motion of shuffle intrinsics. There are threads on the LLVM mailing list that discuss this in some detail. 3. I am happy to see this topic written about. Portability of compute algorithms and where APIs differ or lack is not well understood and CUDA tends to dominate the discussions. Regarding subgroup portability - I do wonder if asking developers to write their code for specific subgroup sizes and allowing them to request subgroup sizes is the correct long term abstraction. In Metal you can get a long way writing code that uses subgroup ('simdgroup' in Metal speak) functionality without actually making any assumptions about the subgroup size. It is a balance - there will always be performance to be had if you hard code such assumptions, but the more we can close the gap, the less it makes sense to do so. |
|
On 1, I'm wondering if there's some confusion, especially as I don't see what difference Volta makes here. I accept that the Vulkan spec makes no guarantees about thread scheduling fairness, and that one could write an adversarial but spec-compliant implementation which schedules the atomic bump of the first workgroup to claim the first partition, then context switches to the second workgroup and attempts to run it to completion, which will deadlock.
But I think it is reasonable to make the assumption, for an extremely wide range of hardware, that when the atomic bump is scheduled, the code to compute the partition aggregate will make forward progress at least to the point where it writes that aggregate, which is all that is required for completion. Note that this all of this is about a workgroup depending on the progress of another workgroup. Obviously if invocations within a workgroup were depending on the progress of other invocations within the same workgroup, that would be a problem, but of course there are control barriers for that.
I don't see how it is any more technically safe to run this on Volta than any other hardware. If there is a document that guarantees stronger forward progress than mandated by the spec, I'm unaware of it.
From my reading of the spec, there is no way to write a one-pass prefix sum that is technically guaranteed deadlock-free. I'd argue that this is a defect of the spec, as there's a gap between useful guarantees that can safely be made by all real hardware and what the spec provides. I'll allow that fixing this defect is not easy. Until then, I think a good strategy is to ignore the defect.
On 3, while I hope to avoid code duplication, so far our track record has been pretty rough. For piet-metal, I tried writing the kernel in subgroup size adaptive way, but had problems with size 64 subgroups, as that requires dealing with 64 bit ballot values, and I never got that to work (I don't have 64 bit hardware in front of me, so had to iterate by asking other people to test). Also, for Brian's work on 32x32 boolean matrix transpose, we tried a hybrid approach using both subgroups and thread local storage, and performance was really disappointing. But, armed with this experience, perhaps we can do better going forward.