|
Thanks, especially the second paragraph. I'll probably update the post based on that (I've made several updates already). On 1, I'm wondering if there's some confusion, especially as I don't see what difference Volta makes here. I accept that the Vulkan spec makes no guarantees about thread scheduling fairness, and that one could write an adversarial but spec-compliant implementation which schedules the atomic bump of the first workgroup to claim the first partition, then context switches to the second workgroup and attempts to run it to completion, which will deadlock. But I think it is reasonable to make the assumption, for an extremely wide range of hardware, that when the atomic bump is scheduled, the code to compute the partition aggregate will make forward progress at least to the point where it writes that aggregate, which is all that is required for completion. Note that this all of this is about a workgroup depending on the progress of another workgroup. Obviously if invocations within a workgroup were depending on the progress of other invocations within the same workgroup, that would be a problem, but of course there are control barriers for that. I don't see how it is any more technically safe to run this on Volta than any other hardware. If there is a document that guarantees stronger forward progress than mandated by the spec, I'm unaware of it. From my reading of the spec, there is no way to write a one-pass prefix sum that is technically guaranteed deadlock-free. I'd argue that this is a defect of the spec, as there's a gap between useful guarantees that can safely be made by all real hardware and what the spec provides. I'll allow that fixing this defect is not easy. Until then, I think a good strategy is to ignore the defect. On 3, while I hope to avoid code duplication, so far our track record has been pretty rough. For piet-metal, I tried writing the kernel in subgroup size adaptive way, but had problems with size 64 subgroups, as that requires dealing with 64 bit ballot values, and I never got that to work (I don't have 64 bit hardware in front of me, so had to iterate by asking other people to test). Also, for Brian's work on 32x32 boolean matrix transpose, we tried a hybrid approach using both subgroups and thread local storage, and performance was really disappointing. But, armed with this experience, perhaps we can do better going forward. |
The property you're describing is discussed in https://johnwickerson.github.io/papers/forwardprogress_concu..., page 1:10:
"After probing our zoo of GPUs with litmus tests, we determined that all GPUs had the following property: once a work-group begins executing a kernel (i.e. the work-group becomes occupant on a hardware resource), it will continue executing until it reaches the end of the kernel. We call this execution model the occupancy bound execution model, because the number of work-groups for which relative forward progress is guaranteed is bound by the hardware resources available for executing work-groups; i.e. the hardware resources determine how many work-groups can be occupant."
However, I know even that property is not true of all GPUs and thus cannot be assumed by Vulkan spec compliant code. (I actually doubt it's true of any GPU's pre-Volta - it's just that the progress needed is often a natural side-effect of switching between threads for latency hiding).
Volta does guarantee this property + more. All starvation-free algorithms are supported by the architecture. See https://devblogs.nvidia.com/inside-volta/. In particular, this line:
"Independent thread scheduling in Volta ensures that even if a thread T0 currently holds the lock for node A, another thread T1 in the same warp can successfully wait for the lock to become available without impeding the progress of thread T0."
Independent thread scheduling is often thought about as providing a deadlock freedom guarantee between threads in the same warp, but the same guarantees must also extend to coarse execution scopes (e.g. a workgroup waiting on another workgroup).
So, as far as Vulkan is concerned, I do not consider this a defect in the spec. I have seen this exact algorithm deployed in production and deadlock despite never having deadlocked at the developer's desk.
On 3 - sounds like another good blog post topic. It'd be interesting to hear your experience trying to (and being unable to) write shaders that are subgroup agnostic but use subgroup functionality. I should Brian's blog post on the matrix transpose as well, maybe some of that experience is documented there.