| HN Mirror

The most complete documentation is in the applegpu repo[1] by dougallj showing a great deal of recent activity (including by alyssarosenzweig). Last I checked, the documentation of barrier instructions wasn't complete enough to tell whether these device-scoped barriers are possible. (Note: on RDNA2, they're accomplished by DLC and GLC flags on memory accesses, combined with cache flush instructions such as S_GL1_INV).

There's also a lot of great material, accessibly written, on Alyssa's blog[2], see in particular the posts titled "Dissecting the Apple M1 GPU, part ${I}".

[1]: https://github.com/dougallj/applegpu

[2]: https://rosenzweig.io/