The most complete documentation is in the applegpu repo[1] by dougallj, which shows a great deal of recent activity (including from alyssarosenzweig). Last I checked, the documentation of barrier instructions wasn't complete enough to tell whether device-scoped barriers are possible. (Note: on RDNA2, they're accomplished by DLC and GLC flags on memory accesses, combined with cache flush instructions such as S_GL1_INV.)
There's also a lot of great, accessibly written material on Alyssa's blog[2]; see in particular the posts titled "Dissecting the Apple M1 GPU, part ${I}".