Duff is relying on three things: that K&R C's syntax allows you to intermingle the switch block and the loop; the choice (common at the time, but now generally frowned on or even prohibited in new languages) to fall through cases if you don't explicitly break; and the related fact that C lets your loop jump back inside the switch.
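For anyone who hasn't seen it written out, here is a minimal sketch of the construct in C. Note one assumption: this is the memory-to-memory variant (the destination pointer advances), whereas Duff's original wrote every element to a single output register. The function name `duff_copy` is just for illustration.

```c
#include <assert.h>

/* Copy `count` shorts from `from` to `to`, with the loop body unrolled
   eight times. Memory-to-memory variant: in Duff's original the
   destination was a fixed register, so `to` was not incremented. */
void duff_copy(short *to, const short *from, int count)
{
    assert(count > 0);            /* the construct assumes at least one element */
    int n = (count + 7) / 8;      /* number of passes through the do-while */

    switch (count % 8) {          /* jump into the middle of the loop body */
    case 0: do { *to++ = *from++; /* fall through */
    case 7:      *to++ = *from++; /* fall through */
    case 6:      *to++ = *from++; /* fall through */
    case 5:      *to++ = *from++; /* fall through */
    case 4:      *to++ = *from++; /* fall through */
    case 3:      *to++ = *from++; /* fall through */
    case 2:      *to++ = *from++; /* fall through */
    case 1:      *to++ = *from++;
            } while (--n > 0);    /* loops back to `case 0`, inside the switch */
    }
}
```

The first pass handles the `count % 8` leftover elements by entering partway through the body; every later pass runs all eight copies.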
Duff is trying to optimise MMIO. You wouldn't do anything close to this today, even in C, not least because MMIO no longer keeps pace with your CPU's instruction rate, and for non-trivial amounts of data you have DMA (which Duff's hardware did not). In a modern language you also wouldn't treat "MMIO" as just pointer indirection. To keep this working in C they have kept adding hacks to the type system rather than saying: OK, apparently this is an intrinsic, we should bake it into the freestanding mode of the stdlib.
Edited to add:
For my money the successor to Tom Duff's "Device" is WUFFS' "iterate loops" mechanism where you may specify how to partially unroll N steps of the loop, promising that this has equivalent results to running the main loop body N times but potentially faster. This makes it really easy for vectorisation to see what you're trying to do, while still handling those annoying corner cases where M % N != 0 correctly because that's the job of the tool, not the human.
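To make the idea concrete without guessing at Wuffs syntax, here is a rough C sketch of the shape such a tool would produce for you: an unrolled main loop plus a remainder loop for the M % N != 0 case. The function `sum_bytes` and the unroll factor of 4 are my own assumptions for illustration, not anything taken from Wuffs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum `len` bytes. Hand-written stand-in for what a tool-generated
   partial unroll looks like; N = 4 is an arbitrary choice here. */
uint32_t sum_bytes(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);

    uint32_t total = 0;
    size_t i = 0;

    /* Unrolled part: four steps per iteration, equivalent to running
       the single-step body four times. */
    for (; i + 4 <= len; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }

    /* Remainder: the len % 4 leftover elements, one step at a time. */
    for (; i < len; i++) {
        total += data[i];
    }

    return total;
}
```

Compared with Duff's device, the control flow here is two ordinary loops, which is the form an optimiser can readily recognise.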
The overarching point appears to be getting rid of angle brackets, which is not something that Duff is doing. Further, Duff's device keeps case labels on the left of its control structure; moving ifs to the left is the other "innovation" here.
I think you really have to squint your eyes to see the similarities, beyond the general theme of exploiting the counterintuitive properties of switch statements.
While it's convenient technically to have unified memory and so it makes a lot of sense for your machine code, in fact the MMIO isn't just memory, and so to make this work anyway in the C abstract machine they invented the "volatile" qualifier. (I assume you weren't involved back then?)
This should be a suite of intrinsics. It's the same mistake as "register" storage, a layer violation, the actual mechanics bleeding through into the abstract machine and making an unholy mess.
If you had intrinsics it's obvious where the platform specific behaviour lives. Can we "just" do unaligned 32-bit stores to MMIO? Can we "just" write one bit of a hardware register? It depends on your platform and so as an intrinsic it's obvious how to reflect this, whereas for a type qualifier we have no idea what the compiler did and the ISO document of course has to be vague to be inclusive of everybody.
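As a rough sketch of what I mean, assuming made-up names (`mmio_write32`, `mmio_read32` and `mmio_set_bit32` are hypothetical, not from any standard or real library): a real intrinsic would be implemented by the compiler, so a volatile access stands in for it here just to keep the example compilable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical intrinsic-style API. The volatile casts are only a
   stand-in for whatever the platform's compiler would actually emit. */

static inline void mmio_write32(uintptr_t addr, uint32_t value)
{
    /* Assumed platform rule for this sketch: 32-bit stores must be
       aligned. The rule lives here, visibly, in one place. */
    assert(addr % sizeof(uint32_t) == 0);
    *(volatile uint32_t *)addr = value;
}

static inline uint32_t mmio_read32(uintptr_t addr)
{
    assert(addr % sizeof(uint32_t) == 0);
    return *(volatile uint32_t *)addr;
}

/* "Write one bit" spelled out as what it is on this assumed platform:
   a full-width read followed by a full-width write. */
static inline void mmio_set_bit32(uintptr_t addr, unsigned bit)
{
    assert(bit < 32);
    mmio_write32(addr, mmio_read32(addr) | (UINT32_C(1) << bit));
}
```

On a platform with genuine single-bit register writes, `mmio_set_bit32` would have a different body, and that difference would be visible in the API rather than hidden behind a qualifier.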
I wasn't involved back then, but I know the history. I thought you were talking about something more recent.
But this is all opinion, and terms such as "unholy mess" etc. do not impress me. In my opinion "volatile" is just fine, as is "register". Neither is a layer violation nor a type system problem. That the exact semantics of a volatile access are implementation-defined seems natural. How is this better with an intrinsic? What I would call a mess are the atomics intrinsics, which, despite being intrinsics, are entirely unsafe and dangerous and indeed a mess (I just saw a couple of new bugs in our bug tracker).