Thanks for the kind words! We've been doing a lot more work in this direction too, for both compiler-based automatic parallelization [0] and a work-in-progress pmap successor for 'manual' parallelism (per-device code and explicit collectives) [1] which composes seamlessly with the compiler-based stuff.