I mean SIMT (CUDA style) or SPMD, and also TensorFlow's host/device runtime scheduling. It's a mystery to me how Nvidia etc. schedule AI workloads (along with intrinsics) to achieve such massive parallelism.
For CUDA, iirc these tasks (kernel launches) are put into streams (ordered queues of work, loosely similar to threads) and scheduled for execution by the driver. Within each kernel launch, every group of 32 threads, called a warp, executes together as a single unit; when threads in a warp take different control-flow paths, the lanes that are not on the current path are masked off while the others run. I think scheduling happens at the warp level (as far as I know, by the hardware warp schedulers on each SM rather than by the driver), and since the threads in a warp are executing similar things they can be scheduled together. I am not an expert in this, so I am not sure this is exactly how they do it.
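To make the divergence point concrete, here is a toy Python model of a single 32-lane warp hitting an if/else. It is only a sketch of the masking idea, not how the hardware or driver is implemented; the function name `run_warp` and the even/odd branch are made up for illustration.

```python
WARP_SIZE = 32  # warp width on NVIDIA GPUs


def run_warp(values):
    """Toy model of one warp running: if v is even, v //= 2, else v = 3*v + 1.

    Returns the per-lane results and the number of branch sides the
    warp had to step through (1 if all lanes agree, 2 if they diverge).
    """
    assert len(values) == WARP_SIZE
    out = list(values)
    # Every lane evaluates the branch condition in the same step.
    mask = [v % 2 == 0 for v in values]
    passes = 0
    # "then" side: only lanes whose mask bit is set are active.
    if any(mask):
        passes += 1
        for lane in range(WARP_SIZE):
            if mask[lane]:
                out[lane] = values[lane] // 2
    # "else" side: the complementary lanes are active.
    if not all(mask):
        passes += 1
        for lane in range(WARP_SIZE):
            if not mask[lane]:
                out[lane] = 3 * values[lane] + 1
    return out, passes


# All lanes take the same side of the branch: one pass.
uniform_out, uniform_passes = run_warp([2 * i for i in range(WARP_SIZE)])
# Even and odd lanes disagree: both sides are stepped through in turn.
mixed_out, mixed_passes = run_warp(list(range(WARP_SIZE)))
print(uniform_passes, mixed_passes)  # prints: 1 2
```

The takeaway is that a warp whose lanes disagree on a branch pays for both sides, which is why branch-heavy code tends to run poorly on GPUs.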
That's pretty much how it works, and they sync threads inside a warp. I have been groping in the dark for a while, and have always hoped someone would write up the details to help me get the idea straight.
Intel, AMD, and Google (TPU) all have their own ways of scheduling kernels, which are very different from CUDA. There are no details about them; I was just curious how they work across CPU and GPU.
GPU "threads" aren't exactly CPU threads, and GPU "cores" aren't really CPU cores. It's better to think of a thread as one lane of a SIMD instruction and a core as an ALU.
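A small Python sketch of that mental model: the threads of a block are carved into fixed-width groups, and each thread is just a (warp, lane) pair. The helper names `split_block_into_warps` and `warp_and_lane` are hypothetical, and the mapping assumes a one-dimensional block with consecutive thread indices.

```python
WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def split_block_into_warps(block_size):
    """Group thread indices 0..block_size-1 into warps of up to 32 lanes."""
    warps = []
    for first in range(0, block_size, WARP_SIZE):
        # The last warp is only partially filled if block_size is not a multiple of 32.
        lanes = list(range(first, min(first + WARP_SIZE, block_size)))
        warps.append(lanes)
    return warps


def warp_and_lane(thread_idx):
    """Return (warp index, lane index) for a thread in a 1D block."""
    return thread_idx // WARP_SIZE, thread_idx % WARP_SIZE


warps = split_block_into_warps(100)
print(len(warps), len(warps[-1]))  # prints: 4 4
print(warp_and_lane(70))           # prints: (2, 6)
```

So a block of 100 "threads" is really 4 wide instructions streams, the last one with only 4 of its 32 lanes doing useful work.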
For rendering, GPU execution order is typically either "immediate mode" or "tile mode". Tiles are more common on mobile GPUs, but Nvidia has also used them.