Hacker News
by synergy20 1452 days ago
I mean SIMT (CUDA style) or SPMD, and also TensorFlow host/device runtime scheduling. It's a mystery to me how Nvidia etc. schedule AI workloads (along with intrinsics) to achieve such massive parallelism.

For CUDA, iirc these tasks (kernel launches) are put into streams (similar to threads) and scheduled for execution by the driver. Within each kernel launch, every group of 32 threads, called a warp, executes together as a single unit, and threads skip the instructions on paths they did not take when their control flow differs. I think the driver perhaps schedules at the warp level, and since the threads in a warp are doing similar things they can be scheduled together. I am not an expert in this, so I am not sure if this is how they do it.
That's pretty much how it works, and they sync threads inside a warp. I have been groping in the dark for a while, and always hoped someone would write up some details to help me get the idea straight.
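To make the warp behaviour described above a bit more concrete, here is a rough Python simulation (not CUDA, and not how any real driver or GPU is implemented). It models one 32-lane warp running an if/else: each branch path that at least one lane takes costs the whole warp one issue slot, and lanes not on that path are masked off. The function name `run_warp` and the "issue slot" counting are assumptions made for illustration only.

```python
WARP_SIZE = 32  # CUDA warps are 32 threads wide


def run_warp(values):
    """Simulate one warp running: if x is even, x *= 2, else x += 1.

    Returns (results, issue_slots). Every branch path with at least one
    active lane costs the whole warp one issue slot; lanes that did not
    take that path are masked off and sit idle.
    """
    assert len(values) == WARP_SIZE
    results = list(values)
    even_mask = [v % 2 == 0 for v in values]
    odd_mask = [not m for m in even_mask]
    issue_slots = 0
    paths = (
        (even_mask, lambda x: x * 2),  # "if" side
        (odd_mask, lambda x: x + 1),   # "else" side
    )
    for mask, op in paths:
        if not any(mask):
            continue  # no lane takes this path, so the warp skips it
        issue_slots += 1
        for lane in range(WARP_SIZE):
            if mask[lane]:  # only active lanes write a result
                results[lane] = op(values[lane])
    return results, issue_slots


# All lanes agree on the branch: one path is executed.
uniform = run_warp([2] * WARP_SIZE)
# Lanes alternate even/odd: both paths are executed one after the other.
divergent = run_warp(list(range(WARP_SIZE)))
print(uniform[1], divergent[1])
```

In this toy model the uniform warp needs one issue slot and the divergent warp needs two, which is the intuition behind trying to keep threads in the same warp on the same control-flow path.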

Intel, AMD, and Google (TPU) all have their own ways of scheduling 'kernels', which are very different from CUDA. There are no details about them; I was just curious how they work across CPU and GPU.

Thanks for the reply.

Thank you all for this. Scheduling on GPUs is still a topic in the dark for me, yet to be discovered.
GPU "threads" aren't exactly CPU threads, and GPU "cores" aren't really CPU cores. It's better to think of threads as SIMD instructions and cores as ALUs.
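As a small illustration of why a GPU "thread" behaves more like one slot of a wide SIMD unit than like a CPU thread, the sketch below maps a linear thread index within a one-dimensional block to a warp number and a lane within that warp, assuming the CUDA warp size of 32. The helper names `warp_and_lane` and `warps_per_block` are hypothetical and only for illustration.

```python
WARP_SIZE = 32  # CUDA warp width


def warp_and_lane(thread_idx):
    """Map a linear thread index in a 1-D block to (warp id, lane id)."""
    return divmod(thread_idx, WARP_SIZE)


def warps_per_block(block_size):
    """Number of warps needed for a block; the last warp may be partly empty."""
    return -(-block_size // WARP_SIZE)  # ceiling division


# Thread 70 sits in warp 2, lane 6.
example = warp_and_lane(70)
# A 100-thread block needs 4 warps; the last one has only 4 active lanes.
n_warps = warps_per_block(100)
active_in_last = 100 - (n_warps - 1) * WARP_SIZE
print(example, n_warps, active_in_last)
```

A block size that is not a multiple of 32 still occupies whole warps, so in this example 28 lanes of the last warp do no useful work.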

GPU execution order is typically either "immediate mode" or "tile mode". Tiles are more common on mobile GPUs, but Nvidia has also used them.

https://www.realworldtech.com/tile-based-rasterization-nvidi...

thank you for the details!