For LLM work, reading the Flash Attention and vLLM kernel source taught me more than any book. Real code makes the memory hierarchy concrete, while books stay too abstract.
The story of Flash Attention is the best illustration of both the power and the difficulty of GPU programming. This page gives a nice overview of it: https://aiwiki.ai/wiki/flash_attention