The GPU race is getting really hot, and a lot of work is being done to squeeze out every ounce of performance, especially for LLM training and inference.
One resource I would recommend is “Programming Massively Parallel Processors” [1].
I am also working through it as a hobby project and uploading my notes here [2].
[1] https://shop.elsevier.com/books/programming-massively-parall...
[2] https://github.com/mandliya/PMPP_notes