Pretty much what I did to let Codex with gpt5.4xhigh improve my fairly complex CUDA kernel which resulted in 20x throughput improvement.
https://pastebin.com/eac0SAYg