PyTorch is a beast of its own when you try to put it into production. They have started addressing that with torch 2.0, but it still has a long way to go. Given that, you can switch to TF Serving if you have a standard architecture.
My understanding of Triton is more that it is an alternative to CUDA: you write kernels directly in Python, at a slightly higher level, and the compiler does a lot of the optimizations automatically. So basically: Python -> Triton-IR -> LLVM-IR -> PTX.
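To make that concrete, here is a minimal sketch of what "writing the kernel in Python" looks like, modeled on the usual vector-add example. The names `add_kernel`, `add`, `num_blocks` and `HAVE_GPU_STACK` are my own, and the guard at the top is only there so the snippet still imports on a machine without torch, triton or a CUDA GPU. Treat it as an illustration, not a tuned kernel.

```python
try:
    import torch
    import triton
    import triton.language as tl
    # The kernel can only be launched when a CUDA device is present.
    HAVE_GPU_STACK = torch.cuda.is_available()
except ImportError:
    # torch or triton is not installed: keep the module importable.
    HAVE_GPU_STACK = False


def num_blocks(n_elements: int, block_size: int) -> int:
    """Ceil division: how many program instances are needed to cover the input."""
    return (n_elements + block_size - 1) // block_size


if HAVE_GPU_STACK:

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one contiguous block of elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        # Mask out-of-range lanes in the last block.
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x, y, block_size=1024):
        """Element-wise x + y for two CUDA tensors of the same shape."""
        out = torch.empty_like(x)
        n = out.numel()
        grid = (num_blocks(n, block_size),)
        # The first launch triggers compilation of the Python function.
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=block_size)
        return out
```

With a GPU available, `add(x, y)` on two CUDA tensors should match `x + y`. Note that you write the code per block rather than per thread as in CUDA; how the block maps onto threads is left to the compiler, which is where the automatic optimizations come in.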