I don't think there's such a thing as key frames here, just frames. And if you run every frame through SD, the output will be janky because SD doesn't know about temporal coherence.
As soon as it's an MP4, it will have key frames all right. You could add AI upscaling to your encoder. People are making fun of "just", but I believe I could take apart ffmpeg to add this feature (PoC) in two weeks or less, provided somebody pays for my labor and for the HW.
Adding the word 'just' doesn't make it any easier. Something I've noticed is that people who have never done something themselves and are telling someone to do a difficult task will use: