The TL;DR is that video formats are relatively mediocre for images, because the tradeoffs you make to compress an image that will be seen for 1/60th of a second are different from those for a static image.
Your point is not wrong in general, but I want to nitpick: WebP and AVIF are based on the video compression techniques used for I-frames (intra-coded frames). An I-frame is itself only shown for a single 1/FPS duration, but it affects a longer time slice of the video because other frames reference it.
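To put a rough number on that, here is a toy sketch, not how any real codec works. It assumes a completely static scene, a GOP of one I-frame followed by P-frames that simply copy their reference with no residual, and made-up values for the frame rate, GOP length, and quantization step. The helper names (`quantize`, `decode_gop`, `frames_with_error`) are hypothetical and exist only for this illustration.

```python
def quantize(pixels, step):
    """Coarsely quantize pixel values to the nearest multiple of `step`."""
    return [step * round(p / step) for p in pixels]


def decode_gop(i_frame_pixels, gop_length, step):
    """Toy decoder: one quantized I-frame, then P-frames that copy their reference.

    Assumption: the scene is static and every P-frame is a pure "skip",
    so no residual is ever sent to correct the I-frame's error.
    """
    reconstructed = [quantize(i_frame_pixels, step)]
    for _ in range(gop_length - 1):
        # Each P-frame inherits whatever the previous frame decoded to.
        reconstructed.append(list(reconstructed[-1]))
    return reconstructed


def frames_with_error(source, frames):
    """Count decoded frames that differ from the original source pixels."""
    return sum(1 for frame in frames if frame != source)


source = [13, 77, 142, 201]  # a tiny made-up "image"
fps = 60                     # assumed frame rate
gop_length = 120             # assumed distance between I-frames

frames = decode_gop(source, gop_length, step=8)
affected = frames_with_error(source, frames)
visible_seconds = affected / fps

print(f"{affected} of {gop_length} frames carry the I-frame's error")
print(f"that error is on screen for {visible_seconds} s, not 1/{fps} s")
```

Under these assumptions the I-frame's quantization error stays on screen for the whole GOP rather than a single frame. A real encoder can spend bits on residuals in later frames to correct it, so this is closer to a worst case than a typical one.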