Well, keyframes are a thing, and the still image is likely to be encoded a few hundred times over an average-length song. Especially in an HD video (which is what you want, since the audio bitrate grows along with the video quality), the video part could be massive compared to the actual music.
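To put rough numbers on that, here is a back-of-the-envelope sketch. The keyframe interval, keyframe size and audio bitrate are all assumptions picked for illustration, not measurements of any real encoder or of YouTube's actual settings.

```python
import math

def keyframe_count(duration_s: float, keyframe_interval_s: float) -> int:
    """Number of keyframes in a clip, counting the one at t=0."""
    return math.ceil(duration_s / keyframe_interval_s)

def still_image_bytes(duration_s: float, keyframe_interval_s: float,
                      keyframe_bytes: int) -> int:
    """Bytes spent re-encoding the same still image at every keyframe."""
    return keyframe_count(duration_s, keyframe_interval_s) * keyframe_bytes

def audio_bytes(duration_s: float, bitrate_kbps: int) -> int:
    """Size of the audio stream at a constant bitrate."""
    return int(duration_s * bitrate_kbps * 1000 / 8)

# Assumed values: a 4-minute song, one keyframe per second,
# 50 kB per HD keyframe, 128 kbps audio.
song_s = 4 * 60
video = still_image_bytes(song_s, 1.0, 50_000)
audio = audio_bytes(song_s, 128)
print(keyframe_count(song_s, 1.0), video, audio)
```

With those assumed values the repeated still image costs about three times as much as the audio itself. A longer keyframe interval shrinks that a lot, so the real ratio depends entirely on how the video was encoded.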
The only problem with music on YouTube is that quality varies a lot, and it can be hard to find good-quality versions of the songs you want.
[1] https://github.com/rg3/youtube-dl