There's so much good metadata (likes, comments, duration, sound used, views, like/view ratio, skips, loops, subscribes, etc.) that I'd be surprised if they were digging into the contents of the video at all right now.
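To make the metadata point concrete, here is a rough sketch of how raw counters could be turned into per-view features. The `VideoStats` fields and the `engagement_features` helper are made up for illustration; nothing here is known about how any real app computes or weights these signals.

```python
from dataclasses import dataclass


@dataclass
class VideoStats:
    """Hypothetical per-video counters; field names are illustrative only."""
    views: int
    likes: int
    comments: int
    skips: int
    loops: int
    duration_s: float


def engagement_features(s: VideoStats) -> dict[str, float]:
    """Convert raw counters into per-view ratios a ranking model could consume."""
    views = max(s.views, 1)  # avoid dividing by zero for brand-new videos
    return {
        "like_view_ratio": s.likes / views,
        "comment_rate": s.comments / views,
        "skip_rate": s.skips / views,
        "loop_rate": s.loops / views,
        "duration_s": s.duration_s,
    }


stats = VideoStats(views=1000, likes=150, comments=20, skips=300, loops=450, duration_s=15.0)
features = engagement_features(stats)
```

None of this needs to look at a single frame of the video, which is the appeal.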
They could also be digging only into audio, doing speech recognition on it, then clustering the text. Augment that with the text users have put into the video directly using the in-app editor and you have some pretty solid data.
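A toy version of the "cluster the text" step, assuming the speech recognition has already produced transcripts. This uses plain bag-of-words vectors and a greedy similarity threshold, which is a stand-in I picked for simplicity and not a claim about what anyone runs in production; `cluster_texts` and the `threshold` value are my own assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_texts(texts: list[str], threshold: float = 0.3) -> list[list[int]]:
    """Greedy single-pass clustering: each text joins the first cluster whose
    first member (the "leader") is similar enough, else it starts a new cluster."""
    leaders: list[Counter] = []
    clusters: list[list[int]] = []
    for i, text in enumerate(texts):
        vec = bag_of_words(text)
        for leader, members in zip(leaders, clusters):
            if cosine(vec, leader) >= threshold:
                members.append(i)
                break
        else:
            leaders.append(vec)
            clusters.append([i])
    return clusters


transcripts = [
    "how to bake sourdough bread at home",
    "easy sourdough bread recipe to bake",
    "my cat knocks cups off the table",
    "cat pushes cups off the table again",
]
groups = cluster_texts(transcripts)  # lists of indices into transcripts
```

The overlay text typed into the in-app editor could be appended to each transcript before clustering, since it is already text.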
If that were true, it'd be interesting to see if they push out support for closed captioning. It's an accessibility push, but it would also leverage a lot of the same capabilities...
Would this have any advantage over just using video embeddings (or a sequence of frame embeddings), which in theory should capture those things in vectorized form?
Video embeddings can be a) very expensive and b) very difficult to implement.
Video understanding is an active field of research, and I'm not sure the state of the art is there yet for capturing nuance like engagement potential, categories, etc.
Google was able to build a very useful search engine that ran for decades relying on the significance of links and keywords, without much understanding of the meaning of page content. You can get very far with the readily available data, before you need to delve into the fancy stuff to make it a few percent better.
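For anyone who hasn't seen how far link structure alone can go, here is a simplified PageRank-style power iteration. It is a textbook sketch, not Google's production system; the `pagerank` function, the damping value of 0.85, and the tiny example graph are all my own assumptions.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Score pages purely from who links to whom; page content is never read."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    if not pages:
        return {}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: "a" and "b" both link to "c", and "c" links back to "a".
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this toy graph "c" ends up ranked highest because two pages point at it, and "b" lowest because nothing does, all without any understanding of what the pages say.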
That would be the "sound used". The music in the video is specified/labeled before upload so there's no need to actually process the sound of the video.