It can be both very expensive and very difficult to implement.
Video understanding is an active field of research, and I'm not sure the state of the art is there yet for capturing nuances like engagement potential, categories, etc.
Google was able to build a very useful search engine that ran for decades relying on the significance of links and keywords, without much understanding of the meaning of page content. You can get very far with the readily available data, before you need to delve into the fancy stuff to make it a few percent better.
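To make the "readily available data" point concrete, here is a rough sketch of ranking items from metadata alone, with no analysis of the video content itself. Everything in it is an assumption for illustration: the `title`, `tags`, and `inbound_links` fields, the `score` and `rank` helpers, and the scoring formula are made up, and this is not how Google or any real system does it.

```python
import math

def score(query, item):
    """Score one item using only metadata: keyword overlap times a link-count prior."""
    query_terms = set(query.lower().split())
    # Hypothetical fields: a free-text title and a list of lowercase tags.
    item_terms = set(item["title"].lower().split()) | set(item["tags"])
    # Fraction of query terms found in the title or tags.
    keyword_match = len(query_terms & item_terms) / len(query_terms) if query_terms else 0.0
    # log1p dampens very large inbound-link counts so they do not swamp relevance.
    popularity = math.log1p(item["inbound_links"])
    return keyword_match * (1.0 + popularity)

def rank(query, items):
    """Return items sorted from highest to lowest score for the query."""
    return sorted(items, key=lambda item: score(query, item), reverse=True)

# Made-up sample data.
videos = [
    {"title": "Sourdough bread basics", "tags": ["baking", "bread"], "inbound_links": 40},
    {"title": "Bread baking for beginners", "tags": ["baking"], "inbound_links": 3},
    {"title": "Cat compilation", "tags": ["cats", "funny"], "inbound_links": 900},
]

for video in rank("bread baking", videos):
    print(video["title"])
```

The popular but irrelevant video ends up last because a zero keyword match zeroes out its score, which is the kind of cheap signal combination you can push quite far before needing any content understanding.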