by Slartie 2148 days ago
> Also notice that this model can be applied to other media formats: Text, pictures, audio ...

Actually, I don't think this is as easy as you might think. The article touches on this when it says that short video sequences are well suited to such an algorithm because they provide a high frequency of "inputs" per unit of time. But I think it falls short of describing the other thing that makes videos particularly suitable (and, by extension, it assumes that "the TikTok algorithm" has a great future in many other places too, which I am a bit more skeptical of). This other critical thing is that video also allows a huge variety of inputs to be gathered from consumption, which text, pictures and audio can't match.

- It is trivial to find out which part of a video a user has seen. This is nearly impossible to do reliably with textual content (assuming you don't have an eye tracker running).

- Compared with a still picture, a video gives the viewer many more things to see. Instead of just knowing that a picture contains a cat and deducing that the user likes cats, you can split a video into slices and know which ones contain a cat, which a dog, and which something else. From that single video you might then deduce the user's interest in cat/dog/whatever content all at once (depending on which parts the viewer has seen, which parts were skipped, at which point the viewer aborted, or at which point the like button was tapped).

- Video mostly also delivers audio, so everything you can gather from audio, like whether a user tends to prefer female or male voices, or which music style someone prefers, comes as a bonus when gathering info from video viewing.

- If your videos' audio features someone speaking, you can run speech-to-text on it and pump the result into the usual machine learning modules, from simple sentiment analysis through topic detection up to full-blown "trying to understand what this person is actually trying to say", and take that as an input for determining a viewer's interests. This is basically text analysis, so it applies to textual content and audio as well, but not so much to pictures.
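To make the "slices" idea from the bullets above a bit more concrete, here is a minimal sketch of how per-topic interest might be derived from a single viewing. Everything in it is an assumption for illustration: the `Segment` type, the `interest_scores` function, the label names, and the scoring rule (fraction of a slice watched, plus a fixed bonus for the slice that was playing when the like button was tapped). It is not a description of what TikTok actually does.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A slice of a video with one content tag (hypothetical, e.g. "cat")."""
    start: float  # seconds from the start of the video
    end: float
    label: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time ranges."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def interest_scores(segments, watched, like_at=None, like_bonus=1.0):
    """Score each label from one viewing of one video.

    segments:   labelled slices of the video
    watched:    (start, end) ranges the viewer actually played,
                assumed not to overlap each other
    like_at:    timestamp at which the like button was tapped, if any
    like_bonus: assumed extra weight for the slice that earned the like
    """
    scores = defaultdict(float)
    for seg in segments:
        length = seg.end - seg.start
        if length <= 0:
            continue  # ignore empty or malformed slices
        seen = sum(overlap(seg.start, seg.end, w0, w1) for w0, w1 in watched)
        # Fraction of this slice that was watched, capped at 1.0.
        scores[seg.label] += min(seen / length, 1.0)
        if like_at is not None and seg.start <= like_at < seg.end:
            scores[seg.label] += like_bonus
    return dict(scores)


# One 15-second video: cat, then dog, then a music-only part.
segments = [
    Segment(0, 5, "cat"),
    Segment(5, 10, "dog"),
    Segment(10, 15, "music"),
]
# The viewer watched the cat part, skipped most of the dog part,
# liked the video at second 3 and aborted at second 10.
print(interest_scores(segments, watched=[(0, 5), (8, 10)], like_at=3.0))
```

A still picture would yield at most one such data point per view, while this single short video yields three, and the skipped and aborted parts carry information too.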

Video really pumps out the maximum of all these content formats in terms of potentially relevant data points about someone's interests, and it does so at a really high frequency, especially when each video is as short as on TikTok and the content producers have already done the daunting work of condensing lots of content into as few seconds as possible.

1 comment

I'd like to see someone write a blog post taking a shot at (speculatively) "reverse engineering" how the TikTok algorithm works (or may work): what attributes it might extract from a video (some of which you've mentioned above) and what it might do with them. Basically, how the overall thing may work, as well as how it may improve over time, taking into account current cutting-edge ML techniques and speculative future capabilities.