Collaborative filtering is similar but for huge recommender systems they’re not going to create a huge MxN matrix where M is users and N is items. I think what they’re referring to would be called a “two tower” model where you have a learned vector for the user, a learned vector for the song, and the cosine similarity is their affinity. It’s pretty performant because you can cache the song vectors.