Hacker News new | ask | show | jobs
by nlplaylist 1248 days ago
TLDR: Lots of musical metadata converted into paragraphs and the SentenceTransformers Retrieve & Re-Rank Pipeline.(https://www.sbert.net/index.html)

The sentence embeddings are calculated using a Bidirectional Encoder Representation Transformer (BERT) model. There's a pre-trained model for this network trained on over 1 billion sentences from the internet that is publicly available, (thanks Microsoft) . The model transforms your description into a 784-long list of numbers (a vector) that represents the contextual meaning of your sentence.

The model runs off a dataset of musical metadata for 35,000 songs. As a "chronically online music nerd", I knew where to find it. The metadata is very rich, it has a lot of useful columns like the genres, subgenres, and descriptions of tracks. The numerical data is binned into categorical values like "obscure" mapping popularity between 0 and 10, "highly danceable" mapping danceability between 80 and 100, etc. The text data is modified into a coherent sentence: "this song's main genres are _____. this song is from the 80s. this name of this song is lovefool by the cardigans. etc"

An arduous part of the project was describing each musical genre in depth, with its own paragraph such that each genre's actual contextual meaning is captured and not just "This song is a Hyperpop song" or "This song is Adult Contemporary". It was a big exercise in music history and tested my knowledge of music. I also learned a lot about musical genres like "Mongolian Throat Singing" and how it compares to "Gamelan Throat Singing".

I also put the song lyrics for each song through GPT-3 and asked it to summarize the lyrical themes. That's also embedded and used in NLPlaylist.

Each feature for each song in our metadata dataset is now a big paragraph that describes the song. The paragraph is split up into sentences, and the embedding of each sentence is found. The final embedding for each song is then calculated by taking a weighted average over all sentence embeddings from the big paragraph and genre and lyrical embeddings.

To make your playlist, all that has to be done is compare the embedding of your query all 35,000 embeddings in the dataset and return the 100 most similar queries, using the cosine similarity distance metric. Thank god we have computers.

Once the 100 most similar candidate tracks are found, they are reranked using a "cross encoder" trained on 215M question-answer pairs from various sources and domains, including StackExchange, Yahoo Answers, Google & Bing search queries to give the best matches.

3 comments

Incredible work, and great explanation, thank you. Could you comment more on the cost of running this (e.g. per query or per hour, or however it's set up)? Where are you running the model from?

Also, could you provide more details on the cross encoder used for reranking?

Btw, if you already have all these song embeddings, it would be very interesting to be able to pick a song, and get all the similar ones in a playlist (sort of like "Song radio" on Spotify)!

Having worked with the same tech, I'd assume this is pretty inexpensive both to train and run. Probably in the low 1000s (maybe even mid 100s?) to train. knn search on 35k entries is pretty simple. The most expensive is probably the cross encoder (both to train and to run). Would also be interested to know
That's awesome, thank you so much for sharing all that. The project is great.
Very impressive. I'm running into an issue where if I try to tell it to exclude yacht rock, it gives me yacht rock. Is that hard to train?