Hacker News
by bigzyg33k 1119 days ago
Here's a (kinda) ELI5: you would use a language model to create "embeddings" of the text, which you can think of as a list of numbers (a vector) representing the "meaning" of a piece of text.

These numbers can be plotted as points in a space, and embeddings of things with similar meanings are plotted close to each other. So things like "exam preparation" would have embeddings close to things like "top study tips".

Say you have created embeddings once, up front, for a large corpus of text (in this case all YouTube captions). If you then create an embedding for a user query, you can search for the corpus embeddings closest to it, and the texts they belong to will be "semantically" similar to the query.

The advantage is that unlike traditional full-text search, the user doesn't need a query that includes words present in the text.
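
To make the search step concrete, here's a rough sketch in plain Python. The three-number "embeddings" are made up by hand purely for illustration (a real language model would give you vectors with hundreds of dimensions), and cosine similarity is just one common way to measure "closeness", not necessarily what any particular tool uses. The names CORPUS, cosine_similarity and search are all hypothetical.

```python
import math

# Hypothetical, hand-picked 3-dimensional embeddings for illustration only.
# In practice a language model would produce these vectors.
CORPUS = {
    "top study tips": [0.9, 0.1, 0.0],
    "easy pasta recipes": [0.0, 0.9, 0.2],
    "fixing a flat bike tire": [0.1, 0.0, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_embedding, corpus, top_k=1):
    """Return the top_k corpus texts whose embeddings are closest to the query."""
    ranked = sorted(
        corpus.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


# Pretend this vector is the model's embedding of the query "exam preparation".
query_embedding = [0.8, 0.2, 0.1]
results = search(query_embedding, CORPUS, top_k=2)
print(results)
```

Note that the query "exam preparation" shares no words with "top study tips", yet it ranks first because the two (made-up) vectors point in nearly the same direction. For a big corpus you wouldn't compare against every vector like this; a vector database or approximate nearest-neighbour index does the lookup instead.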


Do you have any resources that might guide one on doing something like this from scratch?
Here's a 6-minute speedrun of something like that on Weaviate: https://youtu.be/mBcBoGhFndY