by Olreich
975 days ago
The “secret sauce” is Word2Vec. Take a snapshot of all the text you can find on the internet, then slide a context window over it and build a vector for each word based on the words that appear around it. The core assumption is that words that show up in similar contexts have similar meanings. Nobody hand-picks what each vector component represents; the values are learned by training a small neural network to predict a word from its context (or the context from the word), so we’re effectively convincing a computer to decide for us. Here’s the paper about the technique, which might help: https://arxiv.org/pdf/1301.3781.pdf Once all the words have vectors, you can assume there’s meaning in there and move on to doing math on the vectors to find interesting correlations. It looks like the paper scores the trained vectors on how well they hold up under simple vector arithmetic (its word-analogy tests), so you can likely come up with a different comparison criterion than the paper uses and get a more useful vectorization for your own purposes. Cosine similarity seems to be good enough for most things, though.
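To make the sliding window and cosine similarity ideas concrete, here is a rough Python sketch. It is not Word2Vec: instead of training a network, it just counts which words fall inside each word's window, which is a much cruder stand-in for learned embeddings. The function names, the toy corpus, and the window size of 2 are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def context_vectors(tokens, window=2):
    """For each word, count the words seen within `window` positions of it.

    This is a count-based stand-in for Word2Vec's learned vectors.
    """
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                vectors[word][tokens[j]] += 1
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty vector as similar to nothing
    return dot / (norm_a * norm_b)


# Toy corpus (hypothetical): "cat" and "dog" appear in near-identical contexts.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vecs = context_vectors(corpus, window=2)

sim_cat_dog = cosine_similarity(vecs["cat"], vecs["dog"])
sim_cat_on = cosine_similarity(vecs["cat"], vecs["on"])
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs on:  {sim_cat_on:.3f}")
```

Even on this tiny corpus, "cat" scores closer to "dog" than to "on", which is the "similar context, similar meaning" assumption in miniature. Real Word2Vec replaces the raw counts with dense vectors learned by the prediction task described in the paper.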