by jonpon
1484 days ago
You are 100% correct. This is a toy example that I put together for fun after talking to a bunch of people who mentioned it as a problem they experience. The main idea I wanted to add to the discussion (which is not that crazy of an addition) is that you can possibly use sentence embeddings instead of fuzzy matching on the actual letters to get more "domain expertise". How to actually compare these embeddings with all the other embeddings you have in a large dataset is a problem that is completely out of the scope of this tutorial.

The point I was trying to make is that at scale one does not simply:
> compare these embeddings with all the other embeddings you have
You just can't: similarity metrics (especially cosine) on 768-dimensional arrays are prohibitively slow when run over every pair.
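To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the vectors are already normalised, so cosine similarity reduces to one dot product of about 2 × 768 floating-point operations per pair; the function names are made up for illustration and the estimate ignores memory traffic entirely.

```python
def pair_count(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def flops_for_all_pairs(n: int, dim: int = 768) -> int:
    """Rough operation count for one dot product per pair.

    Assumption: vectors are pre-normalised, so cosine similarity costs about
    2 * dim operations (dim multiplies plus dim additions) per pair.
    """
    return pair_count(n) * 2 * dim


# The pair count grows quadratically, so the cost explodes with dataset size.
for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} records -> {pair_count(n):.3e} pairs, "
          f"~{flops_for_all_pairs(n):.3e} operations")
```

Multiplying the record count by 100 multiplies the work by roughly 10,000, which is why the all-pairs approach stops being an option well before the dataset gets "big".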
Using embeddings is quite common in the literature and in deployment (I have, in fact, deployed entity resolution (ER) that uses embeddings), but only as part of #2, the matching step. In many projects, doing full pairwise comparisons would take on the order of years; you have to do something else first to narrow the comparison sets.
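One common way to "do something else first" is blocking: group records by a cheap key and only compare embeddings inside each group. The sketch below is a toy illustration, not what I deployed. The three-letter prefix key, the sample names, and the random vectors standing in for real sentence embeddings are all assumptions made for the example.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def blocking_key(name: str) -> str:
    """Hypothetical blocking key: first three letters of the lower-cased name."""
    return name.strip().lower()[:3]


def candidate_pairs(names):
    """Yield index pairs only for records that share a blocking key."""
    blocks = defaultdict(list)
    for i, name in enumerate(names):
        blocks[blocking_key(name)].append(i)
    for members in blocks.values():
        yield from combinations(members, 2)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


names = ["Acme Corp", "ACME Corporation", "Globex", "Globex Inc", "Initech"]

# Stand-in for real sentence embeddings: random 768-dimensional vectors.
rng = random.Random(0)
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(768)] for _ in names]

# Only pairs inside the same block reach the expensive similarity step.
pairs = list(candidate_pairs(names))
scores = {(i, j): cosine(embeddings[i], embeddings[j]) for i, j in pairs}

all_pairs = len(names) * (len(names) - 1) // 2
print(f"compared {len(pairs)} of {all_pairs} possible pairs")
```

A prefix key like this is crude: a typo in the first three characters puts two true duplicates in different blocks, so real pipelines tend to combine several keys or use an approximate nearest-neighbour index instead.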