| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jonpon 1484 days ago

You are 100% correct, this is a toy example that I decided to put together for fun after talking to a bunch of people who mentioned it as a problem that they experience.

The main idea I wanted to add to the discussion (which is not that crazy of an addition) is that you can possible use sentence embedding instead of fuzzy matching on the actual letters to get more "domain expertise"

How to actually compare these embeddings with all the other embeddings you have in a large dataset is a problem that is completely out of the scope of this tutorial

1 comments

smeeth 1484 days ago

Yeah, totally, I get you. I'm not trying to do a takedown, ER is just hard.

The point I was trying to make is that at scale one does not simply:

> compare these embeddings with all the other embeddings you have

You just can't, similarity metrics (especially cosine) on 768 dim arrays are prohibitively slow.

Using embeddings is quite common in the literature and in deployment (I have, in fact, deployed ER that uses embeddings), but only as part of #2, the matching step. In many projects, doing full pairwise comparisons would take on the order of years, you have to do something else to refine the comparison sets first.

link

code_biologist 1483 days ago

You just can't, similarity metrics (especially cosine) on 768 dim arrays are prohibitively slow.

Any reason you couldn't just dump it in FAISS or Annoy? No need to do pairwise comparisons. Annoy claims to handle even up to 1,000 dimensions reasonably. https://github.com/facebookresearch/faiss/issues/95

link

smeeth 1483 days ago

Yeah, ANN algos provide a logical way to do blocking. It's not quite as simple as just running it though.

First, you have to be able to actually run the server. Most ANN algos (not DiskANN or SPTAG, but basically everything else) are run in memory, so if you've got really beefy data you'll need machines that can handle the load. The search engine companies and researchers using toy problems generally don't run into this problem.

Second, Approximate Nearest Neighbors are, unsurprisingly, approximate, so if you want the top N true neighbors you actually have to 1) collect the top K*N approximate neighbors then 2) run true cosine similarity on the K*N neighbors to refine down to N.

So your new O(kn)(not the same k or n, sorry I overloaded my variables) ER algorithm is basically:

- set up ANN server

- for each entity, collect K*N approximate neighbors

- refine to N true neighbors

- perform matching step

- somehow handle the fact that you have merged entities in your ANN server (this can be weirdly tricky).

Also, I might be splitting hairs, but ANN isn't REALLY performing a similarity measure. Its doing clustering or making search trees that try to guarantee the leaf nodes will be close to within some threshold. In that sense it has more in common with a fuzzy hash or blocking algorithm than it does with an exact similarity calculation.

As you may be able to tell, I have spent more time than I would like thinking about and implementing this sort of thing.

link