Hacker News new | ask | show | jobs
by smeeth 1477 days ago
Danger awaits all ye who enter this tutorial and have large datasets.

The tutorial is fun marketing material and all but its FAR too slow to be used on anything at scale. Please, for your sanity, don't treat this as anything other than a fun toy example.

ER wants to be an an O(n^2) problem and you have to fight very hard for it not to turn into one. Most people doing this at scale are following basically the same playbook:

1) Use a very complicated speed optimized non-ml algorithm to find groups of entities that are highly likely to be the same, usually based on string similarity, hashing, or extremely complicated heuristics. This process is called blocking or filtering.

2) Use fancy ML to determine matches within these blocks.

If you try to combine 1 & 2, skip 1, or try to do 1 with ML on large datasets you are guaranteed to have a bad time. The difference between mediocre and amazing ER is how well you are doing blocking/filtering.

5 comments

So much this. I work on spatiotemporal entity resolution and it requires extreme care and cleverness to not turn it into a proverbial "boiling the ocean" compute problem for all but the most trivial cases. The implied state that needs to be managed alone is often effectively intractable. At one time supercomputers were used for large-scale entity resolution problems; they may still.
The beautiful thing about spatiotemporal entities is that physical time provides "out of the box common sense" decent locality, and physical space excellent locality, for the blocking process your parent comment mentions.

It sucks to work on a whole product catalog or rolodex.

Cool, TIL about blocking/ filtering. We do entity resolution such that if we have a "process id" (ex: 1234) in two different event logs we can determine if they're the same process (since pids get reused). We don't have to compare every process to every other process, only ones that share the same pid, which drastically reduces the data size, and then we do filtering on top of that based on other optional attributes.

When I wrote this system I didn't know what ER even was, an article like this would have helped a lot, even just the first line defining ER would have helped a lot.

You are 100% correct, this is a toy example that I decided to put together for fun after talking to a bunch of people who mentioned it as a problem that they experience.

The main idea I wanted to add to the discussion (which is not that crazy of an addition) is that you can possible use sentence embedding instead of fuzzy matching on the actual letters to get more "domain expertise"

How to actually compare these embeddings with all the other embeddings you have in a large dataset is a problem that is completely out of the scope of this tutorial

Yeah, totally, I get you. I'm not trying to do a takedown, ER is just hard.

The point I was trying to make is that at scale one does not simply:

> compare these embeddings with all the other embeddings you have

You just can't, similarity metrics (especially cosine) on 768 dim arrays are prohibitively slow.

Using embeddings is quite common in the literature and in deployment (I have, in fact, deployed ER that uses embeddings), but only as part of #2, the matching step. In many projects, doing full pairwise comparisons would take on the order of years, you have to do something else to refine the comparison sets first.

You just can't, similarity metrics (especially cosine) on 768 dim arrays are prohibitively slow.

Any reason you couldn't just dump it in FAISS or Annoy? No need to do pairwise comparisons. Annoy claims to handle even up to 1,000 dimensions reasonably. https://github.com/facebookresearch/faiss/issues/95

Yeah, ANN algos provide a logical way to do blocking. It's not quite as simple as just running it though.

First, you have to be able to actually run the server. Most ANN algos (not DiskANN or SPTAG, but basically everything else) are run in memory, so if you've got really beefy data you'll need machines that can handle the load. The search engine companies and researchers using toy problems generally don't run into this problem.

Second, Approximate Nearest Neighbors are, unsurprisingly, approximate, so if you want the top N true neighbors you actually have to 1) collect the top K*N approximate neighbors then 2) run true cosine similarity on the K*N neighbors to refine down to N.

So your new O(kn)(not the same k or n, sorry I overloaded my variables) ER algorithm is basically:

- set up ANN server

- for each entity, collect K*N approximate neighbors

- refine to N true neighbors

- perform matching step

- somehow handle the fact that you have merged entities in your ANN server (this can be weirdly tricky).

Also, I might be splitting hairs, but ANN isn't REALLY performing a similarity measure. Its doing clustering or making search trees that try to guarantee the leaf nodes will be close to within some threshold. In that sense it has more in common with a fuzzy hash or blocking algorithm than it does with an exact similarity calculation.

As you may be able to tell, I have spent more time than I would like thinking about and implementing this sort of thing.

This is my favourite paper on the different indexing (blocking) techniques - http://users.cecs.anu.edu.au/~Peter.Christen/publications/ch...

Very easy to read and explains the tradeoffs of the algorithms

Great advice! What is your rough guesstimate - in big O - of a general least upper bound of what you can do with ER?
Caveat: just my personal hunch.

My suspicion is O(kn), where k is both constant and rather large. This would be highly reliant on an extremely good blocking algorithm. Maybe trending towards O(nlogn)? I don't know. It's really tough.

The blocking algorithm for this hypothetical solution would most likely be an approximate hashing algorithm sort of resembling a bloom filter. The input data would have to be amenable to hashing, like a single string field. Add multiple fields and shit just got complicated fast. Look up q-gram blocking if you want to read a sort of simple example of a blocking algorithm like this.

Thinking about this, I think you would end up with something like O(s^2*b) where s is the size of your biggest block (e.g. all the "Smith"s or company names containing "Inc." if that doesn't get removed), and b is the number of blocks. This would be at least n, so O(kn) with a big k makes some sense. If you have some way to get away with not comparing every element in a block to every other element, e.g. exact string match only I think you could get away with O(s*log(s)*b), but most of the time the matches are not going to be good enough.