| Danger awaits all ye who enter this tutorial with large datasets. The tutorial is fun marketing material and all, but it's FAR too slow to be used on anything at scale. Please, for your sanity, don't treat this as anything other than a fun toy example. ER wants to be an O(n^2) problem, and you have to fight very hard for it not to turn into one. Most people doing this at scale are following basically the same playbook: 1) Use a very complicated, speed-optimized non-ML algorithm to find groups of entities that are highly likely to be the same, usually based on string similarity, hashing, or extremely complicated heuristics. This process is called blocking or filtering. 2) Use fancy ML to determine matches within these blocks. If you try to combine 1 & 2, skip 1, or try to do 1 with ML on large datasets, you are guaranteed to have a bad time. The difference between mediocre and amazing ER is how well you do blocking/filtering. |
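
To make the two-step playbook concrete, here is a minimal Python sketch, not anything a production system would ship. Everything in it is an assumption for illustration: the function names (`blocking_key`, `build_blocks`, `candidate_pairs`, `is_match`, `resolve`) are hypothetical, the blocking key is a naive three-character prefix rather than a real heuristic, and `difflib.SequenceMatcher` stands in for the "fancy ML" matcher in step 2.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(name: str) -> str:
    """Hypothetical cheap key: first three alphanumeric characters, lowercased."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum())
    return cleaned[:3]


def build_blocks(records):
    """Step 1 (blocking): group record indices by key in a single pass."""
    blocks = defaultdict(list)
    for idx, name in enumerate(records):
        blocks[blocking_key(name)].append(idx)
    return blocks


def candidate_pairs(blocks):
    """Yield index pairs only within a block, never across blocks."""
    for members in blocks.values():
        yield from combinations(members, 2)


def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Step 2 stand-in: a string-similarity ratio where a trained model would go."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def resolve(records, threshold: float = 0.8):
    """Run the expensive matcher only on pairs that survived blocking."""
    blocks = build_blocks(records)
    return [
        (i, j)
        for i, j in candidate_pairs(blocks)
        if is_match(records[i], records[j], threshold)
    ]


# Made-up toy records, purely for illustration.
records = [
    "Acme Corporation",
    "ACME Corp.",
    "Acme Corpration",
    "Globex Inc",
    "Globex Incorporated",
    "Initech",
]

blocks = build_blocks(records)
pairs = list(candidate_pairs(blocks))
all_pairs = len(records) * (len(records) - 1) // 2
matches = resolve(records)

print(f"compared {len(pairs)} of {all_pairs} possible pairs")
print("matched index pairs:", matches)
```

On these six toy records, blocking cuts the comparisons from 15 possible pairs to 4. The flip side is that a prefix key this crude would never put "Acme" and "The Acme Company" in the same block, so the matcher never gets a chance to see that pair, which is why the quality of the blocking step ends up mattering so much.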