
by trhway 1773 days ago

Typically deduplication is more like clean, standardize and match (using relatively simple matching approaches). Our process was much more complex (more along the lines of what another commenter linked: https://blog.acolyer.org/2020/12/14/entity-resolution/), with richer resulting functionality. And yes, marketing was the segment where it started, though I'd say it was only about a third of the business at its peak. Yep, performance is an issue for most of the solutions. We had to do significant re-engineering at one point to parallelize at a much finer granularity, which let us scale much further. The largest implementation had just under 200M entities, counting across all the sources. That was on hardware from right before Nehalem.
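To make the "clean, standardize and match" baseline concrete, here is a minimal sketch of that simple kind of deduplication (not the more complex entity-resolution process described above). The field names, the abbreviation table and the exact-key matching rule are all assumptions picked for illustration.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table; a real system would use a far larger one.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def clean(value: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    value = re.sub(r"[^a-z0-9\s]", " ", value.lower())
    return " ".join(value.split())


def standardize(value: str) -> str:
    """Expand known abbreviations token by token."""
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in value.split())


def match_key(record: dict) -> tuple:
    """Build the key on which two records are considered the same entity."""
    return (
        standardize(clean(record["name"])),
        standardize(clean(record["address"])),
    )


def deduplicate(records: list) -> list:
    """Group records that share an identical cleaned and standardized key."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)
    return list(groups.values())


records = [
    {"name": "John  Smith", "address": "12 Main St."},
    {"name": "john smith", "address": "12 main street"},
    {"name": "Jane Doe", "address": "5 Oak Ave"},
]
groups = deduplicate(records)
```

Here the first two records land in the same group and the third stays alone. Exact-key matching like this misses typos and partial records, which is where the heavier entity-resolution approaches come in.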

Batch mode naturally had orders of magnitude higher throughput. We also had a real-time single-record mode, which was pretty fast as long as the stream of incoming records didn't saturate the worker array's capacity. (That is the difference from serverless: the worker array was limited to whatever was statically configured at the moment, since adding or removing nodes wasn't an instant, on-the-fly operation.)
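The saturation behaviour of a statically sized worker array can be sketched roughly as below. This is only an illustration under assumptions: the `WorkerArray` class, its parameters and the reject-when-full policy are hypothetical, and the original system ran on multiple nodes rather than threads in one process.

```python
import queue
import threading


class WorkerArray:
    """A fixed pool of workers fed by a bounded backlog (hypothetical sketch)."""

    def __init__(self, num_workers: int, backlog: int, handler):
        # Both sizes are fixed at construction time, like a static config.
        self._queue = queue.Queue(maxsize=backlog)
        self._handler = handler
        for _ in range(num_workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            record = self._queue.get()
            try:
                self._handler(record)
            finally:
                self._queue.task_done()

    def submit(self, record) -> bool:
        """Accept a record, or return False if the array is saturated."""
        try:
            self._queue.put_nowait(record)
            return True
        except queue.Full:
            return False

    def drain(self):
        """Block until every accepted record has been handled."""
        self._queue.join()
```

With, say, 2 workers and a backlog of 2, at most 4 records can be in flight at once. Anything beyond that is rejected until the workers catch up, since the pool cannot grow on demand.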

A couple of years later I worked at another company on a similar, though somewhat simpler, project while it was in the middle of a total rewrite for performance reasons, as the old version was really slow. That rewrite failed spectacularly for a lot of reasons. So yes, performance is quite a noticeable factor in this domain.