
Thanks for replying! I don't tend to do that type of work anymore, but I'm still stoked to see a solution to the problem I had frequently. I think there's a great service to be built (and maybe it's yours!) that deduplicates data.

Specific models might be an interesting add-on. Address parsing, normalization, and deduplication (with potential covariates like phone number, email address, etc.) are a massive pain in the ass for any data engineer who works with sales or marketing folks. Their databases (CRMs) are awful -- it was always a chore to clean these up, but it measurably saved money (imagine you mail physical cards and only want 1 per customer... but you have 5 different contacts at that company for 3 unique individuals).
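To make the mailing example concrete, here's a rough sketch of the kind of cleanup I mean. Everything in it is made up for illustration: the field names, the sample contacts, and the helper functions are hypothetical, and the matching is deliberately naive (exact match on a normalized email, falling back to phone digits). Real CRM data needs fuzzier matching than this.

```python
import re

def normalize_email(email):
    # Lowercase and trim; assumes case and stray whitespace are the only noise.
    return (email or "").strip().lower()

def normalize_phone(phone):
    # Keep digits only, so "(555) 010-2000" and "555.010.2000" compare equal.
    return re.sub(r"\D", "", phone or "")

def normalize_company(name):
    # Lowercase, drop punctuation, collapse repeated whitespace.
    cleaned = re.sub(r"[^\w\s]", "", (name or "").lower())
    return " ".join(cleaned.split())

def person_key(contact):
    # Naive identity key: email if present, otherwise phone digits.
    return normalize_email(contact["email"]) or normalize_phone(contact["phone"])

def unique_individuals(contacts):
    # Keep the first record seen for each person key.
    seen = {}
    for contact in contacts:
        seen.setdefault(person_key(contact), contact)
    return seen

def one_per_company(contacts):
    # Keep the first record seen for each normalized company name.
    seen = {}
    for contact in contacts:
        seen.setdefault(normalize_company(contact["company"]), contact)
    return seen

# Hypothetical sample: 5 contact rows, 3 real people, 1 company.
contacts = [
    {"name": "Ann Lee", "company": "Acme, Inc.", "email": "Ann.Lee@acme.example", "phone": "(555) 010-2000"},
    {"name": "ann lee", "company": "ACME Inc", "email": "ann.lee@acme.example ", "phone": "555.010.2000"},
    {"name": "Bob Ray", "company": "Acme Inc.", "email": "bob@acme.example", "phone": ""},
    {"name": "B. Ray", "company": "acme inc", "email": "BOB@ACME.EXAMPLE", "phone": ""},
    {"name": "Cy Dunn", "company": "Acme  Inc", "email": "cy@acme.example", "phone": ""},
]

people = unique_individuals(contacts)
mailings = one_per_company(contacts)
print(len(contacts), "rows,", len(people), "people,", len(mailings), "card to mail")
```

The obvious gap: if one row has only an email and another has only a phone number for the same person, this treats them as two people. That's exactly where the covariates and a smarter model would earn their keep.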

I would have paid for a deduplication service -- say, quarterly batches at upwards of $500/quarter for e.g. 20-50k contacts.

The one-size-fits-all approach isn't really a value add for me; that wasn't so much my issue. For other target users, I can see the use -- for them, the interface is the value add, especially if you can read/write Excel files directly.

Stop words aren't something I used in my deduplication efforts. How many of your users request or use this? What kind of stop words do you want to exclude when comparing two entries? I would be worried that stop words still carry information: "The Store" versus "Store" might be a significant difference.
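Here's a tiny sketch of the concern, assuming a simple whitespace tokenizer and a stop word list I made up (I don't know what your product actually uses). With stop words stripped, the two names collapse into the same comparison key; without stripping, they stay distinct.

```python
# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and"}

def comparison_key(name, drop_stop_words):
    # Lowercase, split on whitespace, optionally drop stop words.
    tokens = name.lower().split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

a, b = "The Store", "Store"

# Keys are identical once "the" is dropped, so the pair looks like a duplicate.
collapsed = comparison_key(a, True) == comparison_key(b, True)

# Keys differ when stop words are kept, so the pair stays separate.
distinct = comparison_key(a, False) != comparison_key(b, False)

print(collapsed, distinct)
```

If those are actually two different businesses, dropping "The" merges them silently, which is why I'd want stop word removal to be opt-in or at least reviewable.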