Hacker News
by professionalguy 2340 days ago
In data engineering, I think your response to the 'entity resolution' problem is a good Dunning-Kruger-style litmus test.

If you don't know, entity resolution is the process of matching rows in two or more databases that refer to the same real-world thing. Are these the same movie? Are these the same person?

Novice DE: Oh easy, just merge on the name.

Intermediate DE: OH GOD NO. <michael_scott_no.jpg>

Expert DE: That's complicated, but I have a plan.

1 comments

Just curious, is there a standard way to start attacking that problem?
You always need some sort of data normalization scheme, and one that makes sense for the task you're running.

(This includes things such as Unicode normalization and looking at other fields to determine whether two records are the same thing.)

And you get to handle duplicates too.
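
To make that concrete, here is a rough Python sketch of the first step, not a complete approach. The field names (`title`, `year`) and the helper names are made up for illustration, and stripping accents is an assumption that may be wrong for your data. It normalizes the name, uses a second field as a tiebreaker, and keeps every candidate so duplicates are visible instead of silently dropped.

```python
import unicodedata
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Decompose Unicode, drop combining marks, case-fold, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.casefold().split())


def match_key(record: dict) -> tuple:
    # Hypothetical schema: a "title" plus a "year" used as a second field.
    return (normalize_name(record["title"]), record.get("year"))


def resolve(left: list, right: list) -> list:
    """Return (left_record, right_record) pairs that share a match key."""
    index = defaultdict(list)
    for rec in right:
        # A list per key, so duplicate rows on the right side are all kept.
        index[match_key(rec)].append(rec)

    matches = []
    for rec in left:
        for candidate in index.get(match_key(rec), []):
            matches.append((rec, candidate))
    return matches


left = [{"title": "Amélie", "year": 2001}]
right = [
    {"title": "AMELIE  ", "year": 2001},
    {"title": "Amelie", "year": 1999},
]
matches = resolve(left, right)
```

Here only the 2001 row matches. Exact keys like this miss typos and alternate titles, so in practice you would likely treat the key as a cheap blocking step and run fuzzier comparisons within each block.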

That is just the start. The problem gets even more interesting in a real sharded scenario, because eventual consistency is hard.