Hacker News
by mynegation 1482 days ago
I remember helping my little sister, who got an entity resolution homework assignment (people's names and company names) for a programming class 26 years ago (she is an economics major and I am CS). It was infuriating and intellectually challenging at the same time. We came up with a combination of n-grams, Levenshtein distance, and canonicalization of common abbreviations (think "Inc." and "Corp."). It worked reasonably well.
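For anyone curious what such a combination might look like, here is a minimal Python sketch. It is not the original homework solution. The suffix table, the choice of character trigrams with Jaccard overlap, and the equal 0.5/0.5 weighting are all assumptions made for illustration, and every function name is hypothetical.

```python
import re

# Hypothetical suffix table (an assumption): maps common corporate
# designators to one canonical token so "Corp." and "Corporation" agree.
SUFFIXES = {
    "incorporated": "inc", "inc": "inc",
    "corporation": "corp", "corp": "corp",
    "company": "co", "co": "co",
    "limited": "ltd", "ltd": "ltd",
}


def canonicalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and unify suffix variants."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", name.lower()).split()
    return " ".join(SUFFIXES.get(tok, tok) for tok in tokens)


def ngrams(s: str, n: int = 3) -> set:
    """Set of character n-grams, padded so word edges produce grams too."""
    pad = " " * (n - 1)
    padded = pad + s + pad
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of the two n-gram sets, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_score(a: str, b: str) -> float:
    """Average of both similarities after canonicalization.

    The equal weighting is an arbitrary assumption, not a tuned value.
    """
    ca, cb = canonicalize(a), canonicalize(b)
    return 0.5 * ngram_similarity(ca, cb) + 0.5 * edit_similarity(ca, cb)


if __name__ == "__main__":
    print(match_score("Acme Corporation", "ACME Corp."))
    print(match_score("Jon Smith", "John Smith"))
    print(match_score("Acme Corp.", "Zenith Ltd."))
```

Comparing every pair of records this way is quadratic in the number of records, so it only suits small inputs like a class assignment. Any cutoff applied to `match_score` would need to be tuned on real data.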
1 comment

This is exactly why I love this problem! There are a lot of fun ways to be creative here, but as the other comments mentioned, getting a scalable and really good solution is extremely difficult.