by gpa 1751 days ago
The problem: correct string matching at scale. I am aware of fuzzy string matching, but two strings can be more than 90% similar even when the difference is, for example, one digit in the year of manufacture. My current solution is to wrangle both strings into as similar a form as the available information allows, and then apply constraints on make, model and year (they must be the same). It works pretty well, but I am looking for a more interactive (human-in-the-loop) solution.
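For readers who want something concrete, here is a minimal sketch of the approach described above: normalize both strings, compute a fuzzy similarity, and only accept the pair when make, model and year are also equal. The `Listing` record, its field names, the `normalize` step and the 0.9 threshold are all assumptions for illustration, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever fuzzy matcher is actually in use.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Listing:
    # Hypothetical record layout; the real data may look different.
    make: str
    model: str
    year: int
    description: str


def normalize(text: str) -> str:
    # Stand-in for the "wrangling" step: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(a: Listing, b: Listing, threshold: float = 0.9) -> bool:
    # Hard constraints first: make, model and year must be equal.
    same_keys = (
        normalize(a.make) == normalize(b.make)
        and normalize(a.model) == normalize(b.model)
        and a.year == b.year
    )
    # The fuzzy score only counts when the hard constraints hold.
    return same_keys and similarity(a.description, b.description) >= threshold


a = Listing("Toyota", "Corolla", 2015, "Toyota Corolla 2015 1.8 LE sedan")
b = Listing("Toyota", "Corolla", 2016, "Toyota Corolla 2016 1.8 LE sedan")
print(similarity(a.description, b.description) > 0.9)  # high score despite the year
print(is_match(a, b))  # rejected by the year constraint
```

The two listings differ by one digit, so the fuzzy score alone would accept them; the equality check on the year is what rejects the pair.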

I'd just slap a GUI / audit logs on top. Show the intermediate data (the “wrangling”), show the computed similarities, show the conclusion (this met that threshold, and the other was equal, so it's category seven).
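A rough sketch of what one such audit record could look like, independent of the GUI toolkit: it keeps the raw inputs, the wrangled forms, the computed similarity, each equality check and the resulting decision, so a reviewer can see why a pair was classified the way it was. The function name `audit_record`, the field names, the decision labels and the 0.9 threshold are all hypothetical, and `difflib.SequenceMatcher` is just a stand-in for the real similarity measure.

```python
import json
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    # Stand-in for the wrangling step: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def audit_record(raw_a, raw_b, keys_a, keys_b, threshold=0.9):
    """Build one JSON-serializable record explaining a match decision."""
    norm_a, norm_b = normalize(raw_a), normalize(raw_b)
    score = SequenceMatcher(None, norm_a, norm_b).ratio()
    # One equality check per key (e.g. make, model, year).
    key_checks = {
        k: keys_a.get(k) == keys_b.get(k)
        for k in sorted(set(keys_a) | set(keys_b))
    }
    passed_threshold = score >= threshold
    if passed_threshold and all(key_checks.values()):
        decision = "match"
    elif passed_threshold:
        # Similar text but a key disagrees: a candidate for human review.
        decision = "needs_review"
    else:
        decision = "no_match"
    return {
        "raw": [raw_a, raw_b],
        "normalized": [norm_a, norm_b],
        "similarity": round(score, 4),
        "threshold": threshold,
        "key_checks": key_checks,
        "decision": decision,
    }


record = audit_record(
    "Toyota Corolla 2015",
    "Toyota  corolla 2016",
    {"make": "toyota", "model": "corolla", "year": 2015},
    {"make": "toyota", "model": "corolla", "year": 2016},
)
print(json.dumps(record, indent=2))
```

Writing each record as a line of JSON gives an audit log more or less for free, and the same dictionary can feed whatever front end ends up displaying it.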
Can you elaborate on the technical details: which language, library or framework would you use?
Tkinter, probably. Or a web interface. Depends on what I'm doing, honestly – the answer will always be “whatever's currently being used”.