by BiteCode_dev
712 days ago
I worked with a client that implemented their own Python version of this to deduplicate citizen entries in a big French gov database. It worked well. Of course, nowadays I would probably just tell them to use datasketch (https://pypi.org/project/datasketch/). With this trip down memory lane, I looked around a little, and noticed people are still creating new stuff on the topic, e.g. https://pypi.org/project/rensa/, which is basically a more specialized but faster version of datasketch's MinHash, written in Rust, with a little Python on top.
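For anyone wondering what a home-grown Python version of MinHash might look like, here is a minimal stdlib-only sketch. It is not the client's code and not datasketch's API; the function names (`shingles`, `minhash`, `estimate_jaccard`), the 3-character shingles, and the 128 hash functions are all assumptions picked for illustration.

```python
import hashlib

NUM_PERM = 128  # number of seeded hash functions; arbitrary choice for this sketch


def shingles(text, k=3):
    """Return the set of character k-grams of a case/whitespace-normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def minhash(items, num_perm=NUM_PERM):
    """Build a signature: for each seeded hash, keep the minimum value over the set."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for x in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = minhash(shingles("Jean Dupont, 12 rue de la Paix, Paris"))
    b = minhash(shingles("Jean Dupond, 12 rue de la Paix Paris"))
    c = minhash(shingles("Marie Curie, 1 avenue Foch, Lyon"))
    print("near-duplicate:", estimate_jaccard(a, b))
    print("unrelated:     ", estimate_jaccard(a, c))
```

In a real dedup job you would not compare every pair of signatures; the usual next step is to bucket signatures with locality-sensitive hashing so that only likely matches get compared, which is what libraries like datasketch handle for you.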
I've also written up some interactive tutorials on how the method works [1] if anyone's interested
[0] https://github.com/moj-analytical-services/splink
[1] https://www.robinlinacre.com/intro_to_probabilistic_linkage/