by rishsriv 1589 days ago
This looks pretty cool! Is this basically efficient/scalable fuzzy object matching?

IMO, it would be super useful to have some performance benchmarks – how fast is this for 1k/100k objects? How does that compare to other approaches?

Not sure how feasible these are, but features I would find super useful:

- string matching across languages in different scripts (with something like unidecode maybe? [1])

- fuzzy matching that includes continuous variables like lat/long, age etc
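
To make the first request concrete, here is a rough sketch of the kind of normalization meant. It is a stdlib-only approximation and not anything from Zingg: it only folds case and strips diacritics (so "José" matches "Jose"), and it does not transliterate between scripts the way unidecode [1] does. The helper names `fold` and `name_similarity` are made up for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def fold(text: str) -> str:
    """Strip combining marks (diacritics) after NFKD decomposition, then casefold.

    Hypothetical helper: handles accents only, not cross-script transliteration.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def name_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between the folded forms of two strings."""
    return SequenceMatcher(None, fold(a), fold(b)).ratio()


if __name__ == "__main__":
    print(fold("José Müller"))
    print(name_similarity("José Müller", "Jose Muller"))
    print(name_similarity("José Müller", "Maria Schmidt"))
```

Matching across scripts (say, Latin against Devanagari) would need a transliteration step in place of `fold`, which is where something like unidecode could come in.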

Excited about using this – will be following the repo very closely!

[1] https://github.com/avian2/unidecode


Hi rishsriv,

Thanks for liking Zingg, super excited to hear this :-) Here are some performance numbers: https://docs.zingg.ai/docs/setup/hardwareSizing.html

We see that performance varies by a) the number of attributes to match, b) the size of the data, c) the type of matching and the features we compute for each, and d) the hardware and cluster size.

Although we do not do matching across languages, like English with Chinese, we have tested Zingg quite rigorously with Chinese, Japanese, Hindi, German and other languages, and it seems to work out of the box. This is likely due to Java's built-in Unicode support and the ML-based learning.

You make a great point about continuous variables like lat/long, age etc. Age seems to work, again due to integer differences and the learning. We have not tried lat/long yet. Would you have any dataset you could recommend for testing?
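
For lat/long, one way such a feature could be computed is sketched below. This is an assumption about how a distance feature might look, not how Zingg does it: the great-circle (haversine) distance between two points, mapped to a similarity in (0, 1]. The function names, the mean Earth radius constant, and the `scale_km` parameter are illustrative choices.

```python
import math

# Mean Earth radius in kilometres (an assumed constant for this sketch).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to guard against tiny floating-point overshoot above 1.0.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def geo_similarity(lat1: float, lon1: float, lat2: float, lon2: float,
                   scale_km: float = 1.0) -> float:
    """Map distance to a similarity in (0, 1]; identical points give 1.0.

    scale_km is a hypothetical tuning knob: the distance at which
    similarity drops to 1/e.
    """
    return math.exp(-haversine_km(lat1, lon1, lat2, lon2) / scale_km)


if __name__ == "__main__":
    # Two nearby points versus two distant ones.
    print(geo_similarity(12.9716, 77.5946, 12.9720, 77.5950))
    print(geo_similarity(12.9716, 77.5946, 28.6139, 77.2090))
```

A learned matcher could take either the raw distance or the scaled similarity as one input feature alongside the string features.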

Thanks for pointing me to the performance numbers!

No open datasets that I'm aware of for fuzzy geocoordinate matching, unfortunately.

Hmm, I guess we will wait and keep an eye out for such datasets.