by dsalaj 1313 days ago
I have been working on an entity matching solution for two years now, and I have decided to write down some of the lessons I picked up along the way. It turns out there are too many relevant details to cover in a single post, so I will cover the topic in multiple parts.

This first part is the high-level introduction, useful for project planning and for the architecture decisions that need to be made early in the development process. Any feedback is welcome, along with requests for the follow-up parts if there is something specific you would like covered.

2 comments

Thank you for a very helpful writeup!

Do you have any materials on word embedding strategies past Word2Vec? BERT and beyond?

I am currently working on a recommendation engine for a large library. The original idea is to find "similar" documents; the funding comes from a plagiarism-checking project.

I was slightly surprised by how deceptively simple the widely cited winnowing paper (https://dl.acm.org/doi/10.1145/872757.872770) is. The key idea is a simple mod reduction of hashed fingerprints.
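For readers who have not seen the paper, here is a minimal Python sketch of the fingerprint selection step. As far as I can tell from the paper, the "0 mod p" reduction is the earlier baseline it compares against, while winnowing itself keeps the minimum hash of every window of w consecutive k-gram hashes (the rightmost one on ties). The function names, the normalization, the use of `zlib.crc32` instead of a rolling hash, and the default values of `k` and `w` are all assumptions made for illustration.

```python
import zlib

def kgram_hashes(text, k):
    """Hash every k-gram of the normalized text."""
    # Assumed normalization: lowercase and keep only alphanumeric characters.
    s = "".join(ch.lower() for ch in text if ch.isalnum())
    # crc32 is stable across runs, unlike the built-in hash() on strings.
    return [zlib.crc32(s[i:i + k].encode("utf-8")) for i in range(len(s) - k + 1)]

def winnow(hashes, w):
    """Return {(hash, position)}: the minimum of each window of w hashes."""
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        m = min(window)
        # On ties, take the rightmost occurrence of the minimum.
        offset = w - 1 - window[::-1].index(m)
        selected.add((m, start + offset))
    return selected

def fingerprint(text, k=5, w=4):
    """Set of selected hash values for a document (positions dropped)."""
    return {h for h, _ in winnow(kgram_hashes(text, k), w)}

def jaccard(a, b):
    """Overlap of two fingerprint sets, in [0, 1]."""
    return len(a & b) / len(a | b) if (a or b) else 0.0
```

Comparing two documents is then `jaccard(fingerprint(doc_a), fingerprint(doc_b))`. Note that inputs with fewer than `w` k-grams yield an empty fingerprint in this sketch.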

My project's goal is to find phrase level similarities to assist researchers.

It seems k-grams, n-grams, tf-idf and even Word2Vec are not going to cut it. A "smarter" context-aware embedding is in order. My foray into training BERT from scratch was not very successful, and my corpora are not in English...

PS. As usual I spend most of the time on improving OCR quality and preprocessing corpora...

Achieving high matching accuracy really comes down to the details of your specific domain and dataset. Regarding the use of large pre-trained language models to find semantically similar documents, which is what you are attempting now, maybe try Whisper or other multilingual models, and then fine-tune them on your dataset.

But a better bet might be to look into simpler embeddings and methods and try to improve them directly by building some domain knowledge into the method or the process. Again, it is hard to judge what might work better from the surface alone.

In case you really need to work with labeled datasets, set up a strong baseline, look into active-learning methods and set up the loop, then do a few iterations and try to predict whether it will reach your target accuracy fast enough.
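To make the loop concrete, here is a rough Python sketch of one common variant, uncertainty sampling, under strong simplifying assumptions: each candidate pair already has a single similarity score, the "model" is just a threshold on that score, and `oracle` is a hypothetical stand-in for the human annotator. None of the names come from the article.

```python
import random

def fit_threshold(labeled):
    """Pick the score threshold that best separates the labeled pairs."""
    scores = sorted(score for score, _ in labeled)
    # Candidates: the two extremes plus midpoints between neighboring scores.
    candidates = [0.0, 1.0] + [(a + b) / 2 for a, b in zip(scores, scores[1:])]
    def accuracy(t):
        return sum((score >= t) == label for score, label in labeled) / len(labeled)
    return max(candidates, key=accuracy)

def active_learning_loop(pool, oracle, seed_size=4, rounds=10, batch=2, rng=None):
    """pool: list of (pair_id, score); oracle(pair_id) -> bool (is a match)."""
    rng = rng or random.Random(0)
    pool = list(pool)
    rng.shuffle(pool)
    # Label a small random seed set to get a first model.
    labeled = [(score, oracle(pid)) for pid, score in pool[:seed_size]]
    unlabeled = pool[seed_size:]
    history = []  # (number of labels used, threshold) per round
    for _ in range(rounds):
        t = fit_threshold(labeled)
        history.append((len(labeled), t))
        if not unlabeled:
            break
        # Uncertainty sampling: pairs closest to the boundary are least certain.
        unlabeled.sort(key=lambda item: abs(item[1] - t))
        queried, unlabeled = unlabeled[:batch], unlabeled[batch:]
        labeled += [(score, oracle(pid)) for pid, score in queried]
    return fit_threshold(labeled), history
```

With a real model you would replace `fit_threshold` with your baseline classifier and record held-out accuracy in `history`; tracking that against the number of labels over a few rounds is one way to estimate how many more labels the target accuracy would need.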

Thank you for this writeup. Having done some work on deduplication/matching systems, my experience (as with many things in data science) is that there are a lot of things to consider and there is no single best solution. Hopefully you are able to keep up with this series, because I think it will be very helpful to many people.