Hacker News
by sireat 1311 days ago
Thank you for a very helpful writeup!

Do you have any materials on word embedding strategies past Word2Vec? BERT and beyond?

I am currently working on a recommendation engine for a large library, the original idea being to find "similar" documents; the funding comes from a plagiarism-checking project.

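I was slightly surprised how deceptively simple the widely cited winnowing paper is: https://dl.acm.org/doi/10.1145/872757.872770. The key idea is just to hash every k-gram and keep the minimum hash in each sliding window as a fingerprint, in place of the older trick of keeping only the hashes that are 0 mod p.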
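
For anyone who has not read it, here is a minimal sketch of the fingerprint selection as I read the paper: hash every k-gram, then keep the minimum hash in each window of w consecutive hashes (rightmost on ties). The plain 0 mod p selection is included for comparison. The function names, the CRC32 hash and the normalization step are my own assumptions, not something taken from the paper.

```python
import zlib


def kgram_hashes(text, k):
    """Hash every k-character substring of the normalized text.

    Normalization (assumed here): lowercase, keep only letters and digits.
    """
    norm = "".join(ch.lower() for ch in text if ch.isalnum())
    return [zlib.crc32(norm[i:i + k].encode("utf-8"))
            for i in range(len(norm) - k + 1)]


def winnow(hashes, w):
    """Keep the minimum hash of each window of w consecutive hashes.

    Ties are broken by taking the rightmost minimum. Returns a set of
    (hash, position) pairs; inputs shorter than w yield an empty set.
    """
    fingerprints = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        offset = w - 1 - window[::-1].index(smallest)  # rightmost occurrence
        fingerprints.add((smallest, start + offset))
    return fingerprints


def mod_p_fingerprints(hashes, p):
    """Comparison scheme: keep every hash that is 0 mod p."""
    return {(h, i) for i, h in enumerate(hashes) if h % p == 0}


# Small hand-checkable hash sequence, window size 4.
demo = [77, 72, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 72, 42, 17, 98]
print(sorted(winnow(demo, 4), key=lambda pair: pair[1]))
# → [(17, 3), (17, 6), (8, 8), (39, 11), (17, 15)]
```

With k-grams of length k and windows of w hashes, any shared run of at least w + k - 1 normalized characters contains a full window, so both documents end up with at least one common fingerprint hash. The 0 mod p scheme gives no such guarantee, since a long stretch may contain no hash divisible by p.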

My project's goal is to find phrase level similarities to assist researchers.

It seems k-grams, n-grams, tf-idf and even Word2Vec are not going to cut it. A "smarter" context-aware embedding is in order. My foray into training BERT from scratch was not very successful. My corpora are not in English...

PS. As usual I spend most of the time on improving OCR quality and preprocessing corpora...

1 comment

High matching accuracy really comes down to the details of your specific domain and dataset. If you want to stay with large pre-trained language models for finding semantically similar documents, which is what you are attempting now, maybe try a pre-trained multilingual text model such as multilingual BERT or XLM-RoBERTa, and then fine-tune it on your dataset instead of training from scratch.

But a better bet might actually be to look into simpler embeddings and methods and try to improve them directly by building some domain knowledge into the method or the process. Again, it is hard to judge what would work better from the surface alone.
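
To make that concrete, here is a rough sketch of a simple method with domain knowledge baked in: tf-idf over character trigrams, with a normalization step that undoes typical OCR confusions before comparing. The `OCR_FIXES` table, the function names and the example documents are all hypothetical; you would replace them with whatever errors and conventions your own corpus actually shows.

```python
import math
import unicodedata
from collections import Counter

# Hypothetical domain rule: this imaginary OCR engine confuses these characters.
OCR_FIXES = {"0": "o", "1": "l"}


def normalize(text):
    """Lowercase, strip diacritics and apply the hypothetical OCR fixes."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return "".join(OCR_FIXES.get(ch, ch) for ch in text)


def char_ngrams(text, n=3):
    """Character n-grams of the normalized, whitespace-collapsed text."""
    text = " ".join(normalize(text).split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """One sparse tf-idf vector (a dict) per document, with smoothed idf."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    doc_freq = Counter(gram for c in counts for gram in c)
    idf = {gram: math.log((1 + len(docs)) / (1 + df)) + 1.0
           for gram, df in doc_freq.items()}
    return [{gram: tf * idf[gram] for gram, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "Plagiarism detection in scholarly texts",
    "P1agiarism detecti0n in scholarly texts",  # same phrase with OCR noise
    "Recipes for sourdough bread",
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

The point is not this particular table but where the knowledge goes: each rule you add to the normalization step is cheap to test against a handful of known matches, which is much harder to do with a large model.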

If you really do need labeled data, set up a strong baseline first, then look into active-learning methods and set up the labeling loop. Run a few iterations and use the resulting accuracy curve to estimate whether you will reach your target accuracy with a realistic amount of labeling.
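
A toy sketch of such a loop, assuming pool-based active learning with uncertainty sampling: the "model" is just a threshold on a single similarity score, and `oracle` stands in for a human annotator. Every name, the simulated data and the 0.62 cutoff are made up for illustration.

```python
import random


def fit_threshold(labeled):
    """Pick the score threshold that misclassifies the fewest labeled pairs."""
    scores = sorted(score for score, _ in labeled)
    candidates = [(a + b) / 2 for a, b in zip(scores, scores[1:])] or [0.5]
    return min(candidates,
               key=lambda t: sum((s >= t) != y for s, y in labeled))


def accuracy(threshold, examples):
    """Fraction of (score, label) examples the threshold classifies correctly."""
    return sum((s >= threshold) == y for s, y in examples) / len(examples)


def active_learning(pool, oracle, holdout, seed_size=4, batch=2, rounds=5):
    """Return the learning curve as (labels used, holdout accuracy) pairs."""
    pool = list(pool)
    labeled = [(s, oracle(s)) for s in pool[:seed_size]]  # initial seed set
    pool = pool[seed_size:]
    history = []
    for _ in range(rounds):
        threshold = fit_threshold(labeled)
        history.append((len(labeled), accuracy(threshold, holdout)))
        # Uncertainty sampling: query the items closest to the decision boundary.
        pool.sort(key=lambda s: abs(s - threshold))
        queries, pool = pool[:batch], pool[batch:]
        labeled += [(s, oracle(s)) for s in queries]
    return history


rng = random.Random(0)


def true_label(score):
    """Simulated annotator; the 0.62 cutoff is arbitrary."""
    return score >= 0.62


pool = [rng.random() for _ in range(200)]
holdout = [(s, true_label(s)) for s in (rng.random() for _ in range(200))]
for n_labels, acc in active_learning(pool, true_label, holdout):
    print(n_labels, round(acc, 3))
```

The printed pairs are the learning curve. If accuracy per added label flattens well below your target after a few rounds, more labeling alone is unlikely to get you there, and the effort is probably better spent on the method or the preprocessing.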