by dsalaj
1313 days ago
I have been working on an entity matching solution for two years now, and I have decided to write down some of the lessons I picked up along the way. It turns out there are too many relevant details to cover in a single post, so I will cover the topic in multiple parts. This first part is the high-level introduction, useful for the project planning and architecture decisions that need to be made early in the development process. Any feedback is welcome, as are requests for the follow-up parts if there is something specific you would like covered.
Do you have any materials on word embedding strategies past Word2Vec? BERT and beyond?
I am currently working on a recommendation engine for a large library. The original idea was to find "similar" documents; the funding comes from a plagiarism checking project.
I was slightly surprised by how deceptively simple the widely cited winnowing paper is: https://dl.acm.org/doi/10.1145/872757.872770 . The key idea is simply to thin out the hashed k-gram fingerprints, keeping only a small subset of them for comparison.
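For anyone who has not read the paper, here is a minimal sketch of the selection step as I understand it: hash every k-gram, slide a window of w consecutive hashes over the sequence, and keep the minimum hash of each window (the rightmost one on ties). Picking hashes that are 0 mod p is the older selection scheme that winnowing is compared against. The function names, the MD5-based hash, and the values k=5 and w=4 are my own assumptions for illustration, not something taken from the paper or from my project.

```python
import hashlib


def kgram_hashes(text, k):
    """Hash every character k-gram of the text to a 32-bit integer."""
    hashes = []
    for i in range(len(text) - k + 1):
        digest = hashlib.md5(text[i:i + k].encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:4], "big"))
    return hashes


def winnow(text, k=5, w=4):
    """Return a set of (hash, position) fingerprints for the text.

    k and w are illustrative defaults, not recommended values.
    """
    # Normalize: keep only letters and digits, lowercased.
    text = "".join(ch.lower() for ch in text if ch.isalnum())
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    # If there are fewer than w hashes, treat them as a single window.
    n_windows = max(len(hashes) - w + 1, 1)
    fingerprints = set()
    for start in range(n_windows):
        window = hashes[start:start + w]
        smallest = min(window)
        # On ties, take the rightmost occurrence of the minimum.
        offset = len(window) - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


def shared_fingerprints(a, b, k=5, w=4):
    """Hashes selected as fingerprints in both texts."""
    hashes_a = {h for h, _ in winnow(a, k, w)}
    hashes_b = {h for h, _ in winnow(b, k, w)}
    return hashes_a & hashes_b


if __name__ == "__main__":
    a = "The quick brown fox jumps over the lazy dog."
    b = "Someone wrote: the quick brown fox jumps, and then stopped."
    print(len(shared_fingerprints(a, b)), "shared fingerprints")
```

Because each window contributes its minimum, any shared passage of at least w + k - 1 normalized characters is guaranteed to produce at least one common fingerprint, which is what makes the scheme usable for copy detection despite storing only a fraction of the hashes.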
My project's goal is to find phrase-level similarities to assist researchers.
It seems k-grams, n-grams, tf-idf and even Word2Vec are not going to cut it. A "smarter" context-aware embedding is in order. My foray into training BERT from scratch was not very successful, partly because my corpora are not in English...
PS. As usual, I spend most of my time on improving OCR quality and preprocessing corpora...