| HN Mirror

For achieving a high accuracy for matching, it really comes down to details of your specific domain and dataset. Regarding the attempt of using large pre-trained language models to be able to find semantically similar documents, which is what you are attempting now, maybe try Whisper or other multilingual models, and then fine tune them on your dataset.

But a better bet might be actually turning looking into simpler embedding and methods and attempting to directly improve them by including some domain knowledge in the method or the process. Again, it is hard to judge what might work better just looking at the surface.

In case you really need to work with labeled datasets, set up a strong baseline, look into the active-learning methods and set up the loop, do a few iterations and try to predict if it will scale sufficiently fast to your target accuracy.