Hacker News
by gdiamos 799 days ago
Rare words behave like out-of-vocab errors in embedding vectors.

Especially if they aren’t in the token vocab.
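
To make the token-vocab point concrete, here is a minimal sketch of a greedy longest-match subword tokenizer. Everything in it is an assumption for illustration: `TOY_VOCAB`, the `tokenize` helper, and the example words are made up and do not correspond to any real tokenizer or model. It only shows how a word missing from the vocab gets broken into small pieces (or unknown markers) that carry little meaning on their own.

```python
# Toy greedy longest-match subword tokenizer (illustrative only).
# TOY_VOCAB is a hypothetical vocabulary, not taken from any real model.
TOY_VOCAB = {"time", "product", "manage", "ment", "soft", "ware",
             "k", "u", "b", "e", "r", "n", "t", "s"}

def tokenize(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split `word` into the longest vocab pieces, scanning left to right."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocab matches here: emit an unknown marker
            # and skip one character.
            pieces.append(unk)
            i += 1
    return pieces

common = tokenize("time")        # in-vocab word stays as one piece
rare = tokenize("kubernetes")    # rare word shatters into small fragments
print(common, rare)
```

A common word comes back as a single token, while the rare one is spread across many fragments, so whatever vector it ends up with is assembled from pieces that were never trained to mean that word.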

1 comments

Even worse, named entities vary from organization to organization.

We have a client who uses a product called "Time". It's time-management software. For that customer's documentation, "time" should be close to "product" and a bunch of other things that have nothing to do with the normal concept of time.

I actually suspect that people would get a lot more bang for their buck fine-tuning the embedding models on B2B datasets for their use case, rather than fine-tuning an LLM.
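
A rough sketch of what that fine-tuning is trying to achieve, using a toy embedding table. The 3-d vectors, the `pull_together` helper, and the learning rate are all hypothetical; a real setup would train an embedding model on positive pairs from the customer's docs with a contrastive-style loss, not edit vectors by hand. The only point is that "time" starts far from "product" and ends up close to it.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors standing in for a general-purpose embedding table.
emb = {
    "time": [0.9, 0.1, 0.0],
    "clock": [0.8, 0.2, 0.1],
    "product": [0.0, 0.2, 0.9],
}

def pull_together(emb, a, b, lr=0.1, steps=20):
    """Move emb[a] toward emb[b]; a crude stand-in for training on
    (a, b) positive pairs from one customer's documentation."""
    for _ in range(steps):
        emb[a] = [x + lr * (y - x) for x, y in zip(emb[a], emb[b])]

before = cosine(emb["time"], emb["product"])
pull_together(emb, "time", "product")
after = cosine(emb["time"], emb["product"])
print(f"time~product before: {before:.3f}, after: {after:.3f}")
```

Note the side effect: in this toy, "time" also drifts away from "clock", which is fine for that one customer and wrong for everyone else. That is why the adjustment would have to be per use case.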

Great example of how an entity like that could throw effective RAG out the window