


	by mind-blight 799 days ago
	I suspect the biggest difference is the input data. Embeddings are great over datasets that look like FAQs and QA docs, or data that conceptually fits into very small chunks (tweets, some product reviews, etc.). They do very badly over diverse business docs, especially with naive chunking. B2B use cases usually have old PDFs and Word docs that need to be searched, and users are often looking for specific keywords (e.g. a person's name, a product, an ID). Vectors tend to do badly in those kinds of searches, and just returning chunks misses a lot of important details.
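
To make the naive chunking point concrete, here is a minimal sketch. The document text, the name, the ID, and the 24-character chunk width are all made up for illustration; real pipelines usually chunk by tokens or sentences, but fixed-width splitting shows the failure mode most clearly.

```python
def naive_chunks(text: str, size: int) -> list[str]:
    """Split text into fixed-width character chunks, ignoring word boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical business document; the name and ID are invented for this example.
doc = "Invoice approved by Dana Whitfield for order ID-48213 on the Q3 contract."
chunks = naive_chunks(doc, 24)


def chunks_containing(keyword: str, pieces: list[str]) -> list[str]:
    """Return the chunks in which the keyword appears as an exact substring."""
    return [piece for piece in pieces if keyword in piece]


for keyword in ("Dana Whitfield", "ID-48213"):
    in_doc = keyword in doc
    hits = chunks_containing(keyword, chunks)
    print(f"{keyword!r}: in document = {in_doc}, chunks with a full match = {len(hits)}")
```

Both keywords are present in the document, but each one straddles a chunk boundary, so no single chunk contains it in full. Any retrieval step that only sees individual chunks has already lost the exact match before embeddings even come into play.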


gdiamos 799 days ago

Rare words act like out-of-vocab errors in vectors,

especially if they aren't in the token vocab.
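
A rough sketch of what that looks like at the tokenizer level, assuming a WordPiece-style greedy longest-match split. The vocabulary and the words `zyloxa` and `qv-7731` are invented for illustration; real vocabularies hold tens of thousands of entries, and other tokenizers (e.g. byte-level BPE) fall back to bytes instead of an unknown token.

```python
def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first subword split (WordPiece-style).

    Continuation pieces carry a '##' prefix. If any position cannot be
    matched, the whole word collapses to the unknown token.
    """
    pieces: list[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, purely hypothetical.
TOY_VOCAB = {"time", "product", "zy", "##lo", "##x", "##a"}

for word in ("time", "zyloxa", "qv-7731"):
    print(word, "->", wordpiece(word, TOY_VOCAB))
```

The common word stays a single token, the rare product name is shredded into fragments that carry little meaning on their own, and the ID collapses to the unknown token, so nothing about it survives into the embedding.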


mind-blight 799 days ago

Even worse, named entities vary from organization to organization.

We have a client who uses a product called "Time". It's time management software. For that customer's documentation, "time" should be close to "product" and a bunch of other things that have nothing to do with the normal concept of time.

I actually suspect that people would get a lot more bang for their buck fine-tuning the embedding models on B2B datasets for their use case, rather than fine-tuning an LLM.


Yacovlewis 797 days ago

Great example of how an entity like that could throw effective RAG out the window
