| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by rahimnathwani 1162 days ago

This will depend on the specific model you're using, because:

- if a model has been trained on shorter paragraphs, it will likely do better on those than on longer ones, and vice versa

- each model has some maximum input length (e.g. 512 tokens, or about 350 words), and might silently discard words when it's given a longer chunk

I don't know whether or not processing multiple lengths is worthwhile, but you probably want to have some overlap when you turn your docs into chunks.

Maybe take a look at Langchain or LlamaGPT: someone has probably come up with sensible defaults for overlap and chunk size.

If you want to do embeddings locally, check out sentence-transformers/all-MiniLM-L6-v2