by rahimnathwani
1162 days ago
This will depend on the specific model you're using, because:

- If a model has been trained on shorter paragraphs, it will likely do better on those than on longer ones, and vice versa.
- Each model has some maximum input length (e.g. 512 tokens, or about 350 words), and might silently discard words when it's given a longer chunk.

I don't know whether processing multiple lengths is worthwhile, but you probably want some overlap when you turn your docs into chunks. Maybe take a look at Langchain or LlamaGPT: someone has probably come up with sensible defaults for overlap and chunk size.

If you want to do embeddings locally, check out sentence-transformers/all-MiniLM-L6-v2
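
To make the overlap idea concrete, here is a rough sketch of word-based chunking in plain Python. The function name `chunk_words` and the defaults (200 words per chunk, 50 words of overlap) are my own assumptions for illustration, not recommended values from any library; the only constraint taken from the above is staying under the model's input limit.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words, so a sentence that
    straddles a boundary still appears whole in at least one chunk.
    The defaults are illustrative assumptions, not tuned values.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text,
        # to avoid emitting a trailing chunk that is pure overlap.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk could then be embedded locally, for example with `SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(chunks)` from the sentence-transformers package. Note this counts words, not tokens, so it only approximates the model's real limit.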