by bravura
1040 days ago
The most straightforward way (though of course you can fiddle around a lot): use sentence transformers (https://www.sbert.net/). Start with a strong baseline like all-MiniLM-L6-v2, or get fancier with something from the Massive Text Embedding Benchmark (https://huggingface.co/spaces/mteb/leaderboard). Break your text into sentences or paragraphs of no more than 512 tokens, as counted by the sentence transformers tokenizer (check the limit of the model you pick; all-MiniLM-L6-v2 truncates at 256 word pieces by default). Embed all your texts and insert them into your vector DB.
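A rough sketch of that pipeline, in case it helps. Assumptions on my part: the text is already split into sentences, `pack_sentences` and `embed_chunks` are hypothetical helper names, and the token budget is read from the model's `max_seq_length` rather than hard-coded. The vector DB insert is left out because it depends on which client you use.

```python
from typing import Callable, List


def pack_sentences(sentences: List[str],
                   count_tokens: Callable[[str], int],
                   max_tokens: int = 512) -> List[str]:
    """Greedily merge consecutive sentences into chunks of at most max_tokens.

    A single sentence longer than max_tokens is kept as its own chunk;
    the embedding model will truncate it.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def embed_chunks(sentences: List[str], model_name: str = "all-MiniLM-L6-v2"):
    """Chunk the sentences, then embed each chunk (hypothetical helper)."""
    # Imported lazily so the chunking helper works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)

    def count_tokens(text: str) -> int:
        # Count word pieces with the model's own tokenizer (no special tokens).
        return len(model.tokenizer.tokenize(text))

    # Leave room for the two special tokens the tokenizer adds per input.
    budget = model.max_seq_length - 2
    chunks = pack_sentences(sentences, count_tokens, max_tokens=budget)
    # Unit-length vectors, so dot product equals cosine similarity.
    vectors = model.encode(chunks, normalize_embeddings=True)
    return chunks, vectors
```

Calling `embed_chunks(my_sentences)` gives you the chunk texts and one vector per chunk; store both in your vector DB so a hit can be mapped back to its text. Summing per-sentence token counts is only approximate, so treat the budget as a soft limit.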