by jclem
1048 days ago
It seems worth noting that gte-small’s input is limited to 512 tokens. The tokenizers certainly aren’t the same, but I imagine this is significantly less than ada-002, whose input is limited to 8191 tokens. That said, I don’t think embedding huge full documents is necessarily the right approach. I would love to see a comparison for some typical use cases using various methods of chunking input documents.
|
For example, in the chatgpt-retrieval-plugin[0] repo the default chunk size is just 200 tokens.
The 512-token limit is still a limitation, no doubt, but chunking is used pretty often.
[0] https://github.com/openai/chatgpt-retrieval-plugin/blob/main...
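
To make the chunking idea concrete, here is a minimal sketch of fixed-size chunking with optional overlap. It is not the plugin's actual code: `chunk_tokens` is a hypothetical helper, the 200-token default only mirrors the chunk size mentioned above, and whitespace splitting stands in for a real tokenizer (gte-small and ada-002 each use their own, so real token counts will differ).

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str], chunk_size: int = 200, overlap: int = 0
) -> List[List[str]]:
    """Split a token sequence into windows of at most `chunk_size` tokens.

    Hypothetical helper for illustration. Consecutive windows share
    `overlap` tokens, so context is not cut hard at chunk boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end of the input, so we do
        # not emit a trailing chunk that is fully contained in this one.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    # Assumption: whitespace tokens as a rough stand-in for model tokens.
    document = "word " * 450
    tokens = document.split()
    for i, chunk in enumerate(chunk_tokens(tokens, chunk_size=200, overlap=50)):
        print(i, len(chunk))
```

Each chunk would then be embedded separately, so a 512-token input limit only constrains `chunk_size`, not the length of the whole document.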