| HN Mirror



	by jclem 1048 days ago
	It seems worth noting that the input of gte-small is limited to 512 tokens. The tokenizers for sure aren’t the same, but I imagine this is significantly less than ada-002, whose input is limited to 8191 tokens. That said, I don’t imagine that embedding huge full documents is necessarily the right approach. I would love to see a comparison for some typical use cases using various methods of chunking input documents.

1 comment

Probably yes, but even in OpenAI's own examples they usually split documents into chunks.

For example, in the chatgpt-retrieval-plugin[0] repo the default chunk size is just 200 tokens.

The 512-token limit is still a limitation, no doubt, but chunking is used pretty often anyway.
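
To make the chunking idea concrete, here is a rough sketch of fixed-size chunking with optional overlap. It is not the code from that repo: `chunk_tokens` and `chunk_text` are hypothetical names, and whitespace-separated words stand in for real tokens, so the counts will not match what gte-small's or ada-002's tokenizer would report.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 200, overlap: int = 0) -> List[List[str]]:
    """Split a token list into windows of at most `chunk_size` tokens.

    Consecutive windows share `overlap` tokens. Both parameters are
    illustrative defaults, not values taken from any particular library.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, so no trailing chunk
        # consists only of tokens already covered by the overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 0) -> List[str]:
    """Chunk raw text, using whitespace splitting as a stand-in tokenizer (assumption)."""
    tokens = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(tokens, chunk_size, overlap)]


if __name__ == "__main__":
    document = " ".join(f"word{i}" for i in range(450))
    for i, chunk in enumerate(chunk_text(document, chunk_size=200, overlap=50)):
        print(i, len(chunk.split()))
```

With a real tokenizer you would swap `text.split()` for the model's own encode step and decode each window back to text, keeping `chunk_size` under the model's input limit (512 for gte-small, per the parent comment).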