| HN Mirror


	by hivacruz 752 days ago
	Did you use the same method, i.e. split each article into chunks and vectorize each chunk?

2 comments

dudus 752 days ago

That's the only way to do it. You can't index the whole thing. The challenge is chunking. There are several different algorithms to chunk content for vectorization with different pros and cons.
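
To make the chunking step concrete, here is a minimal sketch of one of the simplest approaches: fixed-size word windows with overlap. This is only an illustration, not necessarily what anyone in this thread uses; the function name `chunk_words` and the default sizes are assumptions, and a real pipeline would usually count tokens rather than words.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk_size` words.

    Consecutive windows share `overlap` words, so a sentence that straddles
    a boundary still appears whole in at least one chunk.
    (Hypothetical helper; sizes are illustrative, not recommendations.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each returned chunk would then be passed to the embedding model separately. The trade-off is the usual one: more overlap means fewer ideas cut in half at a boundary, but more vectors to store and search.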

minimaxir 752 days ago

You can do much bigger chunks with models that support RoPE embeddings, such as nomic-embed-text-v1.5, which has an 8192-token context length: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5

In theory this would be an efficiency boost but the performance math can be tricky.
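
One way to read the "tricky math", as a rough sketch under stated assumptions: bigger chunks mean fewer vectors to store and search, but self-attention cost grows roughly with the square of the chunk length. The function below (the name `attention_cost` is hypothetical) counts only that quadratic term, ignores the parts of the model that scale linearly, and assumes every chunk is padded to full length.

```python
def attention_cost(total_tokens: int, chunk_tokens: int) -> tuple[int, int]:
    """Return (number of chunks, relative self-attention cost).

    Assumption: cost per chunk is proportional to chunk_tokens ** 2,
    because each token attends to every other token in the same chunk.
    Feed-forward layers and other linear-cost terms are ignored.
    """
    if total_tokens <= 0 or chunk_tokens <= 0:
        raise ValueError("token counts must be positive")
    num_chunks = -(-total_tokens // chunk_tokens)  # ceiling division
    return num_chunks, num_chunks * chunk_tokens ** 2


# Compare embedding an 8192-token article as 512-token chunks vs. one big chunk.
small_chunks, small_cost = attention_cost(8192, 512)
big_chunks, big_cost = attention_cost(8192, 8192)
print(small_chunks, big_chunks, big_cost // small_cost)
```

Under these assumptions the single 8192-token chunk yields 16x fewer vectors but about 16x more attention work than sixteen 512-token chunks, so whether it is a net win depends on where your bottleneck is.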

qudat 752 days ago

As far as I understand it, longer context degrades LLM performance, so just because an LLM "supports" a large context length doesn't mean it uses all of it well: it basically just pays attention to a top and bottom chunk and skips over the middle bits.

rahimnathwani 752 days ago

Why would you want chunks that big for vector search? Wouldn't there be too much information in each chunk, making it harder to match a query to a concept within the chunk?

nostrebored 752 days ago

The problem is that often semantic meaning depends on state multiple paragraphs or sections away.

This is a coarse way to tackle that.

gfourfour 752 days ago

Yes
