| HN Mirror



	by Kelamir 1104 days ago
	> We start by parsing documents into chunks. A sensible default is to chunk documents by token length, typically 1,500 to 3,000 tokens per chunk. However, I found that this didn’t work very well. A better approach might be to chunk by paragraphs (e.g., split on \n\n).

	Hmm, good insight there. I've experimented with chunking by length before, and it was pretty troublesome because of missing context.

2 comments

gwern 1104 days ago

You don't do a sliding window? That seems like the logical way to maintain context but allow look up by 'chunks'. Embed it, say, 3 paragraphs at a time, advancing 1 paragraph per embedding.
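
A minimal sketch of that sliding window, assuming paragraphs are separated by blank lines. The function name `sliding_paragraph_windows` and its parameters are hypothetical, and the embedding call itself is left out; each returned string is what would be embedded.

```python
def sliding_paragraph_windows(text, window=3, stride=1):
    """Return overlapping chunks of `window` consecutive paragraphs."""
    # Assumption: paragraphs are separated by a blank line ("\n\n").
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    n = len(paragraphs)
    if n == 0:
        return []
    if n <= window:
        # Too short for more than one window: return everything as one chunk.
        return ["\n\n".join(paragraphs)]
    starts = list(range(0, n - window + 1, stride))
    # With stride > 1 the last paragraphs could be skipped, so add a tail window.
    if starts[-1] != n - window:
        starts.append(n - window)
    return ["\n\n".join(paragraphs[i:i + window]) for i in starts]
```

With the defaults, each paragraph (except those near the edges) ends up in three chunks, so a query can match a paragraph together with either its preceding or its following context.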

chaxor 1104 days ago

This is only a good idea if you are *specifically not* using OpenAI.

If you use local models then it's a fantastic idea.

screye 1104 days ago

If you're concatenating after chunking, then the overlapping windows add quite a lot of repetition. Also, if a chunk cuts off mid-JSON or in the middle of other structured output, overlapping windows cause issues once again.

Define a custom recursive text splitter in LangChain and do the chunking heuristically. It works a lot better.

That being said, it is useful to maintain some global and local context. But, I wouldn't use overlapping windows.
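
To illustrate the recursive, non-overlapping splitting idea in general terms, here is a plain-Python sketch. It is not LangChain's implementation or API: the function `recursive_split`, the separator list, and the character-based size limit are all assumptions made for the example.

```python
def recursive_split(text, max_chars=1000, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left to try: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, max_chars, rest))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]
```

The heuristic part lives in the separator list: for structured text, separators that respect the structure (for example, one record per line) would go first so that chunks are less likely to break mid-record.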

rahimnathwani 1104 days ago

In place of simply concatenating after chunking, a more effective approach might be to retrieve and return the corresponding segments from the original documents that are relevant to the context. For instance, if we're dealing with short pieces of text such as Hacker News comments, it's fairly straightforward. Any partial match can prompt the return of the entire comment as it is.

When working with more extensive documents, the process gets a bit more intricate. In this case, your embedding database might need to hold more information per entry. Ideally, for each document, the database should store identifiers like the document ID, the starting token number, and the ending token number. This way, even if a document appears more than once among the top results from a query, it's possible to piece together the full relevant excerpt accurately.
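
A rough sketch of that bookkeeping, assuming each search hit comes back as a `(doc_id, start_token, end_token)` tuple with an exclusive end. The function name `merge_hits` and the tuple layout are hypothetical, not taken from any particular vector database.

```python
from collections import defaultdict

def merge_hits(hits):
    """Merge overlapping or touching token ranges, grouped by document ID."""
    by_doc = defaultdict(list)
    for doc_id, start, end in hits:
        by_doc[doc_id].append((start, end))

    merged = {}
    for doc_id, spans in by_doc.items():
        spans.sort()
        out = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:
                # Overlaps or touches the previous span: extend it.
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[doc_id] = out
    return merged
```

Each merged range can then be used to slice the original document's tokens, so overlapping hits from the same document are returned once as a single contiguous excerpt.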

gwern 1104 days ago

I don't think the repetition is a problem. He's using a local model for human-assisted writing with pre-generated embeddings - he can use essentially an arbitrary number of embedding calls, as long as it's more useful for the human. So it's just a question of whether that improves the quality or not. (Not that the cost would be more than a rounding error to embed your typical personal wiki with something like the OA API, especially since they just dropped the prices of embeddings again.)

SmooL 1104 days ago

I've thought about doing this as well, but I haven't tried it yet. Are there any resources/blogs/information on various strategies on how to best chunk & embed arbitrary text?

busseio 1104 days ago

I’ve been experimenting with sliding window chunking using SRT files. They’re the subtitle format for television and have 1 to _n_ sequence numbers for each chunk, along with time stamps for when the chunk should appear on the screen. Traditionally it’s two lines of text per chunk but you can make chunks of other line counts and sizes. Much of my work with this has been with SRT files that are transcriptions exported from Otter.ai; GPT-3.5 & 4 natively understand the SRT format and the concepts of the sequence numbers and time stamps, so you can refer to them or ask for confirmation of them in a prompt.
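
As a sketch of what sliding-window chunking over SRT could look like, assuming well-formed cues (a sequence number line, a `start --> end` timestamp line, then one or more text lines, separated by blank lines). The helper names `parse_srt` and `srt_windows` are hypothetical.

```python
import re

def parse_srt(srt_text):
    """Parse SRT text into a list of (sequence, start, end, text) cues."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip malformed or empty cues
        start, _, end = lines[1].partition(" --> ")
        cues.append((int(lines[0]), start.strip(), end.strip(), " ".join(lines[2:])))
    return cues

def srt_windows(cues, window=3, stride=1):
    """Group consecutive cues into overlapping windows, keeping sequence and time ranges."""
    out = []
    for i in range(0, max(len(cues) - window, 0) + 1, stride):
        group = cues[i:i + window]
        if not group:
            break
        out.append({
            "first_seq": group[0][0],
            "last_seq": group[-1][0],
            "start": group[0][1],
            "end": group[-1][2],
            "text": " ".join(cue[3] for cue in group),
        })
    return out
```

Keeping the first and last sequence numbers and timestamps alongside each window means a retrieved chunk can still be referred to by its position in the transcript.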

crucialfelix 1104 days ago

The unstructured package works well for partitioning text, Markdown, HTML, and even PDF on structural boundaries like paragraphs, headings (h), horizontal rules (hr), etc.

https://unstructured-io.github.io/unstructured/bricks.html#p...
