| HN Mirror



	by gthompson512 392 days ago
How does it handle documents longer than the context length of the model? Sorry, there are a ton of these regularly and they don't usually think about this.

Edit: it seems like it just splits into sentences, which is a weird thing to do given that in English only about 95% agreement is even possible on what a sentence is.

```
// Process in batches
for batch in sentences.chunks(batch_size) {
    // Truncate each sentence to max_length * median_token_length chars
    let truncated: Vec<&str> = batch
        .iter()
        .map(|text| {
            if let Some(max_tok) = max_length {
                Self::truncate_str(text, max_tok, self.median_token_length)
            } else {
                text.as_str()
            }
        })
        .collect();
```
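For readers following along: the comment in the quoted snippet says each input is cut to `max_length * median_token_length` characters before tokenization. A minimal sketch of what such a character-budget truncation could look like is below. This is an assumption for illustration, not the library's actual `truncate_str`; the name `truncate_chars` and its parameters are hypothetical.

```rust
/// Hypothetical helper (not the library's `truncate_str`): keep at most
/// `max_tokens * median_token_length` characters of `text`.
/// The cut always falls on a UTF-8 character boundary, so it cannot panic
/// on multi-byte characters.
fn truncate_chars(text: &str, max_tokens: usize, median_token_length: usize) -> &str {
    // Character budget; saturating_mul avoids overflow on huge limits.
    let max_chars = max_tokens.saturating_mul(median_token_length);
    // Find the byte offset of the first character past the budget.
    match text.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &text[..byte_idx],
        // Fewer than `max_chars` characters: nothing to cut.
        None => text,
    }
}
```

Note that this only shortens each input as a cheap pre-filter; it does not split one input into several pieces.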

1 comments

gthompson512 392 days ago

Sorry, looking more, it doesn't seem like you are doing what you are saying. This is just poorly breaking text into bad chunks with no regard for semantics and is like ~200 lines of actual code. What is this for? Most models can handle fairly large contexts.

Edit: That wasn't intended to be mean, although it may come off that way, but what is this supposed to be for? Personally, I have texts of >8k tokens that need to be embedded, and I test things like this regularly.

Tananon 390 days ago

I think you are referring to "for batch in sentences.chunks(batch_size)"? This is not actually chunking sentences: chunks() is simply an iterator over a slice (in this case, the slice of all our input sentences) that yields sub-slices of at most batch_size items. We don't have an actual constraint on input length. We truncate to 512 tokens by default, but you can easily set that to any amount by directly calling encode_with_args. There's an example in our quickstart: https://github.com/MinishLab/model2vec-rs/tree/main?tab=read....
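To illustrate the point about `chunks()`, here is a small sketch using only the Rust standard library. It groups whole inputs into batches and never splits an individual string. The function name `group_into_batches` is made up for this example and is not part of model2vec-rs.

```rust
/// Illustrative only: group whole inputs into batches with `slice::chunks`.
/// Each batch holds up to `batch_size` inputs; the last batch may be shorter.
/// No input string is ever split. `chunks` panics if `batch_size` is 0.
fn group_into_batches(inputs: &[String], batch_size: usize) -> Vec<Vec<&str>> {
    inputs
        .chunks(batch_size)
        .map(|batch| batch.iter().map(|s| s.as_str()).collect())
        .collect()
}
```

With three inputs and a batch size of 2, this yields one batch of two inputs and one batch of one input, each input left intact.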

stephantul 391 days ago

It doesn’t break text into chunks at all. These models can handle sequences of arbitrary length.

jasonjmcghee 391 days ago

I believe parent is referring to:

https://github.com/MinishLab/model2vec-rs/blob/480ec988d7f4a...
