by gthompson512
392 days ago
How does it handle documents longer than the model's context length? Sorry, there are a ton of these posted regularly and they don't usually think about this. Edit: it seems like it just splits into sentences, which is a weird thing to do given that, in English, only about 95% agreement is even possible on what a sentence is.
```
// Process in batches
for batch in sentences.chunks(batch_size) {
    // Truncate each sentence to max_length * median_token_length chars
    let truncated: Vec<&str> = batch
        .iter()
        .map(|text| {
            if let Some(max_tok) = max_length {
                Self::truncate_str(text, max_tok, self.median_token_length)
            } else {
                text.as_str()
            }
        })
        .collect();
```
Edit: That wasn't intended to be mean, although it may come off that way, but what is this supposed to be for? I have texts of more than 8k tokens that need to be embedded, and I test tools like this regularly.
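To make the concern concrete, here is a rough sketch of the difference between truncating and chunking on a character budget. The body of `truncate_str` below is my guess at what a helper with that signature might do, not the project's actual code, and `chunk_by_budget` is a hypothetical name for an alternative that keeps the whole text. Both assume the budget is counted in bytes and cut back to a UTF-8 character boundary.

```rust
/// Guessed stand-in for `truncate_str` (an assumption, not the project's code):
/// keep at most `max_tokens * median_token_length` bytes of `text`,
/// backing off to the nearest UTF-8 character boundary.
fn truncate_str(text: &str, max_tokens: usize, median_token_length: usize) -> &str {
    let budget = max_tokens.saturating_mul(median_token_length);
    if text.len() <= budget {
        return text;
    }
    let mut end = budget;
    // Index 0 is always a boundary, so this loop terminates.
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}

/// Hypothetical alternative: split `text` into consecutive pieces of at most
/// `budget` bytes each, so nothing past the first window is dropped.
fn chunk_by_budget(text: &str, budget: usize) -> Vec<&str> {
    let mut chunks = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        let mut end = budget.min(rest.len());
        while !rest.is_char_boundary(end) {
            end -= 1;
        }
        if end == 0 {
            // Budget is smaller than the first character: take that one
            // character whole so the loop always makes progress.
            end = rest.chars().next().map_or(rest.len(), |c| c.len_utf8());
        }
        let (head, tail) = rest.split_at(end);
        chunks.push(head);
        rest = tail;
    }
    chunks
}
```

With truncation, everything past the budget is silently discarded; with chunking, each piece could be embedded separately and the vectors pooled afterwards. Whether pooling is appropriate depends on the model and the use case, so treat this as an illustration of the trade-off rather than a recommendation.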