by bravura 582 days ago
One thing I've been looking for, and found a bit tricky to implement myself in a way that is very fast, is this:

I have a particular max token length in mind, and I have a tokenizer like tiktoken. Given a string, I want to quickly find the longest truncation (prefix) of the string whose token count is <= the target max token length.
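To make the question concrete, here is a minimal sketch of one way to do it: binary search over the prefix length, calling the tokenizer O(log n) times. The names `truncate_to_token_limit` and `count_tokens` are hypothetical, and the sketch assumes the token count never decreases as the prefix grows. That holds for the toy whitespace tokenizer below but is only approximately true for BPE tokenizers, so treat the result as a close answer rather than a guaranteed optimum.

```python
from typing import Callable


def truncate_to_token_limit(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> str:
    """Return the longest prefix of `text` with count_tokens(prefix) <= max_tokens.

    Assumption: count_tokens is non-decreasing in the prefix length.
    """
    if count_tokens(text) <= max_tokens:
        return text
    # Invariant: text[:lo] fits within the limit, text[:hi] does not.
    lo, hi = 0, len(text)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if count_tokens(text[:mid]) <= max_tokens:
            lo = mid
        else:
            hi = mid
    return text[:lo]


def whitespace_count(s: str) -> int:
    """Toy stand-in tokenizer for illustration: one token per whitespace-separated word."""
    return len(s.split())


prefix = truncate_to_token_limit("the quick brown fox jumps", 3, whitespace_count)
```

With tiktoken, `count_tokens` could be something like `lambda s: len(enc.encode(s))` where `enc = tiktoken.get_encoding("cl100k_base")`. Encoding once, slicing the first N tokens, and decoding is another option, but the decoded text is not guaranteed to be an exact prefix of the original string when the cut lands inside a multi-byte character.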

Does chonkie handle this?

1 comments

I don't fully understand what you mean by "maximum length truncation of the string", but if you're talking about splitting the text into 'chunks' whose token counts are less than a pre-specified max_token length, then yes!

Is that what you meant?

I'm not sure if this is what they mean, but this is a use case that I have dealt with and had to roll my own code for:

Given a list of sentences, find the largest in-order group of sentences that fits within a max token length while keeping a natural coherence.

In my case I used a fuzzy token limit: the chunker would choose a smaller group of sentences that fit into a single paragraph or a single common structure instead of cramming in every possible sentence until it ran out of room. It would also go slightly over the limit when doing so was beneficial.

A simple example: given an alphabetized set, instead of making one chunk from the A items through part of the B items, it would end at the A items with tokens to spare, or, if it only took an extra 10%, it would finish the B items. Most of the time it just decided to end chunks at paragraph boundaries instead of continuing into the middle of the next paragraph.
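A rough sketch of that fuzzy-limit idea, not the code I actually used: sentences arrive pre-grouped (by paragraph, or by first letter in the alphabetized example), groups are never split, and a group is appended to the current chunk only if the total stays within the target plus an allowed overshoot. The names `chunk_groups`, `count_tokens`, and `overshoot` are hypothetical, and the whitespace tokenizer is a stand-in for a real one.

```python
from typing import Callable, List


def chunk_groups(
    groups: List[List[str]],
    target_tokens: int,
    count_tokens: Callable[[str], int],
    overshoot: float = 0.10,
) -> List[List[str]]:
    """Greedily pack whole groups of sentences into chunks.

    A chunk may exceed target_tokens by at most `overshoot` (a fraction) in
    order to finish a group; otherwise it closes early with tokens to spare.
    Simplification: a single group larger than the limit is emitted as its
    own oversized chunk instead of being split by sentence.
    """
    hard_limit = target_tokens * (1 + overshoot)
    chunks: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for group in groups:
        size = sum(count_tokens(sentence) for sentence in group)
        # Close the chunk at the group boundary rather than cut into the group.
        if current and current_tokens + size > hard_limit:
            chunks.append(current)
            current, current_tokens = [], 0
        current.extend(group)
        current_tokens += size
    if current:
        chunks.append(current)
    return chunks


def whitespace_count(s: str) -> int:
    """Toy stand-in tokenizer: one token per whitespace-separated word."""
    return len(s.split())


a_items = ["apple pie", "avocado toast recipe"]               # 5 tokens
b_items = ["banana bread", "blueberry jam", "brown butter"]   # 6 tokens
c_items = ["carrot cake"]                                     # 2 tokens
groups = [a_items, b_items, c_items]

# Target 10 with 25% slack: finishing the B items (11 tokens) is allowed.
loose = chunk_groups(groups, 10, whitespace_count, overshoot=0.25)
# Target 8 with 25% slack: the first chunk ends at the A items instead.
tight = chunk_groups(groups, 8, whitespace_count, overshoot=0.25)
```

Detecting the groups in the first place (paragraph breaks, headings, shared structure) is the harder part and is left out here.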