by jimmySixDOF 579 days ago
For a regex approach, take a look at the work from Jina.ai, who among other things have a chunker/tokenizer [1] that is now part of a bigger API service [2]. They also developed an interesting late-interaction (aka ColBERT-like) chunking system that fits certain use cases. But the regex is enough all by itself:

[1] https://gist.github.com/LukasKriesch/e75a0132e93ca989f8870c4...

[2] https://jina.ai/segmenter/
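
To give a rough idea of what regex-only chunking looks like, here is a minimal sketch. It is not the Jina pattern from [1], which is far more elaborate. The two patterns, the `chunk_text` name, and the `max_chars` parameter are all made up for illustration: split on blank lines first, then pack sentences into chunks up to a size limit.

```python
import re

# Hypothetical, heavily simplified patterns (NOT the Jina regex from [1]).
PARAGRAPH_BREAK = re.compile(r"\n\s*\n")      # one or more blank lines
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")   # whitespace after ., ! or ?


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks that never cross a paragraph boundary.

    Sentences are packed greedily until adding one more would exceed
    max_chars. A single sentence longer than max_chars is kept whole.
    """
    chunks: list[str] = []
    for paragraph in PARAGRAPH_BREAK.split(text.strip()):
        # Collapse internal line breaks and repeated spaces.
        paragraph = " ".join(paragraph.split())
        if not paragraph:
            continue
        current = ""
        for sentence in SENTENCE_END.split(paragraph):
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First sentence. Second sentence.\n\nNew paragraph here."
    for chunk in chunk_text(sample, max_chars=20):
        print(repr(chunk))
```

The naive sentence pattern will mis-split on abbreviations like "e.g." and knows nothing about markdown, code blocks, or lists, which is the kind of thing a production-grade pattern has to handle.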