| HN Mirror

by snyy 204 days ago

Which language are you thinking of? Ideally, how would you identify split points in this language?

I suppose we've only tested this with languages that do have delimiters: Hindi, English, Spanish, and French.

There are two ways to control the split point. The first is through delimiters, and the second is by setting the chunk size. If you're parsing a language where chunks can't be described by either of those params, then I suppose memchunk wouldn't work. I'd be curious to see what does work though!
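To make the two knobs concrete, here is a minimal std-only sketch of how a size limit and a set of single-byte delimiters might interact. This is not memchunk's implementation or API; `chunk_by_size` and its parameters are hypothetical names chosen for illustration.

```rust
/// Split `text` into chunks of at most `max_size` bytes, preferring to cut
/// just after the last delimiter byte found inside each window.
/// Hypothetical helper for illustration only; not part of memchunk.
fn chunk_by_size<'a>(text: &'a [u8], max_size: usize, delimiters: &[u8]) -> Vec<&'a [u8]> {
    assert!(max_size > 0, "max_size must be non-zero");
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let end = (start + max_size).min(text.len());
        let cut = if end == text.len() {
            // The remainder fits in one chunk.
            end
        } else {
            // Search backwards for a delimiter and cut right after it.
            match text[start..end].iter().rposition(|b| delimiters.contains(b)) {
                Some(i) => start + i + 1,
                // No delimiter in the window: hard split at the size limit.
                // Note this can land in the middle of a multi-byte character.
                None => end,
            }
        };
        chunks.push(&text[start..cut]);
        start = cut;
    }
    chunks
}
```

With `max_size = 15` and delimiters `b".?"`, the input `"Hello world. How are you? Fine."` splits after each sentence. A text with no delimiters at all falls back to fixed-size cuts, which is the case the question above is really about.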

2 comments

smlacy 204 days ago

There are certainly cases of Greek/Latin without any punctuation at all, typically in a historical context. Chinese & Japanese historically did not have any punctuation whatsoever.

ks2048 204 days ago

Do the delimiters have to be single bytes? e.g. Japanese full stop (IDEOGRAPHIC FULL STOP) is 3 bytes in UTF-8.

snyy 204 days ago

No, delimiters can be multiple bytes. They have to be passed as a pattern.

    // With multi-byte pattern
    let metaspace = "<japanese_full_stop>".as_bytes();
    let chunks: Vec<&[u8]> = chunk(text).pattern(metaspace).prefix().collect();
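For readers without the library at hand, the idea of splitting on a multi-byte delimiter can be sketched with the standard library alone. This is an illustrative sketch, not how memchunk does it: `split_after_pattern` is a hypothetical name, the scan is a naive byte comparison, and the delimiter is kept at the end of each chunk, which may differ from what `.prefix()` does above.

```rust
/// Split `text` at every occurrence of `pattern`, keeping the pattern at the
/// end of the chunk it terminates. Naive O(n * m) scan, for illustration only.
/// Hypothetical helper; not part of memchunk.
fn split_after_pattern<'a>(text: &'a [u8], pattern: &[u8]) -> Vec<&'a [u8]> {
    assert!(!pattern.is_empty(), "pattern must be non-empty");
    let mut chunks = Vec::new();
    let mut start = 0; // start of the chunk currently being built
    let mut i = 0; // current scan position
    while i + pattern.len() <= text.len() {
        if &text[i..i + pattern.len()] == pattern {
            let end = i + pattern.len();
            chunks.push(&text[start..end]);
            start = end;
            i = end;
        } else {
            i += 1;
        }
    }
    // Whatever follows the last delimiter becomes the final chunk.
    if start < text.len() {
        chunks.push(&text[start..]);
    }
    chunks
}

/// U+3002 IDEOGRAPHIC FULL STOP, which encodes to 3 bytes in UTF-8.
const IDEOGRAPHIC_FULL_STOP: &str = "\u{3002}";
```

Calling `split_after_pattern(text.as_bytes(), IDEOGRAPHIC_FULL_STOP.as_bytes())` on valid UTF-8 input only matches whole characters, because UTF-8 is self-synchronizing: the encoded bytes of one character never appear inside the encoding of a different one.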
