by mschoch 4028 days ago

If you just want to segment larger blocks of text into tokens, you can try the segment library. It implements the word boundary portion of Unicode Standard Annex #29:

https://github.com/blevesearch/segment
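
To make "segmenting text into tokens" concrete, here is a rough stdlib-only sketch in Go. It does not use the segment library's API and it is not an implementation of the Annex #29 rules. The helper name `simpleWords` is made up for illustration, and the rule it applies (split on anything that is not a letter or number) is a simplifying assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// simpleWords is a hypothetical helper, not part of blevesearch/segment.
// It splits text on every rune that is neither a letter nor a number,
// which is a much cruder rule than the UAX #29 word boundary algorithm.
func simpleWords(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}

func main() {
	// Print one token per line.
	for _, tok := range simpleWords("Hello, wörld! It's 2015.") {
		fmt.Println(tok)
	}
}
```

Note that this naive rule breaks "It's" into "It" and "s". The Annex #29 word boundary rules keep a word with an internal apostrophe together, which is one reason to prefer a real implementation over a quick split like this.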

If you need more manipulation of tokens after segmentation/tokenization, you could look at the analysis sub-package of bleve. It's intended to be usable independently of the rest of the library.

https://github.com/blevesearch/bleve