by mschoch
4028 days ago
If you just want to segment larger blocks of text into tokens, you can try the segment library (it implements the word boundary portion of Unicode Standard Annex #29): https://github.com/blevesearch/segment If you need more manipulation of tokens after segmentation/tokenization, you could look at the analysis sub-package of bleve. It's intended to be usable independently of the rest of the library. https://github.com/blevesearch/bleve