by diasks2
4105 days ago
I analyzed several sentence segmentation tools while working on my own rule-based segmenter. The results are in this README (https://github.com/diasks2/pragmatic_segmenter). I think this blog post almost hits on the key point in the middle: in my opinion it is important to test all of the edge cases. The problem with most corpora typically used to test segmenters is that 80-90% of the sentences are of the same kind (i.e. a regular sentence ending in a period). Thus a segmenter that simply split the text at every period would still show an 80-90% accuracy rate. This is why I am trying to develop a standardized set of edge cases: https://github.com/diasks2/pragmatic_segmenter#the-golden-ru...
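
To make the accuracy point concrete, here is a minimal Python sketch. It is not taken from pragmatic_segmenter; the sentences and the helper names `naive_split` and `accuracy` are made up for illustration. It scores a split-at-every-period segmenter on a tiny test set where 8 of 10 cases are regular sentences and 2 are edge cases.

```python
import re

def naive_split(text):
    """Hypothetical baseline: split after every period that is followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]

def accuracy(segmenter, cases):
    """Fraction of cases where the segmenter reproduces the gold sentences exactly.

    Each case is the gold list of sentences; the input text is those
    sentences joined by single spaces.
    """
    correct = sum(segmenter(" ".join(gold)) == gold for gold in cases)
    return correct / len(cases)

# Illustrative data (an assumption, not a real corpus): regular sentences
# that end in a period and contain no other periods.
REGULAR = [
    ["The cat sat down.", "The dog barked."],
    ["It rained all day.", "We stayed inside."],
    ["She opened the door.", "Nobody was there."],
    ["The train was late.", "I missed the meeting."],
    ["He likes tea.", "She prefers coffee."],
    ["The test passed.", "The build is green."],
    ["We left early.", "Traffic was light."],
    ["The book is long.", "The ending is good."],
]

# Illustrative edge cases: abbreviations whose periods do not end a sentence.
EDGE = [
    ["Dr. Lee arrived at 5 p.m. yesterday."],
    ["Please see p. 55 for details."],
]

print(f"mixed test set: {accuracy(naive_split, REGULAR + EDGE):.0%}")  # prints "mixed test set: 80%"
print(f"edge cases only: {accuracy(naive_split, EDGE):.0%}")           # prints "edge cases only: 0%"
```

The mixed score looks respectable only because the regular cases dominate; scoring the edge cases separately shows the baseline gets none of them right.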