	by diasks2 4105 days ago
	I did an analysis of different sentence segmentation tools when I was working on my own rule-based segmenter. The results can be found in this README (https://github.com/diasks2/pragmatic_segmenter). I think this blog post almost hits on the key point in the middle: in my opinion it is important to test (all of) the edge cases. The problem with most corpora typically used to test segmenters is that 80-90% of the sentences are the same (i.e. a regular sentence ending in a period). Thus a segmenter that simply split the text at every period would still show an 80-90% accuracy rate. This is why I am trying to develop a standardized set of edge cases: https://github.com/diasks2/pragmatic_segmenter#the-golden-ru...
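
	To make the baseline argument concrete, here is a minimal Python sketch (not taken from pragmatic_segmenter or any other library). The `naive_split` function, the `accuracy` helper, and the ten-item `corpus` are all hypothetical and made up for illustration: nine ordinary two-sentence inputs plus a single abbreviation case.

```python
import re


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def accuracy(splitter, cases):
    """Fraction of inputs for which the splitter output matches exactly."""
    correct = sum(1 for text, expected in cases if splitter(text) == expected)
    return correct / len(cases)


# Hypothetical data: one "regular" input repeated nine times...
regular = [("It rained. We stayed in.", ["It rained.", "We stayed in."])] * 9

# ...and one edge case with abbreviations, which is a single sentence.
edge = [(
    "I met Mr. Smith at 5 p.m. yesterday.",
    ["I met Mr. Smith at 5 p.m. yesterday."],
)]

corpus = regular + edge

overall = accuracy(naive_split, corpus)   # dominated by regular inputs
edge_only = accuracy(naive_split, edge)   # what an edge-case suite measures
print(overall, edge_only)
```

	On this made-up corpus the naive splitter scores 0.9 overall and 0.0 on the edge case alone, which is the gap a dedicated edge-case suite is meant to expose.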


Xeoncross 4105 days ago

Great work. I love the "Golden Rules" list you compiled. It seems like teams develop their NLP systems without sharing a common training set, which leaves some teams never testing cases like "a.m. / p.m.".


diasks2 4104 days ago

See my comment below for some of the reasons I've had issues trying to test on the commonly used segmentation corpora. I completely agree it would be great if there were a free (as in both speech and beer) common training set. One key would be that this common training set provide either the exact text that should be run through the segmenter or exact instructions on how to produce that text (see the issue I mentioned below about the ambiguity around how to actually test on the Brown corpus).


kylebgorman 4104 days ago

For comparability, most people use the Penn Treebank-III WSJ data. Sections 03-06 are test, the remaining sections are train/dev.

Most methods are based on some sort of simple feature templates and machine learning, so they should generalize relatively well to a wide variety of languages, IMO.


ldng 4104 days ago

Not only is there little sharing, it is also very focused on English.


edwintorok 4105 days ago

Worth taking a look at the Unicode sentence segmentation algorithm rules: http://unicode.org/reports/tr29/#Sentence_Boundaries

Also at the CLDR sentence break suppressions: http://unicode.org/cldr/trac/browser/tags/release-27-0-1/com...

If your rules handle an edge case that the above don't, it'd probably be worth suggesting improvements to the Unicode rules or the locale-specific ones.


kmike84 4105 days ago

Looks good!

Have you tried evaluating your splitter on other data, such as the "typically used corpora"? The evaluation results look too optimistic: 98% / 100% quality means you made your code work on your examples, but by using only a set of standardized tests you can't check:

* how broad the coverage is - there are other edge cases in the real world, and it may be impossible to cover them all;

* that the splitter doesn't make mistakes on real-world "regular" sentences (the 80-90% of sentences which are "the same").

The example set looks very good, and it looks like a good way to compare other sentence splitters. But it is not fair to provide evaluation metrics on the examples you used to develop your sentence splitter.
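
One way to act on this, sketched under assumptions: score predicted boundaries against gold boundaries on text that was never consulted while writing the rules. The helper names `boundary_offsets` and `prf` are hypothetical, and comparing whitespace-stripped character offsets is just one possible matching scheme, not a standard taken from any of the papers discussed here.

```python
def boundary_offsets(sentences):
    """Return the set of sentence-end positions, counted in non-whitespace
    characters so that differences in spacing do not affect matching."""
    offsets, total = set(), 0
    for sentence in sentences:
        total += len("".join(sentence.split()))
        offsets.add(total)
    return offsets


def prf(gold_sentences, predicted_sentences):
    """Boundary-level precision, recall and F1.

    Note: the end of the text always counts as a boundary here, so both
    sets share at least one offset whenever the underlying text is the same.
    """
    gold = boundary_offsets(gold_sentences)
    pred = boundary_offsets(predicted_sentences)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical held-out example: the prediction wrongly splits after "Mr."
gold = ["I met Mr. Smith.", "He left."]
predicted = ["I met Mr.", "Smith.", "He left."]
precision, recall, f1 = prf(gold, predicted)
print(precision, recall, f1)
```

Reporting these numbers separately for the development examples and for held-out text would show whether the rules generalize or merely fit the examples they were written against.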


diasks2 4104 days ago

Good points. I'd love to test it on some of the typically used corpora. The issues I have are:

1) Most segmentation research papers come from universities that have access to the Penn Treebank data (WSJ and Brown corpus). However, that data costs $1,700: https://catalog.ldc.upenn.edu/LDC99T42

2) The Brown corpus is available for free in NLTK (http://www.nltk.org/nltk_data/). However, it is the tagged corpus. I've contacted the researchers behind all of the top segmentation libraries but never received an answer to any of the following questions:

a) I’m assuming you preprocessed the text by removing the tags. Is this correct? Or did you use the untagged version, and if so do you have a link to that as I only found the tagged version in the NLTK data?

b) When removing the tags did you also remove each carriage return and newline so the text was one long string, each sentence separated by just one whitespace?

c) The download contains 100+ files. Did you analyze each individually? Or did you create one combined file? If you created a combined file how did you space each individual file within the larger file? Also, if you combined them what order did you combine them in?

So sure, all of these papers use the same data, but we have no idea if they are actually using that data in the same way, as none of the papers actually release their code and tests, or tell the steps they used to preprocess the corpus.
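
To show how much questions (a) to (c) leave open, here is one possible preprocessing, written as a rough Python sketch. It assumes the tagged files hold whitespace-separated `word/tag` tokens with one sentence per line; the function names `strip_tags` and `untag_document` are hypothetical, and nothing here is claimed to match what any paper actually did.

```python
def strip_tags(tagged_line):
    """Drop the '/tag' suffix from each whitespace-separated token.

    The split is on the LAST slash, so a token like '1/2/cd' keeps '1/2'.
    Tokens without a slash are kept unchanged.
    """
    words = []
    for token in tagged_line.split():
        word, _sep, _tag = token.rpartition("/")
        words.append(word if word else token)
    return " ".join(words)


def untag_document(tagged_text):
    """Untag every non-blank line and join the results with single spaces.

    This is the 'one long string' choice from question (b); keeping the
    line breaks instead would be an equally defensible choice.
    """
    lines = [strip_tags(line) for line in tagged_text.splitlines() if line.strip()]
    return " ".join(lines)


# Hypothetical two-sentence sample in the assumed format.
sample = "The/at jury/nn said/vbd ./.\n\n\tIt/pps rained/vbd ./.\n"
print(untag_document(sample))
```

Even this small sketch has to pick an answer to each question, and it leaves punctuation as separate tokens ("said ." rather than "said."), so detokenization is yet another unspecified step that could change a segmenter's score.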

To test broader coverage of my library, I added the full text of Alice in Wonderland https://github.com/diasks2/pragmatic_segmenter/blob/master/s.... A grad student from Stanford offered to test my library on the WSJ corpus a few months ago, which was very kind, but I'm still waiting to hear back on that.


vseloved 4103 days ago

Hi Kevin, thanks for the great comments. I wanted to share a hack with you: Penn Treebank is included as part of OntoNotes, which is free of charge :)
