| HN Mirror


	by Xeoncross 4139 days ago
	Great work. I love the "Golden Rules" list you compiled. It seems like teams develop their NLP systems without sharing a common training set, which leaves some teams never testing cases like "a.m. / p.m.".

3 comments

diasks2 4139 days ago

See my comment below for some of the reasons I've had issues trying to test against the commonly used segmentation corpora. I completely agree it would be great if there were a free (as in both speech and beer) common training set. One key would be that this common training set either provides the exact text that should be run through the segmenter or gives exact instructions on how to produce that text (see the issue I mentioned below about the ambiguity around how to actually test on the Brown corpus).
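
To make the "exact text in, exact sentences out" idea concrete, here is a minimal sketch of what a shared test set could look like. The two cases, the `naive_segment` splitter, and the `failed_cases` helper are all hypothetical illustrations, not taken from the Golden Rules list or from any existing corpus.

```python
import re

# Hypothetical cases: (exact input text, expected sentences).
GOLDEN_RULES = [
    ("Hello world. How are you?", ["Hello world.", "How are you?"]),
    (
        "The store opens at 9 a.m. on Sundays. It closes early.",
        ["The store opens at 9 a.m. on Sundays.", "It closes early."],
    ),
]


def naive_segment(text):
    """Deliberately naive: split on whitespace that follows '.', '!' or '?'."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def failed_cases(segment, cases):
    """Return (text, expected, actual) for every case the segmenter gets wrong."""
    failures = []
    for text, expected in cases:
        actual = segment(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


if __name__ == "__main__":
    for text, expected, actual in failed_cases(naive_segment, GOLDEN_RULES):
        print("FAILED:", text)
        print("  expected:", expected)
        print("  actual:  ", actual)
```

Because each case carries the exact input string, any segmenter can be dropped in as `segment` and compared on identical text, which sidesteps the question of how the test text was prepared.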

kylebgorman 4139 days ago

For comparability, most people use the Penn Treebank-III WSJ data. Sections 03-06 are test; the remaining sections are train/dev.

Most methods are based on some sort of simple feature templates and machine learning, so they should generalize relatively well to a wide variety of languages, IMO.
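
As a rough illustration of what "simple feature templates" can mean for sentence boundary detection, the sketch below turns each period-final token into a list of feature strings that a classifier could consume. The template names and the whitespace tokenization are assumptions made for this example, not the feature set of any particular published system.

```python
def boundary_features(tokens, i):
    """Instantiate a few feature templates for the candidate boundary after tokens[i]."""
    left = tokens[i]
    right = tokens[i + 1] if i + 1 < len(tokens) else "<END>"
    stem = left.rstrip(".")
    right_cap = right[:1].isupper()
    return [
        f"left={left.lower()}",
        f"right={right.lower()}",
        f"right_cap={right_cap}",
        f"left_len={min(len(stem), 5)}",  # cap the length so the feature stays coarse
        f"left_internal_period={'.' in stem}",
        f"left+right_cap={left.lower()}|{right_cap}",  # conjunction template
    ]


def candidates(text):
    """Return (token index, features) for every whitespace token ending in a period."""
    tokens = text.split()
    return [
        (i, boundary_features(tokens, i))
        for i, tok in enumerate(tokens)
        if tok.endswith(".")
    ]


if __name__ == "__main__":
    for index, feats in candidates("He left at 9 a.m. on Sunday."):
        print(index, feats)
```

None of these templates refer to English vocabulary directly; they only look at casing, length, and punctuation around the candidate, which is the sense in which such features could carry over to other languages that use similar punctuation.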

ldng 4139 days ago

Not only is there little sharing, the work is also very focused on English.
