Great work. I love the "Golden Rules" list you compiled. It seems like teams develop their NLP systems without sharing a common training set, which leaves some of them never testing cases like the "a.m. / p.m." one.
See my comment below for some of the reasons I've had issues trying to test against the commonly used segmentation corpora. I completely agree it would be great if there were a free (as in both speech and beer) common training set. One key requirement would be that it provide either the exact text to run through the segmenter or exact instructions for producing that text (see the issue I mention below about the ambiguity around how to actually test on the Brown corpus).
For comparability, most people use the Penn Treebank-III WSJ data. Sections 03-06 are the test set; the remaining sections are used for train/dev.
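To make that split concrete, here is a minimal sketch in Python. It only partitions section IDs; the function name `split_wsj_sections` is my own, and it assumes the usual two-digit WSJ section numbering (00 through 24). Loading the actual files depends on how your copy of the treebank is laid out, so that part is left out.

```python
def split_wsj_sections(section_ids, test_sections=("03", "04", "05", "06")):
    """Partition two-digit WSJ section IDs into (test, train_dev) lists.

    Hypothetical helper: it only splits IDs and does not read any files.
    """
    test = [s for s in section_ids if s in test_sections]
    train_dev = [s for s in section_ids if s not in test_sections]
    return test, train_dev


# Assumes the WSJ portion is numbered 00 through 24.
all_sections = [f"{i:02d}" for i in range(25)]
test, train_dev = split_wsj_sections(all_sections)
```

How the remaining 21 sections get divided between train and dev is not fixed by the split above, so anyone reporting numbers should state that choice too.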
Most methods are based on some sort of simple feature templates and machine learning, so they should generalize relatively well to a wide variety of languages, IMO.
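As a rough illustration of what "simple feature templates" can look like, here is a sketch of a feature extractor for one candidate boundary. The templates and the name `boundary_features` are made up for this example and are not taken from any particular system; the resulting dictionary would be fed to whatever classifier you prefer.

```python
def boundary_features(tokens, i):
    """Features for a candidate sentence boundary right after tokens[i].

    Illustrative templates only, not those of any specific published system.
    """
    left = tokens[i]
    right = tokens[i + 1] if i + 1 < len(tokens) else "<EOS>"
    return {
        "left=" + left.lower(): 1,            # identity of the token before the boundary
        "right=" + right.lower(): 1,          # identity of the token after it
        "left_len": len(left),                # short tokens are often abbreviations
        "right_is_cap": int(right[:1].isupper()),
        "left_has_inner_period": int("." in left[:-1]),  # e.g. "p.m."
    }


feats = boundary_features(["at", "5", "p.m.", "We", "left"], 2)
```

Nothing here is tied to English beyond the capitalization check, which is part of why this style of approach tends to carry over to other languages.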