by kylebgorman 4086 days ago
FWIW, basic machine learning approaches to "sentence boundary detection" (as the task is called) get 199 out of 200 of these right (without using the "two space" clue), and have for a while. (e.g., http://sonny.cslu.ohsu.edu/~gormanky/blog/simpler-sentence-b...)
1 comment

For the purpose of version control, it doesn't even have to be exact. It doesn't matter if the detector inserts an incorrect line break after a certain combination of characters, as long as it does so consistently so that it produces a readable diff.
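
To make that concrete, here is a rough sketch of the idea in Python. It is not anyone's actual tool: the splitting rule (`_BOUNDARY`) and the helper names `split_sentences` and `sentence_diff` are made up for illustration, and the rule is deliberately naive. It wrongly breaks after "Dr.", but since it does so the same way in both versions, the diff still shows only the sentence that changed.

```python
import difflib
import re

# Hypothetical fixed rule: break after ".", "!" or "?" when whitespace follows.
# Deliberately naive: it splits wrongly after abbreviations such as "Dr.".
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return one candidate sentence per list item, using the fixed rule above."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]


def sentence_diff(old, new):
    """Diff two versions of a paragraph, one candidate sentence per line."""
    return list(
        difflib.unified_diff(
            split_sentences(old),
            split_sentences(new),
            lineterm="",
            n=0,  # no context lines, so only changed sentences appear
        )
    )


old = "Dr. Smith arrived late. The meeting started at noon. Nobody minded."
new = "Dr. Smith arrived late. The meeting started at one. Nobody minded."

for line in sentence_diff(old, new):
    print(line)
```

The bad break after "Dr." appears identically on both sides, so it cancels out and never shows up in the diff. A smarter detector would only change where the line breaks fall, not whether the diff is readable.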