Hacker News new | ask | show | jobs
by plq 2671 days ago
The stochastic strategy is to 1. enumerate every possible tag combination 2. assign a probability to each one 3. choose the parse with highest probability.

1. can be done either deterministically or stochastically.

2. requires you to have a language model trained with either human-tagged or semi-human-tagged corpus

3. was just the Viterbi algorithm last time I looked.

Implementing 1 and 2 are require broad domain knowledge in two very different domains (linguistics and machine learning respectively)

So while nowadays sentence segmentation can be considered a solved problem, it's far from trivial to implement one that can compete with the state of the art against real-world data.

There is also a nice body of deterministic (rule-based) literature that is practically ignored nowadays.