by isani 2671 days ago
Japanese is usually written without spaces. Words and sentences just run into each other. When writing in hiragana (syllabic characters), word boundaries are often ambiguous.

Englishwouldbemuchhardertoparseifwrittenlikethis.

3 comments

I have no stake in natural language processing, but it looks to me like a computer might be able to do a pretty good job at splitting that given a dictionary.
Sure, you can get pretty far with a fairly simple solution. But a lot of the time, you get two (or more) ways to split the string into dictionary words. For a simple English example, is it "justice was served" or "just ice was served"?
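To make the ambiguity concrete, here is a minimal sketch of the "fairly simple solution": try every dictionary word that matches the start of the string and recurse on the rest. The tiny word set and the function name `segmentations` are made up for illustration, not taken from any real tokenizer.

```python
def segmentations(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # Keep this word and try every split of the remainder.
            for rest in segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results


# Toy dictionary, assumed for this example only.
WORDS = {"just", "ice", "justice", "was", "served"}

for split in segmentations("justicewasserved", WORDS):
    print(" ".join(split))
```

With this toy dictionary both readings come back, and nothing in the dictionary alone says which one is right. The recursion is exponential in the worst case, so a real implementation would want memoization at the very least.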
I guess that’s where context will have to be considered. Those two are valid sentences, so presumably humans are using context to distinguish between them, right?
The murderer came to my dinner party, and I had it all planned. In one of the ice cubes, I had frozen arsenic. The murderer would eat the same food, drink the same drink, and nobody would guess that they would die on leaving. When the evening was over, I knew what I would tell people.

Justicehadbeenserved.

Please, share this with the world on tweeter.
If you would like to, feel free. For myself, I think the comment's context matters: it shows how ambiguity may not be resolved merely by contextual information, and the story would not stand as strongly without it.
The stochastic strategy is to (1) enumerate every possible tag combination, (2) assign a probability to each one, and (3) choose the parse with the highest probability.

1. can be done either deterministically or stochastically.

2. requires you to have a language model trained on either a human-tagged or a semi-human-tagged corpus.

3. was just the Viterbi algorithm last time I looked.

Implementing 1 and 2 requires broad domain knowledge in two very different fields (linguistics and machine learning, respectively).

So while nowadays sentence segmentation can be considered a solved problem, it's far from trivial to implement one that can compete with the state of the art against real-world data.
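A rough sketch of how the three steps fit together, assuming the simplest possible model: independent per-word (unigram) probabilities rather than a real tagger. The probabilities below are invented for illustration; a real model would be estimated from a tagged corpus. The dynamic program is Viterbi-style in that it keeps only the best-scoring split for each prefix instead of enumerating every combination.

```python
import math

# Hypothetical unigram probabilities, made up for this example.
UNIGRAM = {
    "just": 0.004,
    "ice": 0.0005,
    "justice": 0.001,
    "was": 0.01,
    "served": 0.0008,
}


def best_segmentation(text, probs):
    """Return the most probable split of `text`, or None if no split exists."""
    # best[i] holds (log-probability, words) for the best split of text[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            prev_score, prev_words = best[start]
            if word in probs and prev_words is not None:
                score = prev_score + math.log(probs[word])
                if score > best[end][0]:
                    best[end] = (score, prev_words + [word])
    return best[-1][1]


print(best_segmentation("justicewasserved", UNIGRAM))
```

With these made-up numbers the one-word reading "justice" beats "just ice", because two rare words multiply to a smaller probability than one. A unigram model like this would pick the same split no matter what the surrounding story says, which is where the richer language model in step 2 comes in.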

There is also a nice body of deterministic (rule-based) literature that is practically ignored nowadays.

But Japanese is not written as character soup. It mixes two (actually three) types of characters, with the "grammatical" sounds being written in hiragana and most content sounds being written in kanji. Since the grammatical sounds are a closed class, and tend to occur at word boundaries, it turns out to be relatively simple to separate words.
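As a very crude illustration of that idea, the sketch below starts a new chunk whenever the text switches from hiragana to another script. The function names are invented here, the Unicode ranges cover only the basic hiragana, katakana, and common kanji blocks, and the result is phrase-like chunks (content word plus trailing particle or ending) rather than true words.

```python
def script(ch):
    """Classify a character by Unicode block (basic ranges only)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def split_on_script(text):
    """Start a new chunk each time hiragana is followed by a non-hiragana character."""
    chunks = []
    for ch in text:
        boundary = (
            bool(chunks)
            and script(chunks[-1][-1]) == "hiragana"
            and script(ch) != "hiragana"
        )
        if chunks and not boundary:
            chunks[-1] += ch
        else:
            chunks.append(ch)
    return chunks


print(split_on_script("私は学校に行きます"))
```

On this sentence the heuristic yields three chunks: 私は, 学校に, and 行きます. It falls apart on words written entirely in hiragana and on runs of several particles, so it is a starting point at best.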
Isn't that also the case when parsing other languages from speech? Are there any audible cues between words when we speak?