| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by isani 2671 days ago
	Japanese is usually written without spaces. Words and sentences just run into each other. When writing in hiragana (syllabic characters), word boundaries are often ambiguous. Englishwouldbemuchhardertoparseifwrittenlikethis.

3 comments

saagarjha 2671 days ago

I have no stake in natural language processing, but it looks to me like a computer might be able to do a pretty good job at splitting that given a dictionary.

link

isani 2671 days ago

Sure, you can get pretty far with a fairly simple solution. But lot of the time, you get two (or more) ways to split the string into dictionary words. For a simple English example, is it "justice was served" or "just ice was served"?

link

saagarjha 2671 days ago

I guess that’s where context will have to be considered. Those two are valid sentences, so presumably humans are using context to distinguish between them, right?

link

MereInterest 2671 days ago

The murderer came to my dinner party, and I had it all planned. In one of the ice cubes, I had frozen arsenic. The murderer would eat the same food, drink the same drink, and nobody would guess that they would die on leaving. When the evening was over, I knew what I would tell people.

Justicehadbeenserved.

link

woliveirajr 2671 days ago

Please, share this with the world on tweeter.

link

MereInterest 2670 days ago

If you would like to, feel free. For myself, I think that the comment's context of showing how ambiguity may not be resolved merely be contextual information is important, and that it would not stand as strongly without it.

link

plq 2671 days ago

The stochastic strategy is to 1. enumerate every possible tag combination 2. assign a probability to each one 3. choose the parse with highest probability.

1. can be done either deterministically or stochastically.

2. requires you to have a language model trained with either human-tagged or semi-human-tagged corpus

3. was just the Viterbi algorithm last time I looked.

Implementing 1 and 2 are require broad domain knowledge in two very different domains (linguistics and machine learning respectively)

So while nowadays sentence segmentation can be considered a solved problem, it's far from trivial to implement one that can compete with the state of the art against real-world data.

There is also a nice body of deterministic (rule-based) literature that is practically ignored nowadays.

link

gizmo686 2671 days ago

But Japanese is not written as character soup. It mixes two (actually 3) types of characters, with the "grammatical" sounds being written in hiragana and most content sounds being written in kanji. Since the grammatical sounds are a closed class, and tend to occur at word boundries, it turns out to be relativly simple to seperate words.

link

szatkus 2671 days ago

Isn't it a case when parsing other languages from speech? Are there any audible cues between words when we speak?

link