Hacker News
by nxb 3941 days ago
Also worth pointing out that languages which do divide words often divide them too much, so that the individual words have little to no relation to the meaning of the longer phrase. E.g. "New York", "an arm and a leg", "kick the bucket".

This over-dividing problem and the under-dividing problem you mentioned are both huge hurdles for machine text understanding and machine translation systems.
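
To make the over-dividing point concrete, here is a rough sketch of one common workaround: gluing known multi-word expressions back into single tokens with a greedy longest-match lookup. The lexicon, the function name `merge_mwes`, and the underscore joining are all made up for illustration; this is not the system described in this comment.

```python
# Hypothetical lexicon of multi-word expressions (illustrative only).
MWE_LEXICON = {
    ("new", "york"),
    ("kick", "the", "bucket"),
    ("arm", "and", "a", "leg"),
}
MAX_LEN = max(len(entry) for entry in MWE_LEXICON)


def merge_mwes(tokens, lexicon=MWE_LEXICON, max_len=MAX_LEN):
    """Greedily merge the longest lexicon match starting at each position."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in lexicon:
                merged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multi-word match here: keep the single token as-is.
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_mwes("It cost an arm and a leg in New York".split()))
```

A plain lookup like this has an obvious limit: it would also merge a literal "kick the bucket" (someone actually kicking a bucket), since it has no way to tell the idiomatic reading from the compositional one.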

On the information-extraction system I'm working on now, roughly 80% of the entities we're trying to extract are multi-word expressions. Very difficult.

[1] https://en.wikipedia.org/wiki/Multiword_expression

[2] http://aclweb.org/aclwiki/index.php?title=Multiword_Expressi...

[3] http://lingo.stanford.edu/pubs/WP-2001-03.pdf

2 comments

Is it actually described as "over-dividing" in an academic sense? Those individual words do have meaning, but they have later been recombined into forms that have new, sometimes orthogonal meanings. I can see the argument for mashing them back together in that case, but "over-divided" seems a strange way to look at it.
A related idea is linguistic "compositionality":

https://en.wikipedia.org/wiki/Principle_of_compositionality

I don't have hard numbers, but I know from experience that a large share of multiword expressions are non-compositional (the meaning of the larger phrase can't be inferred from the meanings of its constituents), so in that case thinking of them as "over-divided" makes sense to me.

In linguistics, we call such phrases "idioms".

https://en.wikipedia.org/wiki/Idiom

Cool, I've also done a lot of thinking/work related to multiword expressions. A couple years back I did word segmentation on Khmer too. I'd be curious to hear more about your work!