Hacker News new | ask | show | jobs
by afsina 2919 days ago
Good luck with Turkish. A very large lookup table may work for %90 cases but it will still fail. A basic `true` lemmatizer requires either a complex graph with rules, or an FST generated from it. input is searched through the graph and morphological disambiguator must be applied to result to pick the correct lemma.
1 comments

Or Welsh. You want a canonical form for "wnaethpwyd"? Try "gwneud"! Some regexes and an exceptions list isn't going to cut it.

The more a language needs a lemmatizer for NLP the harder it is to write it.

Could someone explain what's hard in particular about canonical forms in Welsh?