by int_19h 341 days ago
The complexity of these rules, plus the number of exceptions you have to memorize on top of them, can be roughly estimated for any given language: train a language model on word <-> IPA correspondences for that language, using a subset of the vocabulary as the training set, and then see how well it predicts the remaining words. You can run it in either direction, too, to separately measure the difficulty of reading (word -> IPA) and writing (IPA -> word) that language.
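
To make the procedure concrete, here is a rough sketch of the hold-out evaluation in Python. It is not the method from any particular paper: the "model" is a deliberately crude stand-in (a per-symbol lookup table learned from equal-length pairs) where a real experiment would use a proper sequence-to-sequence model, and every name in it (`train_char_map`, `held_out_accuracy`, `reading_and_writing_scores`, the toy lexicon) is made up for illustration.

```python
import random
from collections import Counter, defaultdict


def train_char_map(pairs):
    """Toy stand-in for a real model (assumption, not a real G2P system):
    learn the most frequent target symbol for each source symbol,
    using only pairs whose two sides have the same length."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        if len(src) != len(tgt):
            continue  # this crude model cannot align unequal lengths
        for s, t in zip(src, tgt):
            counts[s][t] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def predict(model, src):
    """Map each source symbol independently; unseen symbols become '?'."""
    return "".join(model.get(ch, "?") for ch in src)


def held_out_accuracy(pairs, train_fraction=0.8, seed=0):
    """Train on a random subset, return exact-match accuracy on the rest."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    if not test:
        raise ValueError("lexicon too small to hold out any test words")
    model = train_char_map(train)
    correct = sum(predict(model, src) == tgt for src, tgt in test)
    return correct / len(test)


def reading_and_writing_scores(lexicon, **kwargs):
    """lexicon is a list of (word, ipa) pairs.
    Reading = word -> IPA, writing = IPA -> word."""
    reading = held_out_accuracy(lexicon, **kwargs)
    writing = held_out_accuracy([(ipa, word) for word, ipa in lexicon], **kwargs)
    return reading, writing


if __name__ == "__main__":
    # Hypothetical, perfectly phonemic toy language: one letter, one sound.
    letter_to_ipa = {"c": "k", "a": "a", "x": "ʃ", "u": "u"}
    letters = list(letter_to_ipa)
    words = [a + b + c for a in letters for b in letters for c in letters]
    lexicon = [(w, "".join(letter_to_ipa[ch] for ch in w)) for w in words]
    print(reading_and_writing_scores(lexicon))
```

On the toy language both scores come out perfect, since every letter maps to exactly one sound and back. Feed it a lexicon where the same letter stands for several sounds (or the same sound is spelled several ways) and the corresponding direction's score drops, which is the asymmetry between reading and writing described above.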

This has actually been done for a number of languages, including English:

https://arxiv.org/abs/1912.13321

You can see how languages with true phonemic spellings tend to be in the >90% range on both reading and writing, with Esperanto at 99%. Spanish and German are in the 60-80% range. English is dismal at ~30% for both, though: only French and Chinese are harder to write, and every other language tested is easier to read.

1 comment

Nice!