| Yup. You HN lads are smart, you're pretty quick to figure out all the 'next problems' that one would encounter. Yes - getting the right training data can be surprisingly hard. Do you know how hard it is to get a 'very official' large set of words for a given language? It's hard! There is no entity that really decides what a language is, so you have to more or less infer it from what people write. But that takes a lot of writing, and frankly, you're making assumptions the whole way. France has a body that's 'in charge' of its language, so to speak, and most Western nations have entities that are 'roughly' that. Beyond the West, Japan and China ... it's a gong show. 'Filipino' is barely a language - even though many millions of people speak it, it varies in dialect from village to village, and the dialects barely resemble each other. I think that someone will eventually come up with a 'probabilistic' OS because in the real world, nothing is certain ... some things are just more likely than others! |
The French Academy publishes an official dictionary and rules on usage, but speakers hardly restrict themselves to what it sanctions.
Filipino mostly refers to the Manila dialect of Tagalog, whereas Tagalog itself is a language with many dialects spoken across the Philippines. There are lots of other languages in the Philippines, but as far as I know they aren't referred to as Filipino.
For a lot of NLP problems you will probably have to make your own data set. It can be a lot of work.
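To make "determine it from what people write" concrete, here is a rough sketch of deriving a word list from your own corpus by counting tokens. Everything in it is an assumption for illustration: the function name `build_word_list`, the `min_count` cutoff, and the crude regex tokenizer, which only works for languages that put spaces between words (so not Japanese or Chinese as-is).

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of Unicode letters, optionally joined by
# apostrophes (so "cat's" or "l'école" stay in one piece). An assumption,
# not a standard; real corpora usually need a language-specific tokenizer.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def build_word_list(documents, min_count=2):
    """Return (word, count) pairs seen at least `min_count` times,
    most frequent first."""
    counts = Counter()
    for text in documents:
        counts.update(WORD_RE.findall(text.lower()))
    # The cutoff is the "making assumptions" part: rare tokens may be typos,
    # names, or perfectly good words the corpus just happens to lack.
    return [(w, c) for w, c in counts.most_common() if c >= min_count]

if __name__ == "__main__":
    corpus = ["The cat sat.", "the cat's hat", "A dog"]
    print(build_word_list(corpus))
```

Where you set `min_count` is a judgment call: too low and typos get in, too high and real but rare words drop out, which is pretty much the "some things are just more likely than others" point.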