Hacker News
by edblarney 3507 days ago
Yup.

You HN lads are smart; you're pretty quick to figure out all the 'next problems' that one would encounter.

Yes - getting the right training data can be surprisingly hard.

Did you know how hard it is to get a 'very official' large set of words for a given language? It's hard!

There is no entity that really decides what a language is, so you have to kind of determine it from what people write. But that takes a lot of writing, and frankly, you're making assumptions all the time.
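
To make 'determine it from what people write' concrete, here's a minimal sketch in Python: count word forms across a corpus and keep the ones that show up often enough. The function name `build_word_list`, the regex tokenizer and the `min_count` cutoff are all made up for illustration, and each one is exactly the kind of assumption I'm talking about.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, optionally joined by apostrophes.
# This is an assumption; it will not suit every language or script.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def build_word_list(texts, min_count=2):
    """Return word forms seen at least `min_count` times, most frequent first."""
    counts = Counter()
    for text in texts:
        # Lowercasing is another assumption: it merges "The" and "the".
        counts.update(TOKEN_RE.findall(text.lower()))
    return [word for word, n in counts.most_common() if n >= min_count]

# Toy corpus, invented for this example.
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "A cat and a dog don't mix.",
]
print(build_word_list(corpus))
```

Change `min_count` or the tokenizer and you get a different 'language', which is the whole problem.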

France has a body that's 'in charge' of their language so to speak, and most Western nations have entities that are 'roughly' that. Beyond the West, Japan and China ... it's a gong show.

'Filipino' is barely a language - even though many millions of people speak it, it varies from village to village, and the dialects barely resemble each other.

I think that someone will eventually come up with a 'probabilistic' OS because in the real world, nothing is certain ... some things are just more likely than others!

1 comment

An official set of words is only useful if your NLP task is limited to text that itself sticks to restricted, formal language. Twitter and SMS data sets are interesting because they represent something closer to casual speech than to formal writing.
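
One rough way to see this is to measure how much of a casual-text sample falls outside the official list. A minimal sketch in Python, assuming whitespace-style tokens; the word list, the sample messages and the `oov_rate` helper are invented for illustration and are not real data.

```python
import re

def oov_rate(texts, official_words):
    """Fraction of tokens in `texts` that are not in `official_words`."""
    # \w+ keeps forms like "l8r" as a single token.
    tokens = [t for text in texts for t in re.findall(r"\w+", text.lower())]
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in official_words)
    return missing / len(tokens)

# Toy "official" list and samples, made up for this example.
official = {"i", "am", "going", "to", "the", "store", "see", "you", "later"}
formal = ["I am going to the store."]
casual = ["omg gonna hit the store lol", "c u l8r"]

print(oov_rate(formal, official))
print(oov_rate(casual, official))
```

On real data the gap will depend heavily on the tokenizer and on how the official list handles inflected forms, so treat the number as a rough signal only.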

The French Academy (Académie française) publishes an official dictionary and rulings on usage, but speakers hardly restrict themselves to its contents.

Filipino mostly refers to the Manila dialect of Tagalog, whereas Tagalog is a language with many dialects spoken in the Philippines. There are lots of languages in the Philippines, but as far as I know they aren't referred to as Filipino.

For a lot of NLP problems you will probably have to make your own data set. It can be a lot of work.