by kleigenfreude 3506 days ago
> 1) It's 'hard' because you need a lot of 'training data' in order to train models etc.. It's hard to get.

Also, I'd imagine that the data could be bad/incomplete, e.g. data was collected in an inconsistent manner or in the wrong areas, leading to an incorrect solution that fits the data, but doesn't solve the problem.

This is the biggest concern I have in using the data that we've collected to come up with a solution using ML: no one ever intended for the data to be used for the purpose for which I would use it, and it is incomplete or incorrect.

However, I think the chance of something good coming from inadequate data outweighs the cost of not trying to use the data at all.

1 comments

Yup.

You HN lads are smart, you're pretty quick to figure out all the 'next problems' that one would encounter.

Yes - getting the right training data can be surprisingly hard.

Did you know how hard it is to get a 'very official' large set of words for a given language? It's hard!

There is no entity that really decides what language is - so you have to kind of determine it from what people write. But that takes a lot of writing, and frankly, you're making assumptions all the time there.
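To make that concrete, here is a rough Python sketch of deriving a word list from whatever people have written. The function name `build_vocab`, the `min_count` cutoff, and the tokenizing regex are all illustrative assumptions, and each one is exactly the kind of judgment call described above.

```python
import re
from collections import Counter

def build_vocab(texts, min_count=2):
    """Return {word: count} for words seen at least min_count times."""
    counts = Counter()
    for text in texts:
        # Assumption: lowercase everything and treat runs of letters
        # (optionally joined by apostrophes) as words. Digits are dropped,
        # so "gr8" is counted as "gr".
        counts.update(re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower()))
    # Assumption: a word that appears fewer than min_count times is noise.
    return {word: n for word, n in counts.items() if n >= min_count}

corpus = [
    "The cat sat on the mat.",
    "the dog sat on the log",
    "lol the cat is gr8",
]
vocab = build_vocab(corpus, min_count=2)
print(vocab)
```

With a corpus this small, raising or lowering `min_count` changes the "language" you end up with, which is why it takes a lot of writing before the result means much.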

France has a body that's 'in charge' of their language so to speak, and most Western nations have entities that are 'roughly' that. Beyond the West, Japan and China ... it's a gong show.

'Filipino' is barely a language - even though many millions of people speak it, it varies in dialect from village to village, and the dialects barely resemble each other.

I think that someone will eventually come up with a 'probabilistic' OS because in the real world, nothing is certain ... some things are just more likely than others!

An official set of words is only useful if your NLP task is limited to text that itself sticks to a restricted vocabulary. Twitter and SMS data sets are interesting because they represent something closer to casual speech than to formal writing.
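One way to check this on your own data is to measure how many tokens fall outside the official list. This is a minimal sketch, not a real evaluation: `oov_rate` is a made-up helper, and the tiny word list and sample messages are invented for illustration.

```python
def oov_rate(tokens, official_words):
    """Fraction of tokens that are not in the official word list."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t.lower() not in official_words)
    return missing / len(tokens)

# Hypothetical "official" vocabulary and two invented messages.
official = {"i", "will", "see", "you", "later", "tonight"}
formal = "I will see you later tonight".split()
casual = "c u l8r 2nite lol".split()

formal_rate = oov_rate(formal, official)
casual_rate = oov_rate(casual, official)
print(formal_rate, casual_rate)
```

If the rate on your real input is high, the official list is not describing the language your task actually sees.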

The French Academy publishes an official dictionary and rules on usage, but speakers hardly restrict themselves to its contents.

Filipino mostly refers to the Manila dialect of Tagalog, whereas Tagalog is a language with many dialects existing in the Philippines. There are lots of languages in the Philippines but as far as I know they aren't referred to as Filipino.

For a lot of NLP problems you will probably have to make your own data set. It can be a lot of work.