by kleigenfreude
3506 days ago
> 1) It's 'hard' because you need a lot of 'training data' in order to train models, etc. It's hard to get. Also, I'd imagine that the data could be bad or incomplete, e.g. collected in an inconsistent manner or in the wrong areas, leading to an incorrect solution that fits the data but doesn't solve the problem. This is the biggest concern I have in using the data that we've collected to come up with a solution using ML: no one ever intended for the data to be used for the purpose for which I would use it, and it is incomplete or incorrect. However, I think the chance of good things coming from inadequate data outweighs not trying to make use of the data.
You HN lads are smart; you're pretty quick to figure out all the 'next problems' that one would encounter.
Yes - getting the right training data can be surprisingly hard.
Did you know how hard it is to get a 'very official' large set of words for a given language? It's hard!
There is no entity that really decides what a language is, so you have to kind of determine it from what people write. But that takes a lot of writing, and frankly, you're making assumptions the whole way through.
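A minimal sketch of that "determine it from what people write" idea, assuming you already have a pile of text for the language. The function name `build_wordlist`, the `min_count` cutoff, and the crude tokenizer are all illustrative assumptions, and each one is exactly the kind of guess mentioned above.

```python
import re
from collections import Counter


def build_wordlist(documents, min_count=2):
    """Return (word, count) pairs seen at least `min_count` times,
    most frequent first.

    Assumption: a "word" is any lowercase run of letters. This splits
    contractions such as "don't" and does not work for languages
    written without spaces between words.
    """
    counts = Counter()
    for text in documents:
        # [^\W\d_]+ matches runs of Unicode letters only
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    # The cutoff is a guess: too low keeps typos, too high drops rare words
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    sample = ["The cat sat.", "The cat ran", "A dog"]
    for word, n in build_wordlist(sample, min_count=2):
        print(word, n)
```

Even in this toy version the list you get depends entirely on which writing you feed it and where you put the cutoff, which is why the result never feels 'very official'.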
France has a body that's 'in charge' of its language, so to speak (the Académie française), and most Western nations have entities that are 'roughly' that. Beyond the West, Japan and China ... it's a gong show.
'Filipino' is barely a language: even though many millions of people speak it, it varies in dialect from village to village, and the dialects barely resemble each other.
I think that someone will eventually come up with a 'probabilistic' OS because in the real world, nothing is certain ... some things are just more likely than others!