Hacker News
by darawk 2793 days ago
> More data is necessary with current technology, in the sense that modern statistical machine learning algorithms are very bad at generalising to unseen data, and the only way to overcome this is to give them more examples.

Precisely.

> There are machine learning techniques that generalise well from few data, but they are not very well known in the industry.

Sure, and we'd all love to be using those. But even if you generalize well from small datasets, you still generalize better from larger ones.

> Also, though more speculatively, I think the idea of "lots of data" is attractive to marketing departments. There's something about algorithms that need huge amounts of data and compute, that only a select few companies can use. I guess it gives bragging rights, of a sort: "we got the biggest data around. Buy our stuff!".

It may be attractive to marketing departments, but it is also essential to data science projects like this.


>> Sure, and we'd all love to be using those. But even if you generalize well from small datasets, you still generalize better from larger ones.

Not to my knowledge. What techniques did you have in mind that work like that?

Literally all of them? Linear regression, neural networks, KNN. I could just enumerate all ML methods here, but I think the foregoing is sufficient.

I'm sorry, I don't understand. Which of the above generalises well from small datasets?

Who said they do? I said they generalize better from larger datasets. The entire point of this discussion is that more data is better.

I was referring to this part of our exchange:

ME: There are machine learning techniques that generalise well from few data, but they are not very well known in the industry.

YOU: Sure, and we'd all love to be using those. But even if you generalize well from small datasets, you still generalize better from larger ones.

That is not how those techniques work to my knowledge, so I was asking which ones you had in mind.

My point is that they all generalize better from larger datasets. Size is relative, and some techniques need more data than others. Linear regression, for instance, can work quite well with much less data than a neural net. It just depends on the complexity of the problem.
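
To make the "more data helps" claim concrete, here is a rough sketch of a learning curve for ordinary least squares. Everything in it is an assumption for illustration: the data is synthetic (a random linear target plus Gaussian noise), and the function name `mse_by_train_size` and its parameters are made up for this example. It only shows the trend for one simple model on one toy problem, not for ML methods in general.

```python
import numpy as np


def mse_by_train_size(sizes, n_features=5, noise=1.0, n_test=2000,
                      n_trials=200, seed=0):
    """Mean held-out MSE of least-squares linear regression per training size.

    Hypothetical helper: synthetic linear data, Gaussian features and noise.
    """
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=n_features)

    # One fixed held-out set, shared by every training size and trial.
    X_test = rng.normal(size=(n_test, n_features))
    y_test = X_test @ true_w + noise * rng.normal(size=n_test)

    results = {}
    for n in sizes:
        errs = []
        for _ in range(n_trials):
            # Draw a fresh training set of size n from the same distribution.
            X = rng.normal(size=(n, n_features))
            y = X @ true_w + noise * rng.normal(size=n)
            # Fit by least squares (no intercept, since the target has none).
            w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
            errs.append(np.mean((X_test @ w_hat - y_test) ** 2))
        results[n] = float(np.mean(errs))
    return results


if __name__ == "__main__":
    for n, mse in mse_by_train_size([10, 30, 100, 300, 1000]).items():
        print(f"n_train={n:5d}  mean test MSE={mse:.3f}")
```

On this toy problem the held-out error should fall toward the noise floor as the training set grows, and it gets most of the way there with a few dozen examples because the model is simple. A more flexible model on a harder problem would typically need far more data to flatten out, which is the "depends on the complexity of the problem" part.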