Hacker News
by darawk 2794 days ago
> Another instance of the "more data is better" fallacy.

More data is better. The fact that humans have better algorithms that need less data does absolutely nothing to negate this.

Depends on whether it’s the right data, or meaningful data. The idea that Lyft needs this acquisition in order to implement video capture from its drivers’ cell phones is laughable, so something else is going on with this acquisition. But to your point, the assumption that this is even a problem of “not enough data” is questionable at this point. How to turn that data into results is something no one has come close to figuring out yet.
> Depends on whether it’s the right data, or meaningful data.

Street-level mapping data isn't relevant or meaningful? Basically every company working on this problem seems to pretty strongly disagree with you.

> But to your point, the assumption that this is even a problem of “not enough data” is questionable at this point. How to turn that data into results is something no one has come close to figuring out yet.

This is trivially false. Given infinite data, all possible situations would be represented in the data, and the solutions applied in those situations could be copied exactly, something that existing algorithms are completely capable of doing.
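
To make the "copy the solution exactly" argument concrete, here is a toy sketch, not anything from the thread itself. It assumes a finite world of 1000 possible situations and a made-up `true_policy` function standing in for the correct action; both are hypothetical. A plain lookup table then answers correctly for exactly the situations it has already seen, so its accuracy rises toward 1.0 as the data covers more of the space.

```python
import random


def true_policy(situation: int) -> int:
    # Hypothetical ground truth: the "correct action" for each situation.
    return (situation * 7 + 3) % 5


def memorize(samples):
    # The "algorithm" is just a lookup table of observed (situation, action) pairs.
    return dict(samples)


def coverage_accuracy(n_samples: int, n_situations: int = 1000, seed: int = 0) -> float:
    """Fraction of all situations the lookup table answers correctly."""
    rng = random.Random(seed)
    situations = [rng.randrange(n_situations) for _ in range(n_samples)]
    table = memorize((s, true_policy(s)) for s in situations)
    # Unseen situations return None from .get(), which never equals a real action.
    correct = sum(1 for s in range(n_situations) if table.get(s) == true_policy(s))
    return correct / n_situations


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, coverage_accuracy(n))
```

The sketch only works because the situation space is finite and small. With an unbounded or continuous space, the table never fills up, which is where the time and storage objection comes in.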

>> This is trivially false. Given infinite data, all possible situations would be represented in the data, and the solutions applied in those situations could be copied exactly, something that existing algorithms are completely capable of doing.

In principle. In practice, you'd need infinite time and infinite storage.

Btw, do you have to add stuff like "This is trivially false" to your comments? It doesn't make your comments sound more right, only less well considered.

> In principle. In practice, you'd need infinite time and infinite storage.

That is irrelevant.

> Btw, do you have to add stuff like "This is trivially false" to your comments? It doesn't make your comments sound more right, only less well considered.

Trivial in the mathematical sense. As in, there is a trivial counter-example to your point. Citing infinity is a 'trivial' case. I'm using 'trivial' to describe my counter-example, not his error.

If I may be frank: don't use language in the mathematical sense if you are not in a maths classroom or you don't know maths.
Given infinite data, infinite storage and infinite computing you would be right. In practice it means you are wrong. Feeding more data does not necessarily help given a finite amount of computing power.
More data is necessary with current technology, in the sense that modern statistical machine learning algorithms are very bad at generalising to unseen data, and the only way to overcome this is to give them more examples.

There are machine learning techniques that generalise well from few data, but they are not very well known in the industry.

Also, though more speculatively, I think the idea of "lots of data" is attractive to marketing departments. There's something about algorithms that need huge amounts of data and compute, that only a select few companies can use. I guess it gives bragging rights, of a sort: "we got the biggest data around. Buy our stuff!".

But like I say, that's speculation on my part.

> More data is necessary with current technology, in the sense that modern statistical machine learning algorithms are very bad at generalising to unseen data, and the only way to overcome this is to give them more examples.

Precisely.

> There are machine learning techniques that generalise well from few data, but they are not very well known in the industry.

Sure, and we'd all love to be using those. But even if you generalize well from small datasets, you still generalize better from larger ones.

> Also, though more speculatively, I think the idea of "lots of data" is attractive to marketing departments. There's something about algorithms that need huge amounts of data and compute, that only a select few companies can use. I guess it gives bragging rights, of a sort: "we got the biggest data around. Buy our stuff!".

It may be attractive to marketing departments, but it is also essential to data science projects like this.

>> Sure, and we'd all love to be using those. But even if you generalize well from small datasets, you still generalize better from larger ones.

Not to my knowledge. What techniques did you have in mind that work like that?

Literally all of them? Linear regression, neural networks, KNN, I could just enumerate all ML methods here, but I think the foregoing is sufficient.
I'm sorry, I don't understand. Which of the above generalises well from small datasets?
Who said they do? I said they generalize better from larger datasets. The entire point of this discussion is that more data is better.
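
The "generalize better from larger datasets" claim can be checked on a toy problem. The sketch below is an illustration under stated assumptions, not a result from the thread: the data comes from a made-up line y = 2x + 1 with Gaussian noise, the model is ordinary least squares in one variable, and the names `fit_line` and `mean_test_error` are hypothetical helpers. It measures how far the fitted line is from the true one, averaged over many random training sets of a given size.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_test_error(n_train: int, trials: int = 200, seed: int = 0) -> float:
    """Average squared gap between the fitted line and the true line."""
    rng = random.Random(seed)
    grid = [i / 100 for i in range(101)]  # evaluation points in [0, 1]
    total = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n_train)]
        # Assumed data-generating process: y = 2x + 1 plus unit-variance noise.
        ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
        slope, intercept = fit_line(xs, ys)
        total += sum((slope * x + intercept - (2.0 * x + 1.0)) ** 2 for x in grid) / len(grid)
    return total / trials


if __name__ == "__main__":
    for n in (5, 50, 500):
        print(n, mean_test_error(n))
```

In this setup the error shrinks as the training set grows, because the training data is drawn from the same distribution the model is evaluated on. It says nothing about data that is biased or irrelevant to the task, which is the other side of the argument here.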
A small amount of the right data is better than lots of the wrong data. Collecting a lot of some data because it's easy to collect isn't very helpful if it turns out to be the wrong data.

It would likely be more informative to instrument a few cars with some advanced sensor package and let well-ranked drivers drive them around than to try to gather data from smartphones in existing cars, but I suppose it depends on what the end use is.

This is mapping data, not driving data. You need both.
More data is not always better. It can be, for sure, but you need to have the analytical capabilities to turn it into useful information. Otherwise it's just hoarding.
They have those capabilities.