| HN Mirror



	by darawk 2794 days ago
	> Another instance of the "more data is better" fallacy.

	More data is better. The fact that humans have better algorithms that need less data does absolutely nothing to negate this.

4 comments

skywhopper 2794 days ago

Depends on whether it’s the right data, or meaningful data. The idea that Lyft needs this acquisition in order to implement video capture from its drivers’ cell phones is laughable, so something else is going on with this acquisition. But to your point, the assumption that this is even a problem of “not enough data” is questionable at this point. How to turn that data into results is something no one has come close to figuring out yet.

darawk 2794 days ago

> Depends on whether it’s the right data, or meaningful data.

Street level mapping data isn't relevant or meaningful? Basically every company working on this problem seems to pretty strongly disagree with you.

> But to your point, the assumption that this is even a problem of “not enough data” is questionable at this point. How to turn that data into results is something no one has come close to figuring out yet.

This is trivially false. Given infinite data, all possible situations would be represented in the data, and the solutions applied in those situations could be copied exactly, something that existing algorithms are completely capable of doing.
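
The "copied exactly" idea amounts to a lookup table. A minimal Python sketch, assuming a toy world where every situation is a hashable tuple; the names `memorise` and `act` and the example situations are hypothetical, for illustration only:

```python
def memorise(examples):
    """Build a lookup table from (situation, action) pairs."""
    return dict(examples)


def act(table, situation):
    """Copy the recorded action exactly; return None for an unseen situation."""
    return table.get(situation)


# Hypothetical toy data: (light colour, obstacle ahead) -> action
examples = [
    (("red", False), "stop"),
    (("red", True), "stop"),
    (("green", False), "go"),
]
table = memorise(examples)

print(act(table, ("green", False)))  # situation present in the data
print(act(table, ("green", True)))   # situation absent from the data
```

The table only answers for situations that were recorded, so both its coverage and its size grow with the number of distinct situations in the data.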

YeGoblynQueenne 2793 days ago

>> This is trivially false. Given infinite data, all possible situations would be represented in the data, and the solutions applied in those situations could be copied exactly, something that existing algorithms are completely capable of doing.

In principle. In practice, you'd need infinite time and infinite storage.

Btw, do you have to add stuff like "This is trivially false" to your comments? It doesn't make your comments sound more right, only less well considered.

darawk 2793 days ago

> In principle. In practice, you'd need infinite time and infinite storage.

That is irrelevant.

> Btw, do you have to add stuff like "This is trivially false" to your comments? It doesn't make your comments sound more right, only less well considered.

Trivial in the mathematical sense. As in, there is a trivial counter-example to your point. Citing infinity is a 'trivial' case. I'm using 'trivial' to describe my counter-example, not his error.

YeGoblynQueenne 2793 days ago

If I may be frank: don't use language in the mathematical sense if you are not in a maths classroom or you don't know maths.

mijamo 2793 days ago

Given infinite data, infinite storage and infinite computing you would be right. In practice it means you are wrong. Feeding more data does not necessarily help given a finite amount of computing power.

YeGoblynQueenne 2793 days ago

More data is necessary with current technology, in the sense that modern statistical machine learning algorithms are very bad at generalising to unseen data, and the only way to overcome this is to give them more examples.

There are machine learning techniques that generalise well from few data, but they are not very well known in the industry.

Also, though more speculatively, I think the idea of "lots of data" is attractive to marketing departments. There's something about algorithms that need huge amounts of data and compute, that only a select few companies can use. I guess it gives bragging rights, of a sort: "we got the biggest data around. Buy our stuff!".

But like I say, that's speculation on my part.

darawk 2793 days ago

> More data is necessary with current technology, in the sense that modern statistical machine learning algorithms are very bad at generalising to unseen data, and the only way to overcome this is to give them more examples.

Precisely.

> There are machine learning techniques that generalise well from few data, but they are not very well known in the industry.

Sure, and we'd all love to be using those. But even if you generalize well from small datasets, you still generalize better from larger ones.

> Also, though more speculatively, I think the idea of "lots of data" is attractive to marketing departments. There's something about algorithms that need huge amounts of data and compute, that only a select few companies can use. I guess it gives bragging rights, of a sort: "we got the biggest data around. Buy our stuff!".

It may be attractive to marketing departments, but it is also essential to data science projects like this.

YeGoblynQueenne 2793 days ago

>> Sure, and we'd all love to be using those. But even if you generalize well from small datasets, you still generalize better from larger ones.

Not to my knowledge. What techniques did you have in mind that work like that?

darawk 2792 days ago

Literally all of them? Linear regression, neural networks, KNN, I could just enumerate all ML methods here, but I think the foregoing is sufficient.

YeGoblynQueenne 2792 days ago

I'm sorry, I don't understand. Which of the above generalises well from small datasets?

darawk 2792 days ago

Who said they do? I said they generalize better from larger datasets. The entire point of this discussion is that more data is better.
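
For KNN at least, the claim is easy to check on a toy problem. A rough sketch, assuming noise-free samples of sin(x) on an evenly spaced grid and a 1-nearest-neighbour predictor; the function names and the sample sizes are hypothetical choices for illustration, and real data with noise or irrelevant features may behave differently:

```python
import math


def nearest_neighbour_predict(train_x, train_y, x):
    """Return the target of the training point closest to x (1-NN)."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]


def mean_abs_error(n_train, n_test=1000):
    """Mean absolute error of 1-NN on sin(x) using n_train training samples."""
    # Assumption: noise-free training samples on an even grid over [0, 2*pi]
    train_x = [2 * math.pi * i / (n_train - 1) for i in range(n_train)]
    train_y = [math.sin(x) for x in train_x]
    # Held-out test points, offset so they rarely coincide with training points
    test_x = [2 * math.pi * (j + 0.5) / n_test for j in range(n_test)]
    errors = [
        abs(nearest_neighbour_predict(train_x, train_y, x) - math.sin(x))
        for x in test_x
    ]
    return sum(errors) / len(errors)


# Print the held-out error for increasing training set sizes
for n in (5, 50, 500):
    print(n, round(mean_abs_error(n), 4))
```

In this setup the held-out error shrinks as the training grid gets denser, because every test point ends up closer to some training point.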

toast0 2794 days ago

A small amount of the right data is better than lots of the wrong data. Collecting a lot of some data because it's easy to collect isn't very helpful if it turns out to be the wrong data.

It would likely be more informative to instrument a few cars with some advanced sensor package and let well ranked drivers drive them around than to try to gather data from smartphones in existing cars, but I suppose it depends on what the end use is.

darawk 2794 days ago

This is mapping data, not driving data. You need both.

agumonkey 2793 days ago

More data is not always better. It can be, for sure, but you need to have the analytical capabilities to turn it into useful information. Otherwise it's just hoarding.

darawk 2793 days ago

They have those capabilities.