by renesd 3455 days ago
This used to be true. Modern machine learning needs way less data. The classic example is taking images and then transforming them in hundreds of ways (scaling, rotation, skew, etc) for training.
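
For anyone unfamiliar with the idea, here is a rough sketch of that kind of augmentation. It uses plain Python on a tiny made-up 2x2 "image" so it runs without any libraries; the function names are illustrative and not taken from any real framework, and real pipelines would use arbitrary angles and interpolation rather than these crude versions.

```python
# Minimal data-augmentation sketch on a tiny grayscale "image" (a list of rows).
# All function names here are hypothetical, chosen for illustration only.

def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def scale_nearest(img, factor):
    """Upscale by an integer factor using nearest-neighbour sampling."""
    return [
        [px for px in row for _ in range(factor)]
        for row in img
        for _ in range(factor)
    ]

def skew_rows(img, fill=0):
    """Shear horizontally: shift row i right by i pixels, padding with `fill`."""
    height = len(img)
    return [
        [fill] * i + list(row) + [fill] * (height - 1 - i)
        for i, row in enumerate(img)
    ]

def augment(img):
    """Return a list of transformed variants of one labelled example."""
    variants = [img]
    rotated = img
    for _ in range(3):  # 90, 180 and 270 degree rotations
        rotated = rotate90(rotated)
        variants.append(rotated)
    # Mirror the original and each rotation.
    variants += [flip_horizontal(v) for v in list(variants)]
    variants.append(scale_nearest(img, 2))
    variants.append(skew_rows(img))
    return variants

image = [[1, 2],
         [3, 4]]
augmented = augment(image)
print(len(augmented))  # 10 variants from a single example
```

Each variant keeps the original label, so one labelled image becomes several training examples. On a symmetric image some variants would be duplicates, so the count is an upper bound on distinct examples.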

Big data is nowhere near as much of a competitive advantage as it was three years ago. It seems not everyone outside the field has noticed that, though.

3 comments

I wonder if this would be true in the case of self-driving car algorithms, though (which I know nothing about). It always seemed like the hard part about self-driving cars was the 0.1% edge cases where something out of the ordinary could result in a catastrophe if not handled correctly.

Image classification seems like it would be very different: there, 99.9% "correct" would be a great achievement, but for self-driving cars a 0.1% failure rate would be completely unacceptable.

We are 30 to 50 years away from a level 5 car.
Do you have more examples of "big data is not as much a competitive advantage", in the form of articles or research? I'm not in this field, but it's a fascinating development. It would be interesting to see to what degree automated transformations help increase the value of each piece of training data.
Lul thinking machine learning is only image classification
Thank you for the down-votes. I guess all the experts on HN know how easy it is to simulate training data because they all took the 101 course on how to rotate/resample images. That is uniquely an image classification technique.

Please, oh wise ones, how do we simulate NLP data, numeric data, finance data, biological data, and anything else machine learning is used for?

Oh, you are able to classify dogs and cats in images after a 2-hour YouTube video. How nice.

Both you and renesd are correct.

Renesd is correct that "big data" is overblown. There are diminishing marginal returns - you need orders of magnitude more data for the same incremental gain (and this blows up well beyond however many millions of cars Tesla can hope to run).
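
To put rough numbers on the diminishing returns, here is a back-of-the-envelope sketch. It assumes test error falls as a power law in dataset size, error = c * n^(-alpha). That functional form and the exponent values below are assumptions picked for illustration, not measurements from any real system.

```python
# Sketch: how much more data is needed to halve the error, ASSUMING
# error(n) = c * n ** (-alpha). Both the power-law form and the alpha
# values are illustrative assumptions, not fitted to real data.

def data_multiplier_to_halve_error(alpha):
    """Factor by which n must grow so that c * n**(-alpha) is cut in half."""
    # c * (k*n)**(-alpha) = 0.5 * c * n**(-alpha)  =>  k = 2 ** (1 / alpha)
    return 2 ** (1 / alpha)

for alpha in (0.5, 0.3, 0.1):
    factor = data_multiplier_to_halve_error(alpha)
    print(f"alpha={alpha}: need about {factor:,.0f}x more data to halve the error")
```

Under this assumption, a shallow exponent such as 0.1 means each halving of the error costs roughly three orders of magnitude more data, which is the "blows up" behaviour described above.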

You're correct that data augmentation is only a marginal technique to squeeze out more performance, and not generally possible in many domains.

From what I've observed of member behavior on HN, I suspect that the downvotes may be a response to tone as opposed to content.