Hacker News
by tambourine_man 5292 days ago
Sometimes I get the feeling that when we had less data, we were forced to think harder and more daringly. I feel we lack new groundbreaking theoretical frameworks because of this.

I don't know if Newton's laws would jump off the page if you simply threw a ball along one million different vectors.

5 comments

On the other hand, a lot of new research (including possibly ground-breaking theoretical results) is only possible now that we have access to large data.

We might be initially processing the large data using relatively simple techniques, but on the reduced data, we can now run more sophisticated methods that actually work because the underlying data comes from a huge number of samples.

As but one example, in computer vision, the concept of "attributes" -- automatically labeling objects using descriptive words instead of categorical ones, i.e., "this thing is like..." rather than "this thing is..." -- has opened the door to a number of exciting advances. One is the concept of "zero-shot learning": automatically recognizing an object that you've never seen an instance of before simply via a description. For example, one could recognize beavers as "small, four-legged furry rodents with big teeth and a flat tail", without having ever seen a beaver before. The training data for this classifier need not include beavers, but only images which match the individual attributes, not necessarily all in the same image -- small, four-legged, furry, rodent, big teeth, flat tail.

This kind of thing was not really possible before, because there just wasn't enough data to train reliable classifiers for each attribute in any kind of automated way.

Finally, as I alluded to at the beginning, these individual attribute classifiers are often relatively simple algorithms, such as Support Vector Machines (SVMs). Yet, the 2nd-stage algorithms that use the attribute values to do something useful, such as the zero-shot learning application described above, are often much more involved/advanced techniques.
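
To make the two-stage idea concrete, here is a minimal toy sketch, not a real vision pipeline. Everything in it is an assumption for illustration: the three-number "features" (size, furriness, tail flatness), the training rows, and the class signatures are all made up. A nearest-centroid rule stands in for the per-attribute SVMs, and the second stage is deliberately trivial (pick the class whose description differs from the predicted attributes in the fewest places), whereas real second-stage methods are usually far more involved.

```python
from math import dist

ATTRIBUTES = ("small", "furry", "flat_tail")

# Hypothetical training data: (features, attribute labels).
# Features are invented numbers: [size, furriness, tail flatness].
# Note that no row has all three attributes at once, so nothing
# "beaver-like" is ever seen during training.
TRAIN = [
    ([0.2, 0.9, 0.1], (1, 1, 0)),
    ([0.9, 0.8, 0.1], (0, 1, 0)),
    ([0.3, 0.1, 0.9], (1, 0, 1)),
    ([0.8, 0.1, 0.8], (0, 0, 1)),
    ([0.9, 0.1, 0.1], (0, 0, 0)),
    ([0.1, 0.2, 0.2], (1, 0, 0)),
    ([0.9, 0.9, 0.8], (0, 1, 1)),
]

# Class descriptions given purely as attribute signatures (also made up).
CLASS_SIGNATURES = {
    "beaver": (1, 1, 1),
    "mouse": (1, 1, 0),
    "bear": (0, 1, 0),
    "whale": (0, 0, 1),
}


def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train_attribute_classifiers(train):
    """Stage 1: one nearest-centroid classifier per attribute (SVM stand-in)."""
    models = {}
    for i, name in enumerate(ATTRIBUTES):
        pos = [x for x, labels in train if labels[i] == 1]
        neg = [x for x, labels in train if labels[i] == 0]
        models[name] = (centroid(pos), centroid(neg))
    return models


def predict_attributes(models, x):
    """Return a 0/1 tuple: 1 where x is closer to the positive centroid."""
    return tuple(
        int(dist(x, models[name][0]) < dist(x, models[name][1]))
        for name in ATTRIBUTES
    )


def zero_shot_classify(models, x):
    """Stage 2: choose the class whose signature has the fewest mismatches."""
    predicted = predict_attributes(models, x)
    return min(
        CLASS_SIGNATURES,
        key=lambda c: sum(p != s for p, s in zip(predicted, CLASS_SIGNATURES[c])),
    )


models = train_attribute_classifiers(TRAIN)
print(zero_shot_classify(models, [0.25, 0.85, 0.9]))  # prints: beaver
```

The point of the toy is only that "beaver" gets recognized from its description even though no training row carries all three attributes together. With real images, each attribute classifier would need a lot of labeled data, which is exactly where the large datasets come in.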

I recently visited the Galapagos Islands. There are two things that made it possible for Darwin to work out his theory after visiting them.

1. Remoteness of location - few outside influences
2. Relatively few species!

Even though it's on the equator, the islands aren't all jungle and animals. The sheer lack of different species made it possible to see every single one of them in a single visit, and allowed Darwin to theorize without thinking he missed something.

Sometimes, simplicity helps with focus.

There's still quite a bit of work going on in various directions; "simple theory on big data" is just one of many research agendas in statistics/ML, and it's not really the majority one (though it gets quite a bit of press). It's what Google pushes in part because it's their competitive advantage: they have more data, and the ability to access/manipulate it in reasonable time, than many other places do, so it makes sense for them to see what they can get out of it.

On the bright side, with so many people online with different perspectives, it becomes easier to expose flaws, mediocre interpretations, etc.

Well, using data at large scale is the new groundbreaking theoretical framework. And it's practical too.