by minimaxir 3989 days ago
The actual problem with learning "data science" is making inferences and conclusions which do not violate the laws of statistics.

I've seen many submissions to Hacker News and Reddit's /r/dataisbeautiful subreddit where the author goes "look, the analysis supports my conclusion and the R^2 is high, therefore this is a good analysis!" without addressing the assumptions required for those results.
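
A minimal sketch of how a high R^2 can sit on top of a violated assumption. The data here is invented for illustration (y is exactly x squared, so the true relationship is not linear at all), and the helper names `fit_line` and `r_squared` are hypothetical. A straight-line fit still scores an R^2 above 0.94, while the residuals show an obvious pattern instead of random scatter.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def r_squared(ys, fitted):
    """Share of the variance in ys explained by the fitted values."""
    y_bar = mean(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Illustrative data only: a purely quadratic relationship.
xs = list(range(1, 21))
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]
r2 = r_squared(ys, fitted)

# Residuals from a well-specified linear model should flip sign often.
# Here they are positive, then negative, then positive again.
sign_changes = sum(
    1 for a, b in zip(residuals, residuals[1:]) if (a < 0) != (b < 0)
)

print(f"R^2 = {r2:.3f}")
print(f"residual sign changes across {len(xs)} points: {sign_changes}")
```

Looking at the residuals (or simply plotting them) is the cheap check that the R^2 alone does not give you.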

Of course, not everyone has a strong statistical background. But I've seen YC-funded big data startups and venture capitalists, who should really, really know better, commit the same mistakes.

"Data science" is a buzzword that is successful only due to obscurity and no one actually caring if the statistics are valid. That's why I've been attempting to open source all my statistical analyses/visualizations, with detailed steps on how to reproduce them. (see my recent /r/dataisbeautiful submissions on reddit: https://www.reddit.com/user/minimaxir/submitted/ )

6 comments

Roughly 80% of the data scientists I know have a PhD in something very math-heavy. The rest have master's degrees. There are programmers who can assist them with the grunt work, but that's just basic programming to help analysts crunch data.

If you want to do data science for real:

1. Get a Master's or PhD in statistics, computer science, economics, physics or some other math-heavy field, and specialize in data analysis in that field. You must learn a lot of statistics while doing so.

2. Learn programming, statistical machine learning and tools of the trade.

Good data science is not based on collecting large amounts of data passively and then mining it mindlessly. You need to ask the right questions and design the data collection and modeling process based on those questions.

I've seen a huge range in the people calling themselves data scientists. Some have very analytically intensive academic degrees, others just finished a data science boot camp, and there are a lot of people that used to be called 'business analysts' who are basically doing the same job with a fancier title. In every group, I've had people tell me that what they're doing is really data science, because data science needs the (academic|integrated|business) perspective that they have, and what the other people are doing isn't really data science.
> Good data science is not based on collecting large amounts of data passively and then mining it mindlessly. You need to ask the right questions and design the data collection and modeling process based on those questions.

This resonates. That is, picking and designing features. Also understanding dependent variables and knowing how to test for them; getting this wrong is the biggest mistake leading to flawed conclusions that I see from the 'general public'.

What do you mean by testing for dependent variables?
Maybe something to do with instrumental variables? https://en.wikipedia.org/wiki/Instrumental_variable
Academic credentials aren't enough; good data-driven decision-making is as much an art as an academic discipline. A p-value of .01 is a Nobel Prize in medicine and unpublishable in physics -- domain knowledge is important to have a feel for the difference.
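
A rough sketch of how wide that gap is, assuming the physics half refers to the five-sigma discovery convention used in particle physics (that reading is an assumption, the comment does not say so). The function names are hypothetical. It converts between "number of sigmas" and a one-sided p-value under a standard normal.

```python
from math import erfc, sqrt
from statistics import NormalDist

def one_sided_p(z):
    """Upper-tail probability of a standard normal beyond z standard deviations."""
    return 0.5 * erfc(z / sqrt(2))

def sigma_for_p(p):
    """Number of standard deviations whose upper-tail probability equals p."""
    return NormalDist().inv_cdf(1 - p)

# Assumed convention: five sigma as the discovery threshold.
p_five_sigma = one_sided_p(5)    # roughly 2.9e-7
z_for_p01 = sigma_for_p(0.01)    # roughly 2.33

print(f"5 sigma  -> one-sided p = {p_five_sigma:.2e}")
print(f"p = 0.01 -> {z_for_p01:.2f} sigma (one-sided)")
```

Under these assumptions, p = .01 is only a bit over two sigma, several orders of magnitude weaker than a five-sigma result.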
Assuming that smart autodidacts can't obtain sound statistics knowledge is selling many people short.
I think you are right in that it sells many people short, but then again having no good academic credentials is selling yourself short.

Data science is not like security. There it is more accepted that good engineers/researchers do not necessarily have the best accreditation. It seems that data science/engineering is coming around to this, though.

It's not that autodidacts cannot build bridges; it is that the people with the data and money do not want their bridges built by autodidacts.

Anyway... back to studying http://statweb.stanford.edu/~tibs/ElemStatLearn/ for me :).

No. A phd in statistics or economics means almost nothing at this point. Even if it did, truly, signal mastery of the content, which it doesn't anymore, it would signal to most people who do this kind of work that you're way overqualified while simultaneously being totally ignorant of the day-to-day work of actual data scientists.

If you want to be a useful data scientist, do a lot of work with data. If you have strong programming skills and are flexible and a quick learner then you will do well.

Spending the better part of your young adulthood getting a phd in statistics, unless you want to go into academia, just makes you look like a fool.

There absolutely are problems that require a more rigorous mathematical training than you get from undergraduate courses or day-to-day experience. Most data scientists and companies may not be tackling these problems, but they certainly exist.

Just having a PhD will open doors for you that would otherwise be shut. But before pursuing that degree, you should be confident that you enjoy working in the field and want to devote your career to it. Also, you have to be prepared to work hard, not just to get the degree, but then to land a job where you'll put that experience to use. Otherwise, you'll be sharing a cubicle with DataWorker and feeling like a fool.

That said, if you don't know whether you need a PhD, that means you probably don't know what kinds of problem you want to work on. And in that case, there's a good chance you'll end up working on a problem that only interests your advisor and nobody else (most PhD advisors have more students than they have good problems to work on). In that case, I wouldn't recommend it.

I've had the complete opposite experience. Do the people who hire for this kind of work often bet on non-PhD candidates? Do they trust themselves to separate the wheat from the chaff?

Don't you want a colleague who is able to mention seminal papers for specific problems? Who is able to read and understand these papers and can distill useful features and optimizations from them?

People with PhDs who go into business usually end up in the better positions. They hire other PhDs for the good positions to keep the signal (mastery of the content) stronger.

As someone who did a lot of work with data I have little problem with my usefulness, but a lot of problems opening doors to the really interesting data companies (lacking a proper academic network). I wish I had gotten that PhD, because right now applying to Google, Microsoft, Facebook, Yahoo or eBay for data science positions makes me look like a fool.

I've met a lot of fools who've quoted all the right works, in both Computer Science and Data Science. Computer Science fools usually get fired. Data Science fools seem to get promoted to Yes-man status. It's a lot harder to lie about your code than it is with statistics; as the old adage goes, it's right behind Lies and Damned Lies.
You've said "No", but you haven't countered the poster's claims. Are in fact most data scientists PhD/Masters people? I hear the same information at a mid-sized tech company. I also hear similar things about Intel.
>Spending the better part of your young adulthood getting a phd in statistics, unless you want to go into academia, just makes you look like a fool.

This is why you are a DataWorker, and not a dataScientist.

Anyone can push bits around. It takes a trained mind to corral them using careful experimentation and observation.

It's dangerous to make big generalizations like "no one actually caring if the statistics are valid." This simply is not true. Sure, a lot of what you see on /r/dataisbeautiful is garbage, but that's because it's an open forum where anyone can show what they think they have found. Usually, whenever someone makes an egregious statistical error, they are called out for it. Of course, the same happens on larger scales and even in published research.

"Data science" at its core is just statistical analysis, but it has been slowly morphing over the past few decades thanks to the budding field of machine learning and the commoditization of computing power. This has drastically changed the field of statistical research, and although the underlying math is the same, the tools and the amount of data are constantly in flux. Someone along the way must have felt that this evolution of statistical analysis needed a new name. In all honesty, it's just a name, and it doesn't matter. What matters is if you understand how to use it.

The classic venn diagram of data science is still helpful: http://drewconway.com/zia/2013/3/26/the-data-science-venn-di...

This article reads like a way to find yourself in the danger zone.

I've never seen this Venn diagram before--thanks for bringing it up. As an academic (pursuing a Ph.D. in astrophysics), I find that plenty of traditional researchers are able to hack together code (many haven't ever taken a formal programming course; http://arxiv.org/abs/1507.03989) but many also misuse or can't interpret statistics (from personal experience). That puts us in the danger zone!

I think the key is to find the mathematics and statistics interesting because you want the [data] science to be meaningful. If that's a driving force, then you can learn math and statistics on your own (like the author did). Otherwise, yes--you will find yourself in the danger zone.

What if you just want to get paid well to play with interesting tools?
I assume data science is often used in place of astrology. People just want to have something to cling to, to get over their fears and insecurity. So if you can generate some reassuring graphs, who cares if they are based on solid statistics or not?
>People just want to have something to cling to, to get over their fears and insecurity.

The other side of this is that some businesses (especially SMBs) are so horrible at utilizing their data that very basic analyses can reap big gains (80/20 rule!). For the vast majority of businesses there is no need for elaborate models or machine learning techniques.

>see my recent /r/dataisbeautiful submissions on reddit: https://www.reddit.com/user/minimaxir/submitted/

They all seem very well-presented[0], but I can't help but ask - what do you do with this new information?

[0] https://i.imgur.com/PvWYB2n.png

I am planning a blog post on the relationship between YouTube video duration and other statistics. So I did a little exploratory analysis to validate the data.

As shown, the distribution of durations in Music videos is much, much different than all other categories. As a result, it skews nearly every other analysis and I may have to exclude videos from the Music category entirely.
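
To make the concern concrete, here is a small sketch with made-up numbers. The durations, the category names other than Music, and the direction and size of the Music difference are all invented for illustration and are not taken from the actual YouTube data. The helper `pooled_median` is hypothetical. It shows how one category with a very different distribution can dominate a pooled summary statistic.

```python
from statistics import median

# Hypothetical durations in seconds, invented for illustration only.
durations = {
    "Music": [180, 200, 210, 215, 220, 225, 230, 240],
    "Gaming": [300, 600, 900, 1200, 1500],
    "Education": [240, 480, 720, 960, 1400],
}

def pooled_median(data, exclude=()):
    """Median over all categories, skipping any category named in exclude."""
    values = [
        v
        for category, vals in data.items()
        if category not in exclude
        for v in vals
    ]
    return median(values)

all_median = pooled_median(durations)
without_music = pooled_median(durations, exclude=("Music",))

print(f"pooled median, all categories: {all_median}")
print(f"pooled median, Music excluded: {without_music}")
```

With these invented numbers the pooled median moves from 270 to 810 seconds once Music is dropped, which is the kind of shift that would justify analysing that category separately.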

What you described is prevalent in the social sciences: a ton of biases, causation/correlation errors, and blind trust in arcane statistical analyses without knowing what they really mean. Conclusion: "statistically significant" is the magic phrase peppered throughout academic literature.
Agreed. Fortunately, excellent social scientists really care about this -- see Andy Gelman's blog for many rants on this topic.