
by minimaxir 3989 days ago

The actual problem with learning "data science" is making inferences and conclusions which do not violate the laws of statistics.

I've seen many submissions to Hacker News and Reddit's /r/dataisbeautiful subreddit where the author goes "look, the analysis supports my conclusion and the R^2 is high, therefore this is a good analysis!" without addressing the assumptions required for those results.
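To make the R^2 point concrete, here is a minimal sketch in Python (standard library only). The data are synthetic and made up purely for illustration, not taken from any submission, and the helper name `fit_line` is hypothetical. Here y is exactly x squared, yet a straight-line fit still reports an R^2 of roughly 0.95. The residuals are positive at both ends and negative in the middle, which is the kind of pattern that flags a violated linearity assumption.

```python
# Synthetic illustration: a straight line fitted to curved data.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]  # the true relationship is quadratic, not linear


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


intercept, slope, r_squared = fit_line(xs, ys)

# Residuals = observed minus fitted; a systematic shape here means the
# linear model is misspecified, no matter how high R^2 is.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(f"R^2 = {r_squared:.3f}")
print("residuals:", [round(r, 1) for r in residuals])
```

A high R^2 on its own says nothing about whether the residuals look like noise, so checking them is a cheap first test of the model's assumptions.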

Of course, not everyone has a strong statistical background. Except I've seen YC-funded big data startups and venture capitalists, who should really, really know better, commit the same mistakes.

"Data science" is a buzzword that is successful only due to obscurity and no one actually caring if the statistics are valid. That's why I've been attempting to open source all my statistical analyses/visualizations, with detailed steps on how to reproduce. (see my recent /r/dataisbeautiful submissions on reddit: https://www.reddit.com/user/minimaxir/submitted/ )

6 comments

nabla9 3989 days ago

Roughly 80% of data scientists I know have a PhD in something very math heavy. The rest have master's degrees. There are programmers who can assist them by doing the grunt work, but it's just basic programming to help analysts crunch data.

If you want to do data science for real:

1. Get a Master's or PhD in statistics, computer science, economics, physics or some other math-heavy field and specialize in data analysis in that field. You must learn lots of statistics when doing so.

2. Learn programming, statistical machine learning and tools of the trade.

Good data science is not based on collecting large amounts of data passively and then mining it mindlessly. You need to ask the right questions and design the data collection and modeling process based on those questions.

dworin 3989 days ago

I've seen a huge range in the people calling themselves data scientists. Some have very analytically intensive academic degrees, others just finished a data science boot camp, and there are a lot of people that used to be called 'business analysts' who are basically doing the same job with a fancier title. In every group, I've had people tell me that what they're doing is really data science, because data science needs the (academic|integrated|business) perspective that they have, and what the other people are doing isn't really data science.

whistlerbrk 3989 days ago

> Good data science is not based on collecting large amounts of data passively and then mining it mindlessly. You need to ask the right questions and design the data collection and modeling process based on those questions.

This resonates. That is, picking and designing features. Also understanding dependent variables and knowing how to test for them; getting this wrong is the biggest mistake leading to flawed conclusions that I see from the 'general public'.

stdbrouw 3989 days ago

What do you mean by testing for dependent variables?

letourdefrance 3988 days ago

Maybe something to do with instrumental variables? https://en.wikipedia.org/wiki/Instrumental_variable

qq66 3989 days ago

Academic credentials aren't enough; good data-driven decision-making is as much an art as an academic discipline. A p-value of .01 is a Nobel Prize in medicine and unpublishable in physics -- domain knowledge is important to have a feel for the difference.
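The gap between those two thresholds can be put in numbers. A rough sketch, assuming a standard normal test statistic and the commonly cited five-sigma discovery convention in particle physics, both treated as one-sided here; the helper names `p_to_sigma` and `sigma_to_p` are hypothetical:

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_to_sigma(p_value):
    """One-sided p-value -> number of standard deviations (z-score)."""
    return STD_NORMAL.inv_cdf(1.0 - p_value)


def sigma_to_p(sigma):
    """Number of standard deviations -> one-sided p-value."""
    return 1.0 - STD_NORMAL.cdf(sigma)


print(f"p = 0.01 is about {p_to_sigma(0.01):.2f} sigma")
print(f"5 sigma is about p = {sigma_to_p(5.0):.1e}")
```

Under these assumptions, p = .01 sits a little above two sigma, while five sigma corresponds to a p-value several orders of magnitude smaller.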

decisiveness 3989 days ago

Assuming that smart autodidacts can't obtain sound statistics knowledge is selling many people short.

Baghard 3989 days ago

I think you are right in that it sells many people short, but then again having no good academic credentials is selling yourself short.

Data science is not like security. There it is more accepted that good engineers/researchers do not necessarily have the best accreditation. It seems that data science/engineering is turning around to this though.

It's not that autodidacts cannot build bridges, it is that the people with the data and money do not want their bridges built by autodidacts.

Anyway... back to studying http://statweb.stanford.edu/~tibs/ElemStatLearn/ for me :).

DataWorker 3989 days ago

No. A phd in statistics or economics means almost nothing at this point. Even if it did, truly, signal mastery of the content, which it doesn't anymore, it would signal to most people who do this kind of work that you're way overqualified while simultaneously being totally ignorant of the day-to-day work of actual data scientists.

If you want to be a useful data scientist, do a lot of work with data. If you have strong programming skills and are flexible and a quick learner then you will do well.

Spending the better part of your young adulthood getting a phd in statistics, unless you want to go into academia, just makes you look like a fool.

freyr 3989 days ago

There absolutely are problems that require a more rigorous mathematical training than you get from undergraduate courses or day-to-day experience. Most data scientists and companies may not be tackling these problems, but they certainly exist.

Just having a PhD will open doors for you that would otherwise be shut. But before pursuing that degree, you should be confident that you enjoy working in the field and want to devote your career to it. Also, you have to be prepared to work hard, not just to get the degree, but then to land a job where you'll put that experience to use. Otherwise, you'll be sharing a cubicle with DataWorker and feeling like a fool.

That said, if you don't know whether you need a PhD, that means you probably don't know what kinds of problems you want to work on. And in that case, there's a good chance you'll end up working on a problem that only interests your advisor and nobody else (most PhD advisors have more students than they have good problems to work on). In that case, I wouldn't recommend it.

Baghard 3989 days ago

I've had the complete opposite experience. Do the people who hire for this kind of work often bet on non-PhD candidates? Do they trust themselves to separate the wheat from the chaff?

Don't you want a colleague who is able to mention seminal papers for specific problems? Who is able to read and understand these papers and can distill useful features and optimizations from them?

People with PhDs who go into business usually end up in the better positions. They hire other PhDs for the good positions to keep the signal (mastery of the content) stronger.

As someone who did a lot of work with data I have little problem with my usefulness, but a lot of problems opening doors to the really interesting data companies (lacking a proper academic network). I wish I had gotten that PhD, because right now applying to Google, Microsoft, Facebook, Yahoo or eBay for data science positions makes me look like a fool.

GauntletWizard 3989 days ago

I've met a lot of fools who've quoted all the right works, in both Computer Science and Data Science. Computer Science fools usually get fired. Data Science fools seem to get promoted to Yes-man status. It's a lot harder to lie about your code than it is with statistics; as the old adage goes, it's right behind Lies and Damned Lies.

threatofrain 3989 days ago

You've said "No", but you haven't countered the poster's claims. Are in fact most data scientists PhD/Masters people? I hear the same information at a mid-sized tech company. I also hear similar things about Intel.

searine 3989 days ago

>Spending the better part of your young adulthood getting a phd in statistics, unless you want to go into academia, just makes you look like a fool.

This is why you are a DataWorker, and not a dataScientist.

Anyone can push bits around. It takes a trained mind to corral them using careful experimentation and observation.

washedup 3989 days ago

It's dangerous to make big generalizations like "no one actually caring if the statistics are valid." This simply is not true. Sure, a lot of what you see on /r/dataisbeautiful is garbage, but that's because it's an open forum where anyone can show what they think they have found. Usually, whenever someone makes an egregious statistical error, they are called out for it. Of course, the same happens on larger scales and even in published research.

"Data science" at its core is just statistical analysis, but it has been slowly morphing over the past few decades thanks to the budding field of machine learning and the commoditization of computing power. This has drastically changed the field of statistical research, and although the underlying math is the same, the tools and the amount of data are constantly in flux. Someone along the way must have felt that this evolution of statistical analysis needed a new name. In all honesty, it's just a name, and it doesn't matter. What matters is if you understand how to use it.

probdist 3989 days ago

The classic Venn diagram of data science is still helpful: http://drewconway.com/zia/2013/3/26/the-data-science-venn-di...

This article reads like a way to find yourself in the danger zone.

jwuphysics 3988 days ago

I've never seen this Venn diagram before--thanks for bringing it up. I find, as an academic (pursuing a Ph.D. in astrophysics), that plenty of traditional researchers are able to hack together code (many haven't ever taken a formal programming course; http://arxiv.org/abs/1507.03989) but many also misuse or can't interpret statistics (from personal experience). That puts us in the danger zone!

I think the key is to find the mathematics and statistics interesting because you want the [data] science to be meaningful. If that's a driving force, then you can learn math and statistics on your own (like the author did). Otherwise, yes--you will find yourself in the danger zone.

ForHackernews 3989 days ago

What if you just want to get paid well to play with interesting tools?

facepalm 3989 days ago

I assume data science is often used in place of astrology. People just want to have something to cling to, to get over their fears and insecurity. So if you can generate some reassuring graphs, who cares if they are based on solid statistics or not?

forgetsusername 3989 days ago

>People just want to have something to cling to, to get over their fears and insecurity.

The other side of this is that some businesses (especially SMBs) are so horrible at utilizing their data that very basic analyses can reap big gains (80/20 rule!). For the vast majority of businesses there is no need for elaborate models or machine learning techniques.

liviu- 3989 days ago

>see my recent /r/dataisbeautiful submissions on reddit: https://www.reddit.com/user/minimaxir/submitted/

They all seem very well-presented[0], but I can't help but ask - what do you do with this new information?

[0] https://i.imgur.com/PvWYB2n.png

minimaxir 3989 days ago

I am planning a blog post on the relationship between YouTube video duration and other statistics. So I did a little exploratory analysis to validate the data.

As shown, the distribution of durations in Music videos is much, much different than all other categories. As a result, it skews nearly every other analysis and I may have to exclude videos from the Music category entirely.

curiousjorge 3989 days ago

What you described is prevalent in the social sciences: tons of biases, causation/correlation errors, and blind trust in some arcane statistical analysis without knowing what it really means. Conclusion: "statistically significant" is the magic phrase peppered throughout academic literature.

achompas 3989 days ago

Agreed. Fortunately, excellent social scientists really care about this -- see Andy Gelman's blog for many rants on this topic.

Balgair 3989 days ago

I'll echo for biology and sports sciences (think moneyball).

http://www.wired.com/2009/09/fmrisalmon/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1182327/