by pvnick 3988 days ago
Good article for beginners. A couple thoughts, just to build on what the author said:

First off, data science == fancy name for data mining/analysis. Wanted to clear that up due to the buzzwordy nature of "data science."

Learn SQL - this is the big one. You must be proficient with SQL to be effective at data science. Whether it's running on an RDBMS or translating to map/reduce (Hive) or DAG (Spark), SQL is invaluable. If you don't know what those acronyms mean yet, don't worry. Just learn SQL.

Learn to communicate insights - I would add here to try some UI techniques. Highcharts, d3.js, these are good libraries for telling your data story. You can also do a ton just with Excel and not need to write any code beyond what you wrote for the mining portion (usually SQL).

I would also go back to basics with regard to statistical techniques. Start with the simple z-score; it is such an important tool in your data science toolbox. If you're just looking at raw numbers, try to z-normalize the data and see what happens. You'd be surprised what you can achieve with a high school statistics textbook, Postgres/MySQL (or even Excel!), and a moderate-sized data set. These are powerful enough to answer the majority of your questions, and when they fail, move on to sexier algorithms.
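
To make the z-normalization point concrete, here is a minimal sketch in plain Python. The `z_scores` helper, the `daily_orders` numbers, and the 2-standard-deviation cutoff are all made up for illustration; it uses the population standard deviation, which is one reasonable choice among several.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mean) / std for each x, using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - mu) / sigma for x in values]


# Hypothetical daily order counts; the last day is unusually busy.
daily_orders = [102, 98, 105, 97, 101, 99, 180]
scores = z_scores(daily_orders)

# Flag anything more than 2 standard deviations from the mean (arbitrary cutoff).
outliers = [x for x, z in zip(daily_orders, scores) if abs(z) > 2]
print(outliers)
```

On raw numbers the busy day just looks "bigger"; after normalizing, it is the only point beyond the cutoff, which is the kind of thing the z-score makes easy to see.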

Edit: one more thing I forgot to mention. After SQL, learn Python. There are a ton of libraries in the Python ecosystem that are perfect for data science (numpy, scipy, scikit-learn, etc.). It's also one of the top languages used in academic settings. My preferred data science workspace involves Python, IPython Notebook, and Pandas (This book is quite good: http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython...)

6 comments

Great comment.

BTW, you can make interactive visualizations in pure Python with Bokeh: http://bokeh.pydata.org/en/latest/

Also, with Blaze you can use Pandas (or even dplyr) syntax in Python to query Hive, Spark and other large stores. http://blaze.pydata.org/en/latest/

This is the truth. People can't do simple statistics, even with advanced degrees. In many cases advanced degrees make things worse. The ability to reason about data and strong fundamentals in math and statistics are what's needed.

Someone else mentioned Gelman's blog. That's a great place to find evidence that PhDs do not lead to an increased ability to ferret out "truth" or insight from data. In many cases they just hide the mistakes so that others without that background don't know they're being misled.

But how do you become a good data scientist, rather than just a technical person who knows how to apply an algorithm in Python/R?

What I am trying to ask is how you become good at setting your starting point (formulating your hypotheses), communicating your insights, and selecting which tools apply where. If you are good at coding and have experience in things related to computer science, you have the abilities to handle a dataset (SQL knowledge) and the data tools (Python, Pandas, etc.), but that doesn't earn you the title of data scientist.

Practice. And education, but mostly practice. This is the kind of thing that is typically taught in formal educational settings (at least in engineering, which is my experience). As an example, I learned more about probability & statistics in 1) AP biology in high school, and 2) a "simulation systems" class in my industrial engineering master's curriculum. We spent much of the former class learning basic statistical analysis techniques (ANOVA, chi-square, etc) to apply to our lab data, and the latter class was all about statistical analysis of process flows (aimed at the real life problem of factory production planning & scheduling and manufacturing process optimization).

So, do I consider myself a data scientist? Absolutely not. But I do understand basic statistical concepts and know how to apply them to several categories of real-life data analysis problems.

I'm a terrible coder, btw.

Would you recommend any approach, or should I dust off my high school and college books in search of study material? Or is that material too basic?

I would shift priorities to Python (from SQL).

Unless one holds a "data scientist" title just to make "database engineer" sound fancier, data comes in various shapes and forms. And most questions cannot be answered with a simple aggregation.

For example, the data I work on (I am a freelance data scientist) comes as flat CSV files, XLS files, JSON files, some text files I need to parse, various SQL databases, MongoDB, things I am getting from various APIs, etc...

While understanding joins is crucial (and normal forms, etc.), SQL itself takes a negligible amount of my time (and effort).

I would disagree with that advice. If you work as a data scientist in a company, you will likely have the logs of something stored in an SQL table (be it a pure SQL database or something like Hadoop Hive), and you will have to answer (and ask) questions like: "Do people convert more when they come from X or Y?", so you will have to run a couple of queries to get the conversion rates of people coming from X and Y.
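
For what it's worth, the "X or Y" question above can be sketched as a single grouped query. This is only an illustration run against an in-memory SQLite database from Python; the `visits` table, its columns, and the rows are invented, and a real log table in Hive or an RDBMS would look different.

```python
import sqlite3

# Hypothetical visit log: one row per visitor, with the traffic source and
# a 0/1 flag saying whether the visitor converted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER, source TEXT, converted INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        (1, "X", 1), (2, "X", 0), (3, "X", 0), (4, "X", 1),
        (5, "Y", 0), (6, "Y", 0), (7, "Y", 1), (8, "Y", 0),
    ],
)

# AVG over a 0/1 column is the conversion rate for each source.
query = """
    SELECT source,
           COUNT(*)       AS visitors,
           SUM(converted) AS conversions,
           AVG(converted) AS conversion_rate
    FROM visits
    GROUP BY source
    ORDER BY source
"""
rates = {}
for source, visitors, conversions, rate in conn.execute(query):
    rates[source] = rate
    print(source, visitors, conversions, rate)
```

With samples this small the difference between the two rates means nothing on its own; deciding whether it is real is where the basic statistics come back in.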

This was my experience when I worked as a data scientist about a year ago. Now, YMMV, especially if you're a freelancer; I guess your clients are more comfortable giving you raw dumps of data as files than giving you access to their database servers.

I work as a freelancer. And actually, I never ever processed logs.

Of course, sometimes I am given SQL access to a server; but I never learnt SQL except in action (i.e. the things I needed right then).

And most of the time I work with flat files. Even if they come from SQL, they typically need serious preprocessing before I can do a more advanced analysis.

BTW: I have no problems composing rather advanced queries. It's just that if SQL is a problem for someone (and, when in doubt, it can't be Googled in no time), then I am curious how they can grasp machine learning.

Regarding SQL, have you noticed any increase in the usage of "window functions" (how important do you find them for your work?)
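
For anyone who hasn't met window functions yet, here is a small sketch of what they do, run through Python's built-in `sqlite3` module (it assumes the bundled SQLite is 3.25 or newer, which is when window functions were added). The `sales` table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2015-01-01", "east", 10),
        ("2015-01-02", "east", 30),
        ("2015-01-03", "east", 20),
        ("2015-01-01", "west", 5),
        ("2015-01-02", "west", 15),
    ],
)

# Unlike GROUP BY, a window function keeps every input row and adds a value
# computed over that row's partition: here a per-region running total and a
# per-region rank by amount.
query = """
    SELECT region, day, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY day)     AS running_total,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC)  AS rank_in_region
    FROM sales
    ORDER BY region, day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Running totals, rankings within a group, and "previous row" comparisons are the usual reasons to reach for them; without window functions these tend to need self-joins or post-processing outside the database.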

You're basically describing stuff I was doing like 20 freaking years ago. Minus the Hive & Spark & Highcharts & d3.js - naturally.

But back then I couldn't get any of my managers to understand or appreciate what I was doing. Fickle finger of fate.

Hell, even William Gosset was doing "data science" when he popularized Student's t-distribution while working for the Guinness Brewery back in 1908.

I lived through Statistics, Business Analysis, Decision Analytics, Data Analytics, Data Mining, and now Data Science. Same thing renamed over and over again.

Regarding the post above, it's right. A data scientist is someone better at statistics (classical stats, Bayesian, machine learning) than a computer scientist, and better at programming (SQL, R/Python for building models) than an academic statistician. Plus a teaspoon of visualization (ggplot or d3).

AI has gone through the same sort of buzzword treadmill, and so has programming in general. Only after living through a few cycles does it really become obvious how cyclic these sorts of trends are.

I'm trying to work on being less jaded about it, and not letting my annoyance with the-new-trendy-thing-that-i-remember-doing-years-ago-under-a-different-name get in the way of learning new technology and new lessons.

But it's a struggle.