| Good article for beginners. A couple thoughts, just to build on what the author said: First off, data science == fancy name for data mining/analysis. Wanted to clear that up due to buzzwordy nature of "data science." Learn SQL - this is the big one. You must be proficient with SQL to be effective at data science. Whether it's running on an RDBMS or translating to map/reduce (Hive) or DAG (Spark), SQL is invaluable. If you don't know what those acronyms mean yet, don't worry. Just learn SQL. Learn to communicate insights - I would add here to try some UI techniques. Highcharts, d3.js, these are good libraries for telling your data story. You can also do a ton just with Excel and not need to write any code beyond what you wrote for the mining portion (usually SQL). I would also go back to basics with regards to statistical techniques. Start with your simple Z Score, this is such an important tool in your data science toolbox. If you're just looking at raw numbers, try to Z-normalize the data and see what happens. You'd be surprised what you can achieve with a high school statistics textbook, Postgres/MySQL (or even Excel!), and a moderate-sized data set. These are powerful enough to answer the majority of your questions, and when they fail then move on to more sexy algorithms. Edit: one more thing I forgot to mention. After SQL, learn Python. There are a ton of libraries in the python ecosystem that are perfect for data science (numpy, scipy, scikit-learn, etc). It's also one of the top languages used in academic settings. My preferred data science workspace involves Python, IPython Notebook, and Pandas (This book is quite good: http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython...) |
BTW, you can make interactive visualizations in pure python with bokeh: http://bokeh.pydata.org/en/latest/
Also with Blaze, you can use Pandas (or even Dplyr) syntax in python to query Hive, Spark and other large stores. http://blaze.pydata.org/en/latest/