| I do machine learning and would consider myself a data scientist. I was an engineer who decided to do an advanced degree in statistics and computer science just because I liked this stuff. I currently work in the analytics division of a small company: 1. It's not necessary to have the "right" degrees: I did, but a lot of people on my team come from the social sciences and other fields. You might find it hard to get past HR, but that can be fixed by cleaning up the "weird parts" of your resume and highlighting the "right parts". You seem to have a good handle on what is what on that front. 2. Your statistics, linear algebra and probability skills need to be up to par. People from a more statistical background will grill you on those things. It's extremely easy to see whether a person can think statistically by giving them a toy data problem and asking them to hack at it. The way to train for it is to play around with small datasets, and I see you have been doing that a bit. 3. People who come from a more C.S. side of things will try to explore your knowledge of "machine learning algorithms", which are typically easy to learn if you know the math behind them. The field has a lot of jargon, which can make it look fancier than it is. Again, the math behind these algorithms is not hard, but there are things you learn about how these algorithms work in practice that really make a difference. So again, doing small projects and putting them up on GitHub will help you learn more and make your resume look good. 4. Technology: There are loads of languages that are used in practice. Make sure you know one scripting language (R, Python with SciPy/NumPy, or even Matlab) and are comfortable using it as your scratch pad. The statistically oriented people on my team use R. Other skills that are extremely valuable and wouldn't hurt to learn are the MapReduce stack (Java (ugh), Pig). I am currently doing machine learning on a dataset.
This typically involves playing around with the data in NumPy and sometimes Matlab. Once I am comfortable with a particular choice of algorithms, I try to write them up in Pig. I use Java (Hadoop) for the worst-case scenario. Hope this helps... |
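For anyone wondering what "hacking at a toy data problem" in NumPy might look like, here is a minimal sketch. It is not the workflow described above, just one plausible illustration: the data is synthetic (generated with a fixed seed), and the `explore` helper and its output keys are hypothetical names made up for this example.

```python
import numpy as np


def explore(x, y):
    """Return basic summary statistics and a least-squares line fit of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Design matrix with an intercept column, so the fit is y ~ intercept + slope * x.
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    intercept, slope = coef
    residuals = y - A @ coef

    return {
        "n": x.size,
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "corr": np.corrcoef(x, y)[0, 1],
        "intercept": intercept,
        "slope": slope,
        # ddof=2 because two parameters (intercept and slope) were estimated.
        "residual_std": residuals.std(ddof=2),
    }


# Synthetic toy data (an assumption for illustration): y = 3 + 2x + Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=200)

summary = explore(x, y)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The useful part of an exercise like this is less the code than the questions it raises: does the fitted slope land near the value used to generate the data, how large is the residual spread, and what would change with fewer points or more noise.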
2) This is really my weakness since my stats and linear algebra are passable but not great. There are several free datasets (mostly from data.gov) that I've been playing around with. Should I "publish" the results of my practice studies on a portfolio-esque site? Or is it sufficient to just know the techniques well enough to answer interview questions?
3) I'm fairly well acquainted with machine learning - my interest in machine learning is one of the driving forces for me to take up neuroscience as a career.
4) Great information, thanks. I'm glad to see people use technology more as a scratchpad and less as a regimented "You must know XYZ tech stack". R and SciPy are on my to-do list; I'll add MapReduce.
As a machine learning guy in an analytics department, are you hunting through internally generated numbers to find trends (like sales, ad placement, etc.)? Or are you hunting through externally generated data to find new trends/products/markets?