by leecarraher 3223 days ago
It will depend on the level at which you plan to engage in the ML/AI space. If you just want a job in ML/AI, you are in luck. Due to the growing assortment of available, mostly to fully automated solutions like DataRobot, H2O, scikit-learn, and Keras (with TensorFlow), the only math you will absolutely 'need' is probably just statistics. Regardless of what's going on behind the scenes with whatever automatically tuned and selected algorithm your chosen solution uses, you will still need some stats in the end to show the brass that 'your' model works. The upside is that you can then spend your time learning feature extraction, data engineering, and the aforementioned toolkits, in particular what models they make available.

If you want to develop new techniques and algorithms, then the sky's the limit, though you'll of course want stats too.
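To make the "stats to show the brass that your model works" point concrete, here is a minimal sketch of one common approach: a percentile bootstrap confidence interval around a classifier's accuracy. The function name `bootstrap_accuracy_ci` and the labels are hypothetical, made up for illustration; nothing here comes from the tools named above, and other interval methods would be equally reasonable.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    Hypothetical helper for illustration; not part of any library.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    rng = random.Random(seed)
    # 1 where the prediction was right, 0 where it was wrong.
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    point = sum(correct) / n

    # Resample the per-example outcomes with replacement and recompute accuracy.
    resampled = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)
        resampled.append(sum(sample) / n)
    resampled.sort()

    # Read the interval off the empirical percentiles.
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper


# Made-up labels: 100 examples, 80 of them predicted correctly.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 10
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1] * 10

point, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
print(f"accuracy = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Reporting the interval rather than the bare accuracy is the kind of thing that still falls to you even when the model itself was picked and tuned automatically.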


Can you recommend a Stats course that would be most relevant for people trying to be more practitioners (not researchers)?
I found this course very helpful; it has a good balance of reading material and labs to apply what you learn. The course is from UT Austin's Department of Statistics and Data Sciences.

[Foundations of Data Analysis](https://courses.edx.org/courses/course-v1:UTAustinX+UT.7.11x...)

Note: In this course, Dr. Michael J. Mahometa uses R. But I'd recommend not focusing on R vs Python debates; the goal of this course is to learn about Statistics & Data Analysis in real-world scenarios. With that in mind, even just going through the reading material and lecture videos will be valuable enough if you're starting from scratch (but I'd recommend taking the extra step and completing the Labs too).

This https://www.amazon.com/Probability-Statistics-Engineers-Scie... is the newer version of the stats book I had in undergrad. But @anst makes a good point about scikit-learn: there is a lot of good math to learn just from the docs, and you can then investigate further on wiki, Quora, and Stack Exchange.

For what's up in data science, I like datatau.com. There are some great podcasts too, like datascienceathome and partiallyderivative (there are lists).

Please just recommend the best online stats course you know of as a general toolbelt-notch.
There's a series of courses on Coursera, part of a Specialization from Duke titled something like "Statistics and Probability with R". I've taken the first few classes in that series and have found them pretty good. The class on Bayesian Statistics is a little more difficult, but not too bad. I'll just say that you might want to complement the class with another book or other references on Bayesian stats. I've used this book:

https://www.amazon.com/Bayes-Rule-Tutorial-Introduction-Baye...

What "maths" is keras? Or scikit-learn? For what it's worth, to understand the scikit-learn docs and tutorials I'd say you'll need Probability, Linear Algebra, Multivariate Calculus and, yeah, Stats. Not necessarily at a PhD level, but still. And the more maths you understand, the farther you can get in AI/ML.
These libraries leave most of the actual day to day work for ETL. ETL happens to be highly data and problem dependent, so it can't be easily automated or reused. For this reason I think the best thing to be a good applied ML person is a solid programming background. You should have a working knowledge of statistics and linear algebra, but the most useful skill really is being able to write good code. It's different for research of course.
Those are ML and AI frameworks that use a tremendous amount of mathematics under the hood, but you can also reliably treat them as black-box learning systems. Understanding the model generation procedure and setup is often unneeded. Many tools will help direct you toward which algorithms make the most sense for your data, and even run competitions to figure out which actually works best. I agree, it's a little disappointing, but admittedly it doesn't take a PhD to do this stuff anymore.

It is important to note that just because you can do all the stuff a PhD Scientist might regularly do, doesn't mean that someone will hire you for it. In that case you might need to have a PhD in mathematics, computer science or a related field. But that is more a consequence of competition and long term talent investment, than the practice of ML/AI itself.

Competition (labor supply side) and ultimate success of current ML approaches.

As the market starts to overheat, it seems that there will be a labor shortage: good-quality workers will be scarce and we'll have to make simple tools for simpletons. But this is all a huge "if". Eventually the market will contract a lot, and slack labor market conditions will have companies hiring PhDs.

It's not just competition: a clear understanding of what happens under the hood will make you a better user of the tool.
Want to try using a completely automated black-box ML pipeline like TPOT? Go right ahead. Good luck selling it to your product manager.
Can you please expand on this comment for those of us who are ML/AI-naive?
There exists a tool called TPOT (Tree-based Pipeline Optimization Tool) [0] that aims to automate the knob-twiddling that tends to go with optimizing Machine Learning models. As these models often have a number of parameters to tune and tweak over large scales, such a tool can be useful to identify performant combinations of these parameters and save time in doing so.

However, many ML practitioners are wary of similar automated ML pipelines, especially as they focus on non-expert users. A huge part of "data science" is the "data" itself. It often has idiosyncrasies and quirks that must be identified and accounted for in any model that hopes to make useful predictions. There are many pitfalls that come from not understanding the base statistical/mathematical assumptions of these tools, and a simplified Automatic ML Suite runs the risk of providing misleading results when used as a one-size-fits-all solution. Even for expert users, such tools often make it difficult (either by mathematical need or software design) to interpret the reasons and causes for their results. "Black boxes" like this are definitely hard to sell up the chain.

[0] https://github.com/rhiever/tpot
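For readers who have not done the "knob-twiddling" by hand, here is a rough sketch of the manual baseline: an exhaustive grid search scored by k-fold cross-validation. It deliberately uses only the Python standard library and a toy 1-D nearest-neighbour classifier on made-up data; the helper names (`knn_predict`, `cv_accuracy`) and the parameter grid are assumptions for illustration. This is not how TPOT works internally (TPOT searches over whole pipelines with genetic programming); it only shows the kind of search such tools try to take off your hands.

```python
import itertools
import random


def knn_predict(train, x, k, weighted):
    """Classify x by a vote among its k nearest training points (1-D)."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = {0: 0.0, 1: 0.0}
    for xi, yi in neighbors:
        # Either every neighbor counts equally, or closer ones count more.
        votes[yi] += 1.0 / (abs(xi - x) + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)


def cv_accuracy(data, k, weighted, n_folds=5):
    """Mean held-out accuracy over n_folds interleaved folds."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        hits = sum(knn_predict(train, x, k, weighted) == y for x, y in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / n_folds


# Made-up data: the label is 1 when x > 5, with roughly 10% of labels flipped.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    y = int(x > 5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

# The "knobs": every combination in the grid gets cross-validated.
grid = {"k": [1, 3, 5, 9, 15], "weighted": [False, True]}
results = {
    (k, w): cv_accuracy(data, k, w)
    for k, w in itertools.product(grid["k"], grid["weighted"])
}

best_params = max(results, key=results.get)
print("best (k, weighted):", best_params)
print("cross-validated accuracy:", round(results[best_params], 3))
```

Even in this toy version the search only picks the best-scoring combination. Whether the data, the metric, and the folds were sensible choices in the first place is still up to whoever runs it, which is the concern raised above.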

Thanks for clarifying.

These tools do, however, have an important place in saving practitioners time and energy on the "knob-twiddling". It's a little like robot-assisted surgery: the robot doesn't actually do the surgery, but it makes the surgeon's job a whole lot easier.

That is making the assumption that the person using the tool is a surgeon (an expert in the field who could function independently if needed), which is not the targeted demographic of such tools. No-one who understands ML to some non-zero extent would use a plug-and-play ML tool, given that there is ML left to do otherwise. A better analogy would be a janitor pressing the red button on the robot, which then does its complex surgery; if something goes wrong, the janitor would not be able to understand or fix the problem other than by trying to restart it or kick it.
Perhaps, but the meta/hyper-optimization techniques used to implement TPOT, AutoML, etc. are perfectly valid replacements for grid search and stepwise feature selection.