by nickdavidhaynes 3406 days ago
>If data science just becomes a code word for brogramming your way through a set of black-box ML algorithms, then I will welcome the inevitable crash of data science.

A fundamental challenge I see here is how bottom-heavy data science feels now. There are tons of people out there trying to "get into data science" from other fields, but the number of people with substantive domain knowledge, strong programming skills, and the math background to be able to understand the ML black boxes is quite small relative to the number of people calling themselves data scientists. In other words, real insight definitely is (or should be) the goal, but real insight is really hard, and scikit-learn is so easy.

My hope is that this improves over the next 5-10 years - the more mature data science becomes as a discipline/career, the better the education will be and the more experienced people there will be. There is a risk in the meantime, though, that a flood of relatively inexperienced people causes a collapse in expectations for data science, making businesses less eager to hire them in the future.

Strongly disagree. Maybe that's the case at a huge company, but most small organizations I've worked with are extremely top-heavy, filled with STEM PhDs who are very capable, but require 1-3 years to get a useful result and often aren't familiar with programming best practices or how to turn their results into a product. You need a larger team of engineers to make that happen, and if a large share of those engineers are familiar with machine learning, that transition is much easier.

Furthermore, there are a number of practitioners who expect their data to be ready for them in some perfect state. Probably the majority of the task is creating a pipeline for acquiring data and labeling it appropriately if necessary, which may require developing some ontology or classification with guidelines rigid enough that someone in India can delegate the task to a large team. Then the practitioner spends an inordinate amount of time optimizing some heuristic whose meaning drifts over time, or that is completely inconsistent with the goals of the product. These are both problems outside the realm of domain knowledge or experience.

Sorry, I might not have been clear about what I meant by "bottom-heavy". I think we actually agree - as someone who's hiring for DS roles right now, I've seen a ton of exactly what you're talking about.

-Some candidates can write great code, but don't have the math background to understand what ML black boxes are doing.

-Then there are STEM PhDs that have never written non-research (i.e. maintainable) code or had to formulate a qualitative business problem into a quantitative problem they can solve.

Both types of candidates need to come in at a "junior" level and do some on-the-job learning in order to be fully successful data scientists. IMO it appears to be easier to teach STEM PhDs how to code than programmers how to do math, but that might be personal bias (since I came from the former group).

Wonder if the finance roles of quants and quant devs will spread to other industries. Quant devs are math-heavy programmers who might not do original research but can still understand/calibrate/implement the models the pure quants produce. I.e., given an abstract paper with a shiny model (or a hacked-together spreadsheet...), the qdev might need to analyze what Monte Carlo error correction strategies are relevant for the problem, or how a certain market's peculiarities might influence calibrations, etc.

Also, quant devs are heavily involved in building the calculation engines that invoke the models. These engines handle real-time dataflow, calibrations, etc., and are often highly non-trivial.

My guess is that that type of role is relevant in a data science context. This is much more than data cleansing and piping data between databases.

Heck, when I was in school for my CS degree, some people with literature undergraduate degrees went straight into CS graduate programs without too much pain.

Turns out programming never really required much math background; it's that programming poses about the same level of brain teaser as a math education does. So anyone who has survived an advanced math degree would find programming a piece of cake, but that doesn't mean people from non-STEM backgrounds are hopeless at mastering data science.

Yet it's a joke to talk about data science without referencing advanced math concepts. Although it requires significant domain knowledge, data science is not just business analysis aided by a spreadsheet. Modelling is an essential part of it.

> Both types of candidates need to come in at a "junior" level

Then what's the point of the Ph.D.? Why not just go straight from B.S. to junior data scientist then?

In theory, any programmer worth their salt would already know a massive amount of math (comparatively) and should be readily capable of learning more. If you program without a solid understanding of the underlying math, you're not programming. You're typing until it compiles.
Disagree. As someone who knows more mathematics and less programming than the average programmer I'd say the average programmer need not know all that much mathematics at all, if they're not working in a particular area that involves mathematics.
You must already know vector math or be capable of learning it in less than a day. If you don't have that aptitude, then I'd put you higher in the stack.
Vectors aren't terribly advanced mathematics.
You're joking, right? What math, algebra 2 math?
I come from a quantitative social sciences background. I won't math quite as good as your average STEM PhD, but I like to think those of us in social sciences do a pretty good job of building good questions to parameterize squishy qualitative business objectives.
From my experience the biggest hindrance to the future of data science is how crappy it is to learn statistics. And I think this is why a lot of data science courses stop at Z-tests and p values or super basic Bayes' theorem. I think mathematicians and statisticians have a lot of work to do to make more advanced parts of the field more accessible, otherwise we will end up with people ignoring important assumptions and using tools like a black box.
To be fair, learning statistics is hard for the same reason that doing statistics is hard - any statistic involves assumptions, and the different assumptions underlying different models can be very subtle. There's a lot of disagreement among even professional, academic statisticians about fundamental concepts like p values [1] and how to quantify uncertainty under multiple hypothesis testing [2]. Unfortunately, I don't see any of this getting easier any time soon (although I would love to be proven wrong).

[1] http://www.tandfonline.com/doi/full/10.1080/00031305.2016.11...

[2] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1112991/
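
To make the multiple-testing point a bit more concrete, here is a minimal Python sketch. It is not taken from either linked paper: the function names are made up for illustration, and it assumes the tests are independent and every null hypothesis is true, so each p value is uniform on [0, 1]. It shows how fast the chance of at least one false positive grows with the number of tests, and what a Bonferroni-style threshold does about it.

```python
import random


def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent
    tests, each run at significance level alpha, when all nulls are true."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / m


def simulate_familywise_rate(alpha: float, m: int,
                             trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo check: draw m uniform p values per trial (true nulls)
    and count how often at least one falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < alpha for _ in range(m)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    alpha, m = 0.05, 20
    print("analytic family-wise rate:", familywise_error_rate(alpha, m))
    print("simulated family-wise rate:", simulate_familywise_rate(alpha, m))
    corrected = bonferroni_threshold(alpha, m)
    print("Bonferroni per-test threshold:", corrected)
    print("family-wise rate after correction:",
          familywise_error_rate(corrected, m))
```

With 20 tests at the usual 0.05 level, the analytic rate comes out at roughly 64%, which is the kind of thing that's easy to miss when the tests are hidden inside a tool. Bonferroni is only one (conservative) way to handle it, and the independence assumption rarely holds exactly in practice.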

Moving away from Null Hypothesis Testing and towards a more Bayesian approach is a good first step. For me, and I'm sure many others, NHT is a very backwards way of approaching inference. I don't care about an imaginary distribution with mean 0, I have real data I can fit to a distribution directly--what can you tell me about it? Conditioning on the data itself rather than an unobservable parameter of interest is much more intuitive and makes it much easier to report results to non-statisticians.
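
As a rough illustration of what "conditioning on the data" can look like, here is a small Python sketch of a Beta-Binomial update for a conversion-style rate. Everything in it is an assumption for the sake of the example: the uniform Beta(1, 1) prior, the function names, and the numbers (30 successes in 100 trials). The credible interval is approximated by sampling with the standard library rather than computed exactly.

```python
import random


def beta_binomial_posterior(successes: int, trials: int,
                            prior_a: float = 1.0, prior_b: float = 1.0):
    """Return (a, b) of the Beta posterior for a binomial rate,
    given a Beta(prior_a, prior_b) prior and the observed counts."""
    return prior_a + successes, prior_b + (trials - successes)


def posterior_mean(a: float, b: float) -> float:
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def credible_interval(a: float, b: float, mass: float = 0.95,
                      draws: int = 50000, seed: int = 0):
    """Approximate equal-tailed credible interval from sorted posterior draws."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1.0 - mass) / 2.0
    lo_idx = int(tail * draws)
    hi_idx = int((1.0 - tail) * draws) - 1
    return samples[lo_idx], samples[hi_idx]


if __name__ == "__main__":
    a, b = beta_binomial_posterior(successes=30, trials=100)
    low, high = credible_interval(a, b)
    print("posterior mean rate:", posterior_mean(a, b))
    print("approx. 95% credible interval:", (low, high))
```

The output reads as "given this data and this prior, the rate is most plausibly in this range", which tends to be easier to explain to non-statisticians than a p value. The choice of prior is still an assumption you have to defend.
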
I completely agree; I've found it much harder to self-learn the stats than the software side of things. Sibling post makes a good point, but I think the history of stats vs. comp sci bears weight here too; having many people want to learn stats outside academia is a much newer phenomenon than people doing the same with programming.

Anyone have any good resources for self-teaching stats? I have a BS in math but only took one stats course, and it was as terrible as all intro-stats classes are. I have a strong, proof-based understanding of probability theory, but haven't found a similar approach to stats. It all seems to be "if data looks like this, use this test, watch for these pitfalls" which is terrible for building intuition.

Try the Khan Academy stats resources - https://www.khanacademy.org/math/statistics-probability

Datacamp also launched a bunch of new stats courses recently. I haven't checked them out yet, but their courses are usually good quality. https://www.datacamp.com/courses/topic:probablity_and_statis...

If you like proofs and rigor, take a look at "Statistical Inference" by Casella and Berger.