"Statisticians" taught everyone NHST, and relegated bayesian probability to the appendix for decades. Once you realize what has happened there, you will view that title with very little respect.
I am glad to see machine learning, AI, "data science", whatever, grow as a separate field. The statistics programs had their chance.
That may be true in some cases, but did you look at the tools in the blog post? Can statisticians be expected to write MongoDB code, create a web scraper, and make interactive visualizations in D3?
Title inflation exists, but there is a real-world role here that isn't really captured by "statistician" at all.
I think it's great that students and young professors in the sciences are taught to code now. I've even taught some of them.
To me, data science is more than understanding statistics; it has been essential to know how to scale the analysis up and out.
If you're a domain scientist, you won't necessarily learn how to write reusable tools that are performant (or runnable) on data that is different from your initial model data. I once worked with a group whose model had grown so unwieldy that their config file was in NetCDF.
I found my niche was often in doing things that were slightly (or completely) outside the comfort zone of most domain scientists. They were competent coders themselves, but they had neither the funded time nor the inclination to learn things like database, visualization, and networking technologies that became necessary either to share their work with other research groups or to operate on larger datasets.
One project had me take a big model that was normally run twice a day and on a 4km grid and help write something that could run and visualize the results of the same thing on a 0.5km grid over a larger area and hourly. And then devise something that could help them visually explore the timeseries as it evolved, sometimes over months.
Designing the pipeline that can handle that is outside the scope of most scientists, even the ones who are good coders.
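To put rough numbers on that jump, here is a back-of-the-envelope sketch. It assumes a 2D grid over the same area and one output per run, which are my simplifications and not details from the project; the function name is made up for illustration.

```python
def output_scale_factor(old_km, new_km, old_runs_per_day, new_runs_per_day):
    """Rough growth in output volume for a 2D grid over a fixed area.

    Assumes (hypothetically) one output per run and the same variables
    per grid cell before and after.
    """
    cells = (old_km / new_km) ** 2            # finer spacing in both x and y
    runs = new_runs_per_day / old_runs_per_day
    return cells * runs


# 4 km grid twice a day -> 0.5 km grid hourly
factor = output_scale_factor(4.0, 0.5, 2, 24)
print(factor)  # 768.0
```

That is 64 times the grid cells and 12 times the runs, before the larger area is even counted, which is why the storage and visualization side stops being an afterthought.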
That line you're talking about sounds more like the traditional science/engineering divide. Maybe statisticians are data scientists, but what we call "data science" is really data engineering?
Actually, I’ve noticed a meaningful distinction between people who learned statistics from machine learning (and are more likely to call each other data scientists) and statisticians (the least experimental of whom used to go by the title analyst): what to do when there is either too little data or data that is too noisy. Interestingly, both groups are happy to be called data scientists, but in my experience, they rarely meet.
A traditionally trained statistician would call that a negative result, decide not to use the model, and support keeping the pre-existing approach. A machine learning expert might not care and apply the coefficients out of the model as is, because they are presumably closer than a guess; that expert is also more likely to be openly skeptical of human expertise.
That has led to some frustrating situations for me: I would argue that we should censor things like negative speeds, and be told that there was no problem because the results were regularised anyway. Building and picking proper factors to use in a regression is something you can partially get away with skipping once you have larger databases and back-propagation can take over; before that, insight still matters.
I have not met many who can articulate that transition effectively.
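To make the negative-speed argument concrete, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: the toy numbers, the one-feature ridge fit without an intercept, the penalty `lam = 1.0`, and reading "censor" as simply dropping physically impossible readings.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge fit of y ~ slope * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical toy data: the true relation is speed = 10 * x,
# but one sensor glitch reports a negative speed.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
speeds = [10.0, 20.0, -30.0, 40.0, 50.0]

lam = 1.0  # arbitrary penalty, chosen only for the example
raw_slope = ridge_slope(xs, speeds, lam)

# Drop readings that cannot be real speeds before fitting.
kept = [(x, y) for x, y in zip(xs, speeds) if y >= 0.0]
clean_slope = ridge_slope([x for x, _ in kept], [y for _, y in kept], lam)

print(f"ridge on raw data:     {raw_slope:.2f}")    # 6.61
print(f"ridge after filtering: {clean_slope:.2f}")  # 9.79
```

The penalty shrinks the coefficient toward zero in both fits, but it has no way of knowing that a negative speed is impossible. Only the domain check recovers something close to the true slope of 10.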
It seems that you’ve met mostly the second category; they are possibly the larger group, but not necessarily the most influential. There is a core of people who are meaningfully different. The linked article seems to be from someone in between but closer to the second group.
More like 'analyst' in how easily it is thrown around. Calling a built-in function in Python or R is just about equivalent to calling one in Excel. Sure, you can claim that folks need to know more about what is going on, but honestly, how many have actually gone through the work of deriving the functions they're calling to begin with?
I'm wondering how useful deriving functions yourself is in the age of computers. I feel like knowing axioms about the mathematical structure you're dealing with and how to do proofs is very important, but it always struck me as odd that we're still stepping through complex applied maths functions manually with pen and paper. Programmers don't bother, say, writing our own hashtable implementation more than a handful of times in our lives, do we? Does forgetting how to derive hashtables mean we won't know how to use them effectively?
Genuine question - more than happy to be proven wrong.
>stepping through complex applied maths functions manually in pen and paper.
We do that because:
A. It helps us understand them better.
B. It teaches us how to think, the way Feynman said "Know how to solve every problem that has been solved". Granted, it seems pointless to work through what is easily accessible through a machine, BUT it teaches how to solve new problems. I wouldn't consider using NumPy or Matlab as the first step towards solving a new math problem.
It's like using Assembly vs using a higher level programming language.
Completely agree. There's a lot of nuance in these algorithms. They're not as cut and dried as simply calling a package method, and oftentimes they aren't optimized for your use case. I work in Machine Learning, specifically on NLP, and it is really obvious when interviewing potential employees who knows what SVD means and who just knows the NumPy function. Most "data scientists" I've interviewed fall into the latter category.
Edit: this is of course completely anecdotal experience.
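As one illustration of what "knowing what SVD means" could look like, here is a sketch that gets the singular values of a 2x2 matrix without any library call, using the fact that they are the square roots of the eigenvalues of AᵀA. The function name and the example matrix are mine, and this is a hand calculation written out in code, not how NumPy computes it.

```python
import math


def singular_values_2x2(a):
    """Singular values of a 2x2 matrix, largest first."""
    (p, q), (r, s) = a
    # Gram matrix G = A^T A, symmetric: [[g11, g12], [g12, g22]]
    g11 = p * p + r * r
    g12 = p * q + r * s
    g22 = q * q + s * s
    # Eigenvalues of a symmetric 2x2 matrix in closed form
    mean = (g11 + g22) / 2.0
    spread = math.sqrt(((g11 - g22) / 2.0) ** 2 + g12 ** 2)
    lam_max = mean + spread
    lam_min = max(mean - spread, 0.0)  # guard against tiny negative rounding
    return math.sqrt(lam_max), math.sqrt(lam_min)


A = [[3.0, 0.0], [4.0, 5.0]]
s1, s2 = singular_values_2x2(A)
print(s1, s2)
```

Two sanity checks fall out of the definition: the product `s1 * s2` equals |det A| = 15, and `s1**2 + s2**2` equals the sum of squared entries, 50. `np.linalg.svd(A, compute_uv=False)` should return the same two values.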
I suppose my real question is: how many times do we need to do it? Once we have stepped through it with pen and paper, or derived the result, how many times do we need to keep doing it? My experience is that mathematicians will do this again and again and again.
I agree. A smart data scientist doesn't waste their time reinventing the wheel: they build off the hard work of others. When necessary they can create what is needed, but they don't do so typically.
They are both more and less, in my experience, than statisticians (more flexible and solution-oriented, less rigorous and classical), than analysts (they can do more, in general, but a great analyst will be better at analysing and visualizing), and than developers (they know more stats, less software engineering, and have great patience for wrestling data into submission). I like to think of data scientists as people who combine the skills of all the above to solve hard problems which exceed the domain of any one specialty (analyst, statistician, developer). It doesn't mean we're amazing at everything, just that we are effective, flexible problem solvers.
And for the record, machine learning, statistical modeling, and data mining are just a small portion of the pie. Being good at modeling and machine learning will not remotely guarantee success as a data scientist.
I respectfully disagree. While I understand where you're coming from, I don't agree with your distinction between an analyst and a scientist. Given the data scientist's typical compensation and expected experience, there should be a higher bar set for them, one that does include developing solutions from scratch. I understand the use of utilities, but far too frequently I find that people who rely on packages to do their work don't really understand what they're working on (they often don't realize the underlying assumptions that the package writers made for them either). Going by your description of the tasks for a data scientist, I would label this as a Data Analyst's work if I were hiring one.
I could of course be wrong and have a bit too narrow of a view from my particular subfield.
That's only part of it. A good data scientist is also good because they know how to answer hard questions.
In those situations math isn't "shitty trivia," but instead a tool to be leveraged against those hard questions.
You can consider the derivation of SVD to be shitty trivia while throwing np.linalg.svd around while engineering features. That's fine! Good luck visualizing that data in a meaningful way, or dealing with non-linear data, if you're ignoring that "shitty trivia."
[0] http://bactra.org/weblog/925.html