| HN Mirror



	by mordant 3443 days ago
	'Data scientist' is just title inflation by statisticians.

4 comments

pjmorris 3443 days ago

Some say [0] it's title deflation for statisticians.

[0] http://bactra.org/weblog/925.html


nonbel 3443 days ago

"Statisticians" taught everyone NHST, and relegated bayesian probability to the appendix for decades. Once you realize what has happened there, you will view that title with very little respect.

I am glad to see machine learning, AI, "data science", whatever, grow as a separate field. The statistics programs had their chance.


paulgb 3443 days ago

That may be true in some cases, but did you look at the tools in the blog post? Can statisticians be expected to write MongoDB code, create a web scraper, and make interactive visualizations in D3?

Title inflation exists, but there is a real-world role here that isn't really captured by "statistician" at all.


ianai 3443 days ago

If you're in a statistics program you're going to learn to code. That's been my experience anyway.


jeffheard 3443 days ago

I think it's great that students and young professors in the sciences are taught to code now. I've even taught some of them.

To me, data science is more than understanding statistics; it has been essential to know how to scale them up and out.

If you're a domain scientist, you won't necessarily learn how to write reusable tools that are performant (or runnable) on data that is different from your initial model data. I once worked with a group whose model had grown so unwieldy that their config file was in NetCDF.

I found my niche was often in doing things that were slightly (or completely) outside the comfort zone of most domain scientists who were competent coders themselves, but who didn't have the funded time nor the inclination to learn things like database, visualization, and networking technologies that became necessary either to share their work with other research groups or to operate on larger datasets.

One project had me take a big model that was normally run twice a day and on a 4km grid and help write something that could run and visualize the results of the same thing on a 0.5km grid over a larger area and hourly. And then devise something that could help them visually explore the timeseries as it evolved, sometimes over months.
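
A rough back-of-the-envelope sketch of that jump, ignoring the larger area (its size isn't given here) and assuming cost scales with grid cells times runs per day:

```python
# Rough scale-up estimate; assumes cost grows with (grid cells) x (runs per day).
# The "larger area" mentioned above is left out because its size is unknown.
old_grid_km, new_grid_km = 4.0, 0.5
old_runs_per_day, new_runs_per_day = 2, 24

cells_factor = (old_grid_km / new_grid_km) ** 2  # finer spacing in both horizontal directions
runs_factor = new_runs_per_day / old_runs_per_day
total_factor = cells_factor * runs_factor

print(f"{cells_factor:.0f}x cells, {runs_factor:.0f}x runs, {total_factor:.0f}x overall")
# prints "64x cells, 12x runs, 768x overall"
```

Even before the extra area and the visualization work, that is nearly three orders of magnitude more output to compute, store, and explore.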

Designing the pipeline that can handle that is outside the scope of most scientists, even the ones who are good coders.


sbov 3443 days ago

That line you're talking about sounds more like the traditional science/engineering divide. Maybe statisticians are data scientists, but what we call "data science" is really data engineering?


bertil 3443 days ago

Actually, I’ve noticed a meaningful distinction between people who learned statistics from machine learning (and are more likely to call each other data scientists) and statisticians (the least experimental of whom used to go by the title analyst): what to do when there is either too little data or data that is too noisy. Interestingly, both groups are happy to be called data scientists, but in my experience they rarely meet.

A traditionally trained statistician would point to the negative result, decide not to use the model, and support keeping the pre-existing approach. A machine learning expert might not care and apply the coefficients out of the model as is, because they are presumably closer than a guess; that expert is also more likely to be openly skeptical of human expertise.

That has led to some frustrating situations for me: I argued that we should censor things like negative speeds, and was told that there was no problem because the results were regularised anyway. Building and picking proper factors to use in regression is something you can partially get away with skipping once you have larger databases and back-propagation can take over; before that point, insights still matter.
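
A toy illustration of the negative-speeds point, using made-up numbers and a hypothetical one-feature ridge regression (none of this comes from the actual project). Regularization shrinks the coefficient, but it does not undo the pull of a physically impossible record:

```python
def ridge_slope(xs, ys, lam):
    """Slope of a no-intercept ridge fit y ~ w*x: w = sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def split(rows):
    """Turn a list of (x, y) pairs into two parallel lists."""
    return [r[0] for r in rows], [r[1] for r in rows]


# Hypothetical records: (speed in km/h, distance covered in half an hour, in km).
# The last record is an invented sensor glitch with a negative speed.
records = [(20.0, 10.0), (40.0, 20.0), (60.0, 30.0), (-50.0, 25.0)]

all_x, all_y = split(records)
clean_x, clean_y = split([r for r in records if r[0] >= 0])  # drop negative speeds

w_raw_ridge = ridge_slope(all_x, all_y, lam=100.0)  # regularised fit on uncleaned data
w_clean = ridge_slope(clean_x, clean_y, lam=0.0)    # plain fit after cleaning

print(f"ridge on raw data: {w_raw_ridge:.3f}, plain fit on cleaned data: {w_clean:.3f}")
```

In this invented data the true slope is 0.5 km per km/h. Only the cleaned fit recovers it; the regularised fit on the raw data lands far below.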

I have not met many who can articulate that transition effectively.

It seems that you’ve met mostly the second category; they are possibly the larger group, but not necessarily the most influential. There is a core of people who are meaningfully different. The linked article seems to be from someone in between but closer to the second group.


thinkr42 3443 days ago

More like 'analyst' in how easily it is thrown around. Calling a built-in function in Python or R is just about equivalent to calling one in Excel. Sure, you can claim that folks need to know more about what is going on, but honestly, how many have actually gone through the work of deriving the functions they're calling to begin with?


lacampbell 3443 days ago

I'm wondering how useful deriving functions yourself is in the age of computers. I feel like knowing axioms about the mathematical structure you're dealing with and how to do proofs is very important, but it always struck me as odd that we're still stepping through complex applied maths functions manually in pen and paper. Programmers don't bother, say, writing our own hashtable implementation more than a handful of times in our lives, do we? Does forgetting how to derive hashtables mean we won't know how to use them effectively?

Genuine question - more than happy to be proven wrong.


curiousgal 3443 days ago

>stepping through complex applied maths functions manually in pen and paper.

We do that because:

A. It helps us understand them better.

B. It teaches us how to think, the way Feynman said: "Know how to solve every problem that has been solved". Granted, it seems pointless to work through what is easily accessible through a machine, BUT it teaches how to solve new problems. I wouldn't consider using NumPy or Matlab as the first step towards solving a new math problem.

It's like using Assembly vs using a higher level programming language.


thinkr42 3443 days ago

Completely agree. There's a lot of nuance in these algorithms; they're not as cut and dried as simply calling a package method, and oftentimes they aren't optimized for your use case. I work in machine learning, specifically on NLP, and it is really obvious when interviewing potential employees who knows what SVD means and who just knows the NumPy function. Most "data scientists" I've interviewed fall into the latter category.

Edit: this is of course completely anecdotal experience.


lacampbell 3443 days ago

I suppose my real question is - how many times do we need to do it? Once we have stepped through it by pen and paper once, or derived the result, how many times do we need to keep doing it? My experience is that mathematicians will do this again and again and again.


laughfactory 3443 days ago

I agree. A smart data scientist doesn't waste their time reinventing the wheel: they build off the hard work of others. When necessary they can create what is needed, but they don't do so typically.

They are both more and less, in my experience, than statisticians (more flexible and solution-oriented, less rigorous and classical), than analysts (they can do more, in general, but a great analyst will be better at analysing and visualizing), than developers (they know more stats, less software engineering, and have great patience for wrestling data into submission). I like to think of data scientists as people who combine the skills of all the above to solve hard problems which exceed the domain of any one specialty (analyst, statistician, developer). It doesn't mean we're amazing at everything, just that we are effective, flexible problem solvers.

And for the record, machine learning, statistical modeling, and data mining are just a small portion of the pie. Being good at modeling and machine learning will not remotely guarantee success as a data scientist.


thinkr42 3443 days ago

I respectfully disagree. While I understand where you're coming from, I don't agree with your distinction between an analyst and a scientist. Given the data scientist's typical compensation and expected experience, there should be a higher bar set for them that does include developing solutions from base. I understand the use of utilities, but far too frequently I find people who rely on packages to do their work don't really understand what they're working on (they often don't realize the underlying assumptions that the package writers made for them either). With your description of the tasks for a data scientist, I would label this as a Data Analyst's work if I was hiring one.

I could of course be wrong and have a bit too narrow of a view from my particular subfield.


searine 3443 days ago

>how many have actually gone through the work of deriving the functions they're calling to begin with?

Why would you waste your time re-inventing a wheel?

A good data scientist isn't good because he/she can ace shitty trivia, he/she is good because they know the right question to ask.


achompas 3443 days ago

That's only part of it. A good data scientist is also good because they know how to answer hard questions.

In those situations math isn't "shitty trivia," but instead a tool to be leveraged against those hard questions.

You can consider the derivation of SVD to be shitty trivia while throwing np.linalg.svd around while engineering features. That's fine! Good luck visualizing that data in a meaningful way, or dealing with non-linear data, if you're ignoring that "shitty trivia."
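
For readers who want to see what sits behind the library call, here is a minimal pure-Python sketch of recovering the largest singular value and its vectors by power iteration on A^T A. This is not how NumPy computes it (NumPy delegates to LAPACK routines); the helper names and the example matrix are illustrative assumptions.

```python
import math


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def top_singular_triplet(a, iterations=100):
    """Return (u, sigma, v) for the largest singular value of a.

    Repeatedly applying A^T A to a vector pulls it toward the dominant
    right singular vector v; then sigma = |A v| and u = A v / sigma.
    Assumes the start vector is not orthogonal to v.
    """
    at = transpose(a)
    v = [1.0] + [0.0] * (len(a[0]) - 1)
    for _ in range(iterations):
        w = matvec(at, matvec(a, v))  # one application of A^T A
        n = norm(w)
        v = [x / n for x in w]
    av = matvec(a, v)
    sigma = norm(av)
    u = [x / sigma for x in av]
    return u, sigma, v


# Example matrix chosen so the answer is easy to check by hand:
# A^T A = [[25, 20], [20, 25]] has eigenvalues 45 and 5.
A = [[3.0, 0.0], [4.0, 5.0]]
u, sigma, v = top_singular_triplet(A)
print(sigma, math.sqrt(45.0))  # the two numbers should agree closely
```

The singular values are the square roots of the eigenvalues of A^T A, which is the link to PCA and to low-rank approximation that a bare function call hides.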


marketforlemmas 3443 days ago

> dealing with non-linear data

What is non-linear data?


bigger_cheese 3443 days ago

Data derived from non linear inputs.

That is to say problems that can't be expressed by linear functions.

E.g. y = mx + b is a linear function.

y = ax^2 + bx + c is a quadratic (non linear) function.

Linear Programming (LP) involves optimizing a linear objective subject to linear constraints (something like Excel's Solver can do this).

When you are dealing with non linear functions you need to use a method such as Sequential Quadratic Programming (SQP).
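
One way to make the distinction concrete, as a small sketch with made-up sample points: fit a straight line by ordinary least squares to data generated from a linear function and from a quadratic one, then compare the leftover error. The function name `fit_line` and the data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b; returns (m, b, sum of squared residuals)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return m, b, sse


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
linear_ys = [2.0 * x + 1.0 for x in xs]   # y = 2x + 1
quadratic_ys = [x ** 2 for x in xs]       # y = x^2

m_lin, b_lin, sse_lin = fit_line(xs, linear_ys)
m_quad, b_quad, sse_quad = fit_line(xs, quadratic_ys)

print(f"linear data:    slope={m_lin}, intercept={b_lin}, squared error={sse_lin}")
print(f"quadratic data: slope={m_quad}, intercept={b_quad}, squared error={sse_quad}")
```

The line reproduces the linear data exactly, while the quadratic data leaves a residual that no choice of slope and intercept can remove.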


q_revert 3442 days ago

Using a term like nonlinear science is like referring to the bulk of zoology as the study of non-elephant animals.

— Stanislaw Ulam

https://en.wikipedia.org/wiki/Nonlinear_system
