by stillsut 3443 days ago
The roles of statistician and data scientist are not substitutes but more like complements. This guy definitely is a data scientist. Here's some ways to tell:

- Works on non-mission-critical components, e.g. he's not doing the statistics for when the wing will fall off your airplane, but he can help you figure out business problems that are more open to interpretation, such as subject-line open rates.

- His publishing tools favor flair over convention, e.g. Ctrl+F for "latex" has zero results, but he does have D3, C3, Bokeh; surprisingly, no Tableau.

- Not sure he even references a single classical statistics package. The vast majority of people publishing in social sciences or "old school" life sciences are using Minitab, JMP, R, or SAS (correct me if I'm wrong, please, it's an outsider's perspective).

This skillset is neither inherently "cutting edge!" nor deceptively "all talk, no walk". They really are completely different roles that happen to use some of the same tools, formulas, and jargon. To cut to the heart of it: when a company builds a plane and asks "I wonder how unlikely it would be for the wing to fall off?", that creates the demand for a statistician. When a company is trying to out-compete others, or maximize profit or charitable effectiveness, often in a service or field that is heavily influenced by human psychology, that creates the potential for a data scientist to add value.


I knew I was forgetting packages. I do in fact use Tableau. Will add it. Thanks for the catch!

As for LaTeX, it would never have occurred to me to add it. I have no idea why, but it wouldn't. Maybe because it feels more like a chore than a tool. It's like an anti-tool. I mean, I do or did in the recent past use LaTeX, but in more recent years I would farm that out to someone junior to me who hadn't worked with it for long enough to prefer pouring bleach in their ears to being faced with tweaking one more broken LaTeX template.

I probably should include classical stats packages. They really should go in here. But I've been coding since I was a kid and typically eschewed classical stats and math packages because of my perception that they were slow walled-gardens, and that as soon as I had a method figured out in Matlab or SPSS I'd end up rewriting it in C, C++, or Java to make it work with other things or at scale. That was hammered home in the first company I worked with where we did modeling in SAS and then rewrote every model in Java because SAS couldn't keep up.

I'm not suggesting that classical stats packages aren't data scientists' tools. I think they are. They're just not my tools because of the curious niche I found myself in.

I think my job is similar to yours. My background is in engineering at an industrial manufacturing plant.

I have some of the same issues. The engineers here tend to reach for spreadsheets first (or Access databases, which are everywhere at my work), and inevitably they run into scaling problems and end up with a huge bloated mess. I step in to re-architect these monstrosities (using "real" databases when necessary).

The other big part of my day-to-day work is modelling and data analysis. Usually regression-based stuff and LP optimization problems (SAS is very good for this), especially around yield and quality control. The venerable Excel "Solver" plugin is often abused very heavily by engineers and is not always the ideal solution.

The person I took over from was a stats guy and the original job title was "Process Statistician"; my boss has since retitled my role "Data Management Engineer". I still think of myself as an engineer first and foremost and a "data" person second.

I use SAS heavily. We have kind of gone in the opposite direction to you. I have rewritten some of our models in the past from C++ into SAS, mostly for ease of maintenance, because SAS is better understood by the non-programmers (most of the engineers here do not have a programming/CS background, and those that do tend to know either Fortran or Visual Basic; very few grasp C/C++ very well). Speed is not really an issue, but opaqueness and ease of maintenance are.

I'd like to learn R because I have heard it is very similar to SAS but more transferable to outside companies. Julia is the other language I've got my eye on; I have heard it is somewhat similar to MATLAB, which is used for some modelling work here.

sometimes i write Python packages to auto-populate TeX files. like, imagine running LDA with 50 topics and showing how each topic (via word cloud) correlates to an outcome variable

then it starts to become a tool :)
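
For anyone curious what that could look like, here is a minimal sketch, assuming the topic model has already been fit and the word clouds are already saved as images. The field names, file paths, and numbers are all made up for illustration, and none of this is the poster's actual package. It uses only the standard library's string.Template, with "@" as the placeholder marker so it doesn't collide with LaTeX's "$":

```python
from string import Template


class TexTemplate(Template):
    # Use "@" for placeholders so LaTeX's own "$" and "{}" pass through untouched.
    delimiter = "@"


# Hypothetical per-topic fragment; the field names are illustrative only.
TOPIC_BLOCK = TexTemplate(r"""
\subsection*{Topic @topic_id}
\includegraphics[width=0.6\linewidth]{@image_path}

Top words: @top_words \\
Correlation with outcome: $r = @corr$
""")


def escape_tex(text):
    """Escape a few characters that LaTeX treats specially."""
    for ch in "&%#_":
        text = text.replace(ch, "\\" + ch)
    return text


def render_report(topics):
    """Build a complete .tex document from a list of topic dicts."""
    body = "".join(
        TOPIC_BLOCK.substitute(
            topic_id=t["id"],
            image_path=t["wordcloud"],
            top_words=escape_tex(", ".join(t["words"])),
            corr=f"{t['corr']:+.2f}",
        )
        for t in topics
    )
    return (
        "\\documentclass{article}\n"
        "\\usepackage{graphicx}\n"
        "\\begin{document}\n"
        f"{body}"
        "\\end{document}\n"
    )


# Placeholder values, not real model output.
example_topics = [
    {"id": 1, "wordcloud": "clouds/topic_01.png",
     "words": ["price", "discount"], "corr": 0.31},
    {"id": 2, "wordcloud": "clouds/topic_02.png",
     "words": ["late", "re_fund"], "corr": -0.12},
]

tex_source = render_report(example_topics)
```

Writing tex_source to a file and running it through pdflatex would be the next step. With 50 topics the list just gets longer and the template stays the same, which is roughly the point where hand-editing stops being an option.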