by imh 2960 days ago
There are largely three branches of data science jobs, each with its own typical gotchas.

1) Data engineering. I suck at this, don't ask me.

2) Inference. One big gotcha is failing to account for all the sources of variation in your estimator, so you think you have something when you don't (this often comes from unaccounted-for correlation in time, space, or repeated measures). Another is that correlation isn't causation, which pops up in surprising ways. A third is things not being as independent as you thought.
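
To make the repeated-measures point concrete, here is a rough sketch on simulated data. Everything in it is an assumption for illustration: the function names, the number of subjects, and the noise levels are made up. It compares a naive standard error that treats every measurement as independent with one computed from per-subject means.

```python
import random
import statistics


def simulate(n_subjects=30, n_repeats=20, subject_sd=1.0, noise_sd=0.5, seed=0):
    """Toy data: each subject has its own offset shared by all its measurements."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_subjects):
        offset = rng.gauss(0.0, subject_sd)  # source of within-subject correlation
        data.append([offset + rng.gauss(0.0, noise_sd) for _ in range(n_repeats)])
    return data


def naive_se(data):
    """Standard error of the mean, pretending all measurements are independent."""
    flat = [x for subject in data for x in subject]
    return statistics.stdev(flat) / len(flat) ** 0.5


def clustered_se(data):
    """Standard error of the mean using one averaged value per subject."""
    means = [statistics.mean(subject) for subject in data]
    return statistics.stdev(means) / len(means) ** 0.5


data = simulate()
print("naive SE:    ", naive_se(data))
print("clustered SE:", clustered_se(data))
```

With these settings the naive number comes out several times smaller than the per-subject one, which is exactly how you end up thinking you have something when you don't. Averaging per subject is only the simplest fix and assumes equal measurements per subject; mixed-effects models or cluster-robust errors are the usual tools for messier data.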

3) Prediction/classification. Gotchas take as many forms as the things you look at, but the bird's-eye view is that you apply a method and it works OK, but either not well enough, or you then try it in the real world and it doesn't generalize as well as it did on your test set. The ways models break down depend heavily on the model and the data, so diagnosing and fixing the issue depends on both understanding your toolkit really well and understanding the context of the data (business logic, etc.). Another gotcha is understanding the uncertainties of your predictions. If I predict that this word is a noun, how sure am I of that? Many beginners skip those kinds of questions without realizing it.
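
One way to start asking the "how sure am I?" question is a calibration check: when the model says 90%, is it right about 90% of the time? Below is a minimal sketch of expected calibration error using equal-width confidence bins. The function name and the toy inputs are hypothetical, not taken from any particular library, and this is one common diagnostic rather than the only one.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: predicted probability of the chosen label, each in [0, 1]
    correct:     truthy if that prediction turned out to be right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Toy example (made-up numbers): the model claims 90% but is right half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
print(overconfident)
```

A value near zero means the stated confidences roughly match reality on that data; a large value means the "how sure am I" numbers shouldn't be trusted as they are. The result depends on the number of bins and on how much data lands in each one, so treat it as a rough diagnostic.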

I'm a data scientist with (barely) a bachelor's in physics working mostly with PhDs and, while the academic degree-based gatekeeping is bad and frustrates the shit out of me, I get why it's there. The investment needed to learn the basics is dwarfed by the investment needed to flexibly apply the right things at the right times and tweak/fix them as appropriate.