by stcredzero 2960 days ago
The field is absolutely saturated with people who want to be data scientists but have no experience. This is where some of that gatekeeping comes from.

What are the data science "gotchas"? A lot of people can pick up basic programming in a weekend, but they wouldn't necessarily know what they don't know, and might well get deeply mired in problems with concurrency or algorithmic complexity.

It's "gotchas" like these that justify gatekeeping. Otherwise, gatekeeping is just unproductive manipulation of the market.


There are largely three branches of data science jobs, each with their own typical gotchas.

1) Data engineering. I suck at this, don't ask me.

2) Inference. One big gotcha is not accounting for all the sources of variation in your estimator and thinking you have something when you don't (this often comes from unaccounted-for correlation in time, in space, or across repeated measures). Another is that correlation isn't causation, which pops up in surprising ways. A third is things not being as independent as you thought.

3) Prediction/classification. Gotchas take as many forms as the things you look at, but the bird's-eye view is that you apply a method and it works OK, but either not well enough, or you then try it in the real world and it doesn't generalize as well as it did on your test set. The ways models break down depend heavily on the model and the data, so diagnosing and fixing the issue depends on both understanding your toolkit really well and understanding the context of the data (business logic, etc.). Another gotcha is understanding the uncertainty of your predictions. If I predict that this word is a noun, how sure am I of that? Many beginners skip those kinds of questions without realizing it.
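To make the repeated-measures gotcha from 2) concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: 20 hypothetical subjects measured 50 times each, a per-subject offset with standard deviation 1.0, measurement noise with standard deviation 0.5, and the made-up names `simulate_study` and `standard_error`. It compares the textbook standard error, which pretends all 1000 points are independent, with one computed from per-subject means, and checks both against how much the grand mean actually moves across simulated studies.

```python
import random
import statistics

def simulate_study(n_subjects, n_repeats, rng):
    """Simulate repeated measures: each subject has its own offset."""
    observations, subject_means = [], []
    for _ in range(n_subjects):
        subject_offset = rng.gauss(0.0, 1.0)  # shared by every repeat of this subject
        values = [subject_offset + rng.gauss(0.0, 0.5) for _ in range(n_repeats)]
        observations.extend(values)
        subject_means.append(statistics.fmean(values))
    return observations, subject_means

def standard_error(values):
    """Textbook SE of the mean; only valid if the values are independent."""
    return statistics.stdev(values) / len(values) ** 0.5

rng = random.Random(0)
observations, subject_means = simulate_study(20, 50, rng)

naive_se = standard_error(observations)       # pretends there are 1000 independent points
clustered_se = standard_error(subject_means)  # treats the 20 subjects as the units

# Reference: how much the grand mean really varies across many simulated studies.
grand_means = [statistics.fmean(simulate_study(20, 50, rng)[0]) for _ in range(300)]
empirical_se = statistics.stdev(grand_means)

print(f"naive SE:     {naive_se:.3f}")
print(f"clustered SE: {clustered_se:.3f}")
print(f"empirical SE: {empirical_se:.3f}")
```

Under these assumptions the naive figure should come out several times smaller than the other two, which is exactly the "thinking you have something when you don't" failure: the confidence interval looks tight only because correlated repeats were counted as independent evidence.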

I'm a data scientist with (barely) a bachelor's in physics, working mostly with PhDs, and while the academic degree-based gatekeeping is bad and frustrates the shit out of me, I get why it's there. The investment needed to learn the basics is dwarfed by the investment needed to flexibly apply the right things at the right times and tweak/fix them as appropriate.

I mean, in my graduate ML classes the first "homework" assignment was always just a problem set of probability and distribution questions. They were simple questions, but you'd be surprised how many people dropped the class after that first assignment.

There are a lot of people out there who have memorized how to implement k-means and PCA but would absolutely struggle to understand what they're actually doing or to interpret the results in a meaningful way. A HUGE part of being a data scientist is presenting information in a useful way. That's why PhDs are favored: with their experience writing grants to fund their research, they're exactly the type of people who can take a naive problem, work it to a result, and then sit in front of a boardroom of non-technical people and explain why their result was worth the money that was given to them.

> That's why PhDs are favored: with their experience writing grants to fund their research, they're exactly the type of people who can take a naive problem, work it to a result, and then sit in front of a boardroom of non-technical people and explain why their result was worth the money that was given to them.

My wife has a PhD in comparative lit, but she now works in banking. Her PhD gave her a superpower: Reading. She can quickly absorb large amounts of text with abstruse, complex, and subtle distinctions and tell you very nitpicky things about it. She thinks financial/banking regulations are light reading in comparison to the stuff she waded through to get her PhD. (It's also quite surprising: the number of C-level people in banking who have little patience for reading. She's won a number of boardroom battles because she has actually read things.)

I think training in literary analysis is a secret superpower for life. It gives people tools to make reliable inferences from text about motivations, assumptions, biases, etc., and to spot attempts to use phrasing to hide or obscure things. Great for reading emails, contracts, reports, etc.--and for editing your own writing for clarity (or not, if that's what you're going for...).
> I think training in literary analysis is a secret superpower for life

what exactly do you mean by literary analysis here? i have (in my opinion) extremely good reading comprehension skills, in that i can read and understand the literal meaning of almost any text (provided i understand the context), and i got an 800 on the critical reading section of the SAT. on the other hand, i can't for the life of me read a book and pick out any of the major themes without having them spoon-fed to me. i was always terrified when i was expected to have my own opinion about a text to use as the topic for a paper.

The surface area of "gotchas" in data science is much larger than in software engineering because you have all the "gotchas" from programming/software engineering AND all the new "gotchas" from data science.

For data science the big one is overfitting, which everyone talks about, but which can happen in really insidious ways in production. You have to be very disciplined and careful with the data to prevent overfitting.
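As a rough illustration of what overfitting looks like in numbers, here is a small standard-library sketch. The setup is entirely made up: one feature `x`, a true boundary at 0.5, and 20% of labels flipped at random. A 1-nearest-neighbor model memorizes the training set, noise included, while a simple threshold rule (which, for the sake of the example, is assumed to already know the true boundary) does not.

```python
import random

def make_data(n, rng, noise=0.2):
    """x is uniform on [0, 1); the label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data

def one_nn_predict(train, x):
    """Memorizing model: copy the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def threshold_predict(x):
    """Simple model: a single cut at 0.5 (assumed known for this illustration)."""
    return int(x > 0.5)

def accuracy(predict, data):
    return sum(predict(x) == label for x, label in data) / len(data)

rng = random.Random(1)
train, test = make_data(500, rng), make_data(2000, rng)

nn_train_acc = accuracy(lambda x: one_nn_predict(train, x), train)
nn_test_acc = accuracy(lambda x: one_nn_predict(train, x), test)
simple_test_acc = accuracy(threshold_predict, test)

print(f"1-NN train accuracy:      {nn_train_acc:.3f}")
print(f"1-NN test accuracy:       {nn_test_acc:.3f}")
print(f"threshold test accuracy:  {simple_test_acc:.3f}")
```

The memorizing model scores perfectly on the data it was fitted to and noticeably worse on fresh data, while the simpler rule holds up better. In production the same gap can be harder to see, for example when the evaluation data has quietly overlapped with the training data.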

Another big one is productionizing data science, which in my opinion most data scientists don't have a ton of experience with.

The actual training of the models isn't that hard; it's making it work with the crappy data that exists in the real world and putting it into production that are the really hard parts.

I think the obvious "gotchas" are problem definition (am I formulating the problem in a way that will allow me to create value? A concrete example: am I modeling churn correctly?), overfitting, target leaks, and model troubleshooting / improvement (i.e. the model is doing OK; can it do better? How much better? How do we get there? Remember that small performance gains can mean big $ at scale). On the reporting side: how confident am I that what I'm reporting is real? This is where the "science" training is helpful. Programming experience is relevant in the sense that implementation is important, i.e. it's far too easy to introduce critical target leak bugs when engineering features.
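One way a target leak can sneak in during feature engineering, sketched with the standard library only. The scenario is an assumption chosen for illustration: a target-encoded categorical feature (each category replaced by the mean label seen for it), where the buggy version computes those means over all rows, including the held-out ones. The data is pure noise on purpose, so any accuracy above a coin flip comes from the leak. The names `fit_target_encoding`, `leaky_encoding` and `clean_encoding` are hypothetical.

```python
import random
from collections import defaultdict

def fit_target_encoding(rows):
    """Map each category to the mean label observed in `rows`."""
    totals, counts = defaultdict(int), defaultdict(int)
    for category, label in rows:
        totals[category] += label
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in counts}

def accuracy(encoding, rows):
    """Predict 1 when the encoded feature exceeds 0.5; unseen categories get 0.5."""
    hits = sum(
        int(encoding.get(category, 0.5) > 0.5) == label
        for category, label in rows
    )
    return hits / len(rows)

rng = random.Random(2)
# Pure noise: the category carries no information about the label.
rows = [(rng.randrange(500), rng.randint(0, 1)) for _ in range(2000)]
train, test = rows[:1000], rows[1000:]

leaky_encoding = fit_target_encoding(rows)   # bug: held-out labels feed the feature
clean_encoding = fit_target_encoding(train)  # fitted on training rows only

leaky_acc = accuracy(leaky_encoding, test)
clean_acc = accuracy(clean_encoding, test)

print(f"held-out accuracy with leak:    {leaky_acc:.3f}")
print(f"held-out accuracy without leak: {clean_acc:.3f}")
```

With the leak, the "model" appears to beat chance on data that has no signal at all; without it, accuracy falls back to roughly a coin flip. The two versions differ by a single argument, which is why this kind of bug is so easy to ship.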

Of course, we can abstract the root argument: for a given job, among those qualified to fill it, there exists at least one person who is self-taught in the skills required to perform the job. This is probably true.