Hacker News
by resolaibohp 2960 days ago
When my company posted a data science job, we received something like 300 applications on the first day. Most were from people with no experience but a fresh degree or bootcamp certificate, plus all the other people like you who want to change careers from software and have portfolios of data science projects.

The field is absolutely saturated with people who want to be data scientists but have no experience. This is where some of that gatekeeping comes from.

The people who have it easy are the ones with an MS or PhD and years of experience doing data science work at companies under their belt. There are very few of these people right now.

There is this idea that data science is needed everywhere and there is a HUGE supply of jobs. As an example, if you search for data scientist jobs on glassdoor.com in San Francisco, there are ~2000 jobs. If you search for software engineer in San Francisco, there are ~9000. Similar ratios can be found in any major tech city. Data science does not scale like software engineering in companies, but the narrative out there is that this is the job to be in and there is this huge unmet need. It is all hype.

4 comments

> The field is absolutely saturated with people who want to be data scientists but have no experience. This is where some of that gatekeeping comes from.

What are the data science "gotchas?" A lot of people can pick up basic programming in a weekend, but they wouldn't necessarily know what they don't know and might well get deeply mired in problems with concurrency or algorithmic complexity.

It's such "gotchas" that justify gatekeeping. Otherwise, gatekeeping is just unproductive manipulation of the market.

There are largely three branches of data science jobs, each with their own typical gotchas.

1) Data engineering. I suck at this, don't ask me.

2) Inference. One big gotcha is not accounting for all the sources of variation in your estimator and thinking you have something when you don't (often because of unaccounted-for correlation in time or space, or repeated measures). Another is that correlation isn't causation, which pops up in surprising ways. A third is things not being as independent as you thought.

3) Prediction/classification. Gotchas take as many forms as the things you look at, but the bird's-eye view is that you apply a method and it works OK, but either not well enough, or you then try it in the real world and it doesn't generalize as well as it did on your test set. The ways models break down depend heavily on the model and the data, so the way to diagnose and fix the issue depends on both understanding your toolkit really well and understanding the context of the data (business logic, etc). Another gotcha is in understanding the uncertainties of your predictions. If I predict that this word is a noun, how sure am I of that? Many beginners skip those kinds of questions without realizing it.
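To make the repeated-measures gotcha from 2) concrete, here is a minimal sketch using only the Python standard library. All numbers are made up for illustration (20 hypothetical subjects, 50 measurements each); it is not anyone's real analysis. It compares a standard error that treats every row as independent with one that first collapses to a single mean per subject.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

N_SUBJECTS = 20  # hypothetical number of truly independent units
N_REPEATS = 50   # hypothetical repeated measurements per unit

# Each subject has its own offset. Repeats within a subject share that offset,
# so the 1000 rows carry far less information than 1000 independent draws.
data = []
for _ in range(N_SUBJECTS):
    subject_effect = rng.gauss(0.0, 1.0)
    data.append([subject_effect + rng.gauss(0.0, 0.1) for _ in range(N_REPEATS)])

all_rows = [x for subject in data for x in subject]

# Naive standard error of the mean: pretends every row is independent.
naive_se = statistics.stdev(all_rows) / math.sqrt(len(all_rows))

# Cluster-aware standard error: one mean per subject, then the usual formula.
subject_means = [statistics.mean(subject) for subject in data]
cluster_se = statistics.stdev(subject_means) / math.sqrt(N_SUBJECTS)

print(f"naive SE:   {naive_se:.3f}")
print(f"cluster SE: {cluster_se:.3f}")
```

With data shaped like this, the naive figure comes out several times smaller than the cluster-aware one, which is exactly the "thinking you have something when you don't" failure.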

I'm a data scientist with (barely) a bachelor's in physics working with mostly PhDs and, while the academic-degree-based gatekeeping is bad and frustrates the shit out of me, I get why it's there. The investment needed to learn the basics is dwarfed by the investment needed to flexibly apply the right things at the right times and tweak/fix them as appropriate.

I mean, in my graduate ML classes the first "homework" assignment was always just a problem set of probability and distribution questions. They were simple questions, but you'd be surprised how many people dropped the class after that first assignment.

There are a lot of people out there who have memorized how to implement k-means and PCA but would absolutely struggle to understand what they're actually doing or interpret the results in a meaningful way. A HUGE part of being a data scientist is presenting information in a useful way. That's why PhDs are favored: with their experience having to write grants to get funding for their research, they're exactly the type of people that can take a naive problem, work it to a result, and then sit in front of a board room of non-technical people and explain why their result was worth the money that was given to them.

> That's why PhDs are favored: with their experience having to write grants to get funding for their research, they're exactly the type of people that can take a naive problem, work it to a result, and then sit in front of a board room of non-technical people and explain why their result was worth the money that was given to them.

My wife has a PhD in comparative lit, but she now works in banking. Her PhD gave her a superpower: Reading. She can quickly absorb large amounts of text with abstruse, complex, and subtle distinctions and tell you very nitpicky things about it. She thinks financial/banking regulations are light reading in comparison to the stuff she waded through to get her PhD. (It's also quite surprising how many C-level people in banking have little patience for reading. She's won a number of boardroom battles because she has actually read things.)

I think training in literary analysis is a secret superpower for life. It gives people tools to make reliable inferences from text about motivations, assumptions, biases, etc., and to spot attempts to use phrasing to hide or obscure things. Great for reading emails, contracts, reports, etc.--and for editing your own writing for clarity (or not, if that's what you're going for...).

> I think training in literary analysis is a secret superpower for life

what exactly do you mean by literary analysis here? i have (in my opinion) extremely good reading comprehension skills, in that i can read and understand the literal meaning of almost any text (provided i understand the context), and i got an 800 on the critical reading section of the SAT. on the other hand, i can't for the life of me read a book and pick out any of the major themes without having them spoon-fed to me. i was always terrified when i was expected to have my own opinion about a text to use as the topic for a paper.

The surface area of "gotchas" in data science is much larger than in software engineering because you have all the "gotchas" from programming/software engineering AND all the new "gotchas" from data science.

For data science the big one is overfitting, which everyone talks about, but which can happen in really insidious ways in production. You have to be very disciplined and careful with the data to prevent overfitting.
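A toy illustration of why the discipline matters, assuming nothing beyond the Python standard library. The dataset is pure noise (labels are coin flips unrelated to the features) and the "model" is a hand-rolled 1-nearest-neighbor lookup, both chosen only for this sketch. The model memorizes the training rows, so its training score looks perfect while held-out data tells the real story.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def make_noise_dataset(n):
    """Features and labels are independent, so there is no real signal to learn."""
    return [((rng.random(), rng.random()), rng.randint(0, 1)) for _ in range(n)]

def predict_1nn(train, point):
    """Return the label of the closest training row (squared Euclidean distance)."""
    nearest = min(
        train,
        key=lambda row: (row[0][0] - point[0]) ** 2 + (row[0][1] - point[1]) ** 2,
    )
    return nearest[1]

def accuracy(train, rows):
    """Fraction of rows whose label matches the 1-NN prediction."""
    hits = sum(predict_1nn(train, x) == y for x, y in rows)
    return hits / len(rows)

train = make_noise_dataset(200)
test = make_noise_dataset(200)

train_acc = accuracy(train, train)  # each row is its own nearest neighbor
test_acc = accuracy(train, test)    # held-out rows expose the lack of signal

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The training accuracy is perfect by construction, while the held-out accuracy hovers around chance. The insidious production versions are the same thing in disguise: the evaluation data was touched somewhere during model building.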

Another big one is productionizing data science, which in my opinion most data scientists don't have a ton of experience with.

The actual training of the models isn't that hard. Making it work with the crappy data that exists in the real world and putting it into production are the really hard parts.

I think the obvious "gotchas" are problem definition (Am I formulating the problem in a way that will allow me to create value? A concrete example: am I modeling churn correctly?), overfitting, target leaks, and model troubleshooting / improvement (i.e. the model is doing OK, can it do better? How much better? How do we get there? Remembering that small performance gains can mean big $ at scale). On the reporting side, how confident am I that what I'm reporting is real? This is where the "science" training is helpful. Programming experience is relevant in the sense that implementation is important, i.e. it's far too easy to introduce critical target leak bugs when engineering features.
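One way a target leak can sneak in during feature engineering, sketched with hypothetical data and helper names (`fit_target_means` and `encode` are made up for this example, not from any library): encoding a category by its mean churn rate, where the mean is computed on the same rows it is then used to describe.

```python
from collections import defaultdict

def fit_target_means(categories, targets):
    """Mean target value per category, learned from the rows passed in."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}

def encode(categories, means, prior):
    """Replace each category with its learned mean, or the prior if unseen."""
    return [means.get(c, prior) for c in categories]

# Hypothetical churn data where the category is nearly unique per customer.
account_codes = ["a", "b", "c", "d"]
churned = [1, 0, 1, 0]

# Leaky: the means are fit on the very rows being encoded, so the "feature"
# is just each row's own label in disguise.
leaky = encode(account_codes, fit_target_means(account_codes, churned), prior=0.5)

# Out-of-sample: fit on the first two rows only, then encode the last two.
means = fit_target_means(account_codes[:2], churned[:2])
prior = sum(churned[:2]) / 2
held_out = encode(account_codes[2:], means, prior)

print(leaky)     # identical to the labels
print(held_out)  # falls back to the prior: the feature carried no real signal
```

A model trained on the leaky column would look superb offline and fall apart on new customers, which is why this kind of bug is a programming problem as much as a statistics problem.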

Of course we can abstract the root argument: for a given job, among those qualified to fill that job, there exists at least one person who is self-taught in the skills required to perform the job. This is probably true.

I was recently backfilling a data science position on my team and experienced the same thing. I was getting on average 5 new resumes a day, and for the most part they were from individuals who were currently in a data science MS program somewhere, with no experience.

And you thought, "what a great opportunity to hire someone who has some fundamentals and a lot of potential for growth!", and hired the one that seemed most promising?

If I'm reading between the lines of your comment correctly, I don't think that's what you mean; I think you mean that this was a problematic experience. If that's right, I don't really understand that reaction: this is a nascent field, the ready-made experienced workforce is expected to be much smaller than the demand, with most positions filled by new entrants gaining experience rather than folks who already have it. This is the fundamental challenge and simultaneously the great opportunity of running a business in a nascent field!

> The people who have it easy are the ones with an MS or PhD and years of experience doing data science work at companies under their belt. There are very few of these people right now.

Are there enough of those people to meet the demand, even recognizing that it is smaller than the overall software engineering demand? If not, it seems like the parent's point still holds.

Edit: But thanks for the perspective on the size of the market and level of hype, that's a useful data point.

Every data scientist started out with no experience. This gatekeeping will have to come down if the industry actually wants data scientists who have experience.