Hacker News
by pmb 5164 days ago
"claims of severe talent shortage in Big Data http://online.wsj.com/article/SB1000142405270230472330457736... Ok... where are the high salaries (500k$ a year)? No? No real shortage."

https://twitter.com/#!/lemire/status/196245665951649793

Business has a shortage of "big data" folks in much the same way I have a "huge sailboat" shortage. Neither of us wants to pay for it. We want it, but not for the going rate. Only one of us has a media platform, though.

3 comments

The salaries are already moving north of $200k even outside of Silicon Valley and New York City and getting more expensive by the month. How high do they have to be before we have a "shortage"? The problem is not lack of money, it is that demand has greatly outstripped a finite supply.

Very high wages do not automagically create new people with the requisite skills, and this is the real bottleneck. It takes significant aptitude and years of training/experience to become useful as a "data scientist". It is not as easy as I think people are imagining. We train people with excellent raw skills where I work, usually strong applied mathematics backgrounds with natural programming skills. It is much easier than trying to find someone outside with these skills, though we do attempt outside recruitment. It still takes years to develop the people we train into good, basic data scientists.

Finally, some basic labor market economics. Just like how employers have restrictions on the number of people they are able to hire, employees have restrictions on the number of hours they are able to work. Or for the sake of this example, whether they are able to be a data scientist or not. The aspiring data scientists' restrictions all have to do with their ability. And as you point out, the cost of human capital investment in this area is very, very high.

One doesn't need to be an econometric theorist to be a data scientist, but I feel that far too few hackers truly appreciate the elegance of some standard, say, 1st year economics grad school econometric models. Things like panel models, IVs, 2SLS, GMM and even just taking seriously the basic assumptions of OLS regression -- there's a reason most econometrics classes (grad or undergrad) always start with the ~5 assumptions of OLS regression.

TL;DR Economists would make great data scientists (and better economists) if only they understood and appreciated computer science more.

In context: $200k is roughly what attorneys are paid in their 5th-6th years at large firms that service large corporations. Lawyers are in surplus right now.

In finance, talented individuals are routinely paid multiples of $200k for their work (even post-crash).

So while $200k is high for salaries generally, it certainly is not high enough to imply a shortage in a highly specialized field.

Look, this job title is at most 2 years old. How can someone have years of experience in this? OTOH, there are plenty of people with strong applied math and good programming skills.

The set of skills existed before it had a trendy job title so you can have the experience even if it was called something else. This is true of most of the people currently working as data scientists. In a similar vein, I was designing big data systems years before "big data" became a term or trendy. For any particular odd skill mix you can come up with, there are people with that skill mix who are already doing a similar job. But usually people do not intentionally build that skill mix until it becomes an official job title and career path in the eyes of the public so it is a very small pool of people.

In the case of modern data scientists, having strong applied mathematics and programming skills is about halfway to where you need to be and a good starting point. The demand has temporarily grown much faster than the convertible talent pool can develop the additional set of skills required.

I am of the opinion that if demand is high enough, companies will start hiring "halfway there" people. But this will happen only if the market grows big enough. Right now it is still a niche market where companies are cherry-picking the right candidates, it seems. At least this is the impression I get from reading this thread.

The question of the size of the market is crucial. Small labor markets are very inefficient. The number of qualified people is small, but the number of companies they can choose from is also small. It is hard to find a job when the number of companies hiring is probably less than 100.

Even more,

I don't notice an effort to expand the workforce by training or by recruitment of non-traditional workers, etc.

The contra-logical statement "99% of programming applicants are unqualified" gets a lot of play in this field. But I would suggest something like "we can make 99% of applicants look like idiots with our circus-like hiring process".

Yes, we've decided we have a shortage once we decide on five arbitrary disqualifications, expect all applicants to work 18 hours a day and start yesterday having no time to get up to speed (so experience on earlier large systems, say, is indeed not useful).

The base-level skill set is being a very good applied mathematician with some good computer science skills. This is why a lot of "data scientist" types have degrees in things like physics. A lot of the database ETL stuff can be learned.

This is the reason why I cannot be a "data scientist", despite being an expert in parallel algorithm design and with strong database ETL experience. It would require me spending a couple years studying mathematics in depth that I do not currently know. The vast majority of programmers are at least as deficient as I am in critical skills for these positions.

We train our data scientists at my company but we usually do not start with software engineers. Our feedstock is strong applied mathematicians with some programming skills because the mathematics part is by far the most difficult to train for someone who has not already been doing it for years.

>This is the reason why I cannot be a "data scientist"

Are you worried about this outcome at all? Do you see yourself playing an important role on a data team, one with fewer modeling responsibilities but more infrastructure/DB responsibilities?

I'm considering this path and would love to hear your opinion.

To be clear, I chose this outcome. I am good with mathematics but not the mathematics usually needed as a data scientist and I have relatively little interest in investing the time to learn. Being a data scientist is a great job for some people but probably not what I would choose even if I was a developer again.

There is a continuum of skill balances; some people are more "data" than "scientist" and vice versa. The most useful balance varies from job to job. There are plenty of opportunities for people who have strong skills standing up clusters even if they have relatively weak analysis and model building skills. I would not dissuade anyone from becoming a data scientist; it will pay very well for the foreseeable future, but the skill set requires real effort to acquire. At a small company there is likely opportunity to learn the trade by coming at it from the infrastructure side of things.

It is a young enough area that it should be pretty easy for talented individuals to invent a career if they apply themselves.

Thanks for the excellent summary. This line:

>I am good with mathematics but not the mathematics usually needed as a data scientist

resonates with me. If you're thinking of data science, you're facing a loooooong road of coursework (scientific computing or numerical methods, linear algebra, PGMs, machine learning, AI, possibly some optimization too) to get your foot in the door. I'm going to try, but one could spend years finishing that work.

In some ways, getting a data science gig is the opposite of getting a web developer gig. In DS you're competing with a large supply of intelligent PhDs, so credentials are very important; for web dev, your portfolio goes much further than any credentials.

That's a very good point: insisting that a single person must have skills in math, computer science and data interpretation is creating a purely arbitrary set of qualifications.

If we make a comparison with other fields, we can see that trying to find a single person who has skills in several diverse areas is not something that's usually done. For example, do companies try to hire bond traders who can implement their own trading software? Or do we insist that airline pilots or surgeons or CEOs should be able to build and repair the technology they use?

And if a company did manage to find a person who was both a good statistician and a good software developer, wouldn't the combination of responsibilities pull this person in too many directions, making it hard to focus on what they were doing? Also, it would take a lot of effort to stay current with the latest developments in both math and computer science.

If a company was too small to be able to afford to hire three full-time specialists to analyze their data, they could outsource their data crunching needs to consulting companies that specialized in this kind of work.

I also have a problem with the newly-coined term "data science". Scientists are engaged in discovering fundamental truths about the way the physical world works. I don't think that finding trends in a company's data counts as science. (I don't think that 99% of "computer scientists" are scientists either, including the professors I knew in grad school.) I liked the older term "data mining" much better, but I guess it's not trendy enough anymore.

>I don't notice an effort to expand the workforce by training or by recruitment of non-traditional workers, etc.

"Code Year" for data scientists would actually need:

-- "Code Previous Year," where everyone would need to kick ass on Bayesian inference, linear algebra, and production-level software development; then,

-- during "Code Year," they'd proceed to learn about distributed algorithms, graphical models, and HMMs, then learn about distributed frameworks like Hadoop.

See Joseph Misti's comment here for more--it's really the most accurate list of skills one needs to become a data scientist:

http://www.quora.com/What-skills-are-needed-for-machine-lear...

>We want it, but not for the going rate.

Does it take 500k$/year for someone who entered that field to come out net positive? I know that people in the US complain about the high cost of higher education, but 500k$/year for it to be a viable career path? I think saying there is a shortage is justified: if the salaries are decent (and from anecdotal evidence I know that they are), people should be made aware that this field might be worth entering and that they should look into it.

pmb is (correctly) saying that there is, by definition, no shortage of big data people. There's just a shortage at the (obviously below market clearing) price employers wish to pay. Also, you're ignoring the steep lead time to become a deep expert in stats / ML -- most likely a PhD plus significant programming time plus work experience.

>by definition, no shortage of big data people

A shortage is defined by the price being below the market equilibrium. We can debate what the equilibrium is, but I think even at the current wages people should be interested in getting into this field (I know a friend who is doing postgrad in math and interning for BI because the prospects are great); it's just that they can't get in fast enough. Therefore, IMO, there is a temporary shortage.

>most likely a PhD plus significant programming time plus work experience.

Is it normal to expect 500k$/year for that experience in some other field? I would sure like to know, maybe I can still switch :) I mean, I know there are people making that kind of money, but it can't be the average for a PhD with work experience?

Of course, by that definition, there is never a shortage of anything.

This is a very good question, and made me think a little. Here's my stab at it:

If we define the shortage as "shortage of people willing to do X for $200,000 a year", that's clearly a bad definition. You should just pay more (as earl suggested) to get what you want. But what if that's just not possible on a macro level?

Consider if you have an aggregate demand of "the market needs a total of 500 data scientists". If there are only 250 data scientists in the world, their salaries will be bid up, and I can see somebody crying that there's a shortage of affordable data scientists (whatever that means). But any capitalist will tell you that they're just looking for a free sailboat. On the other hand, the 250 data scientists are being paid a lot, and on some level skills are somewhat fungible, so you end up with non-data scientists (maybe vanilla statisticians/actuaries) moving sideways to get in on this payday. So you have some retraining, and in the long run things tend to work out. So there's no shortage.

But in the long run we are all dead. Thus even if we define a shortage as the shortfall in supply at _any_ price, this isn't sufficient. In the short run there can very well be a shortfall.

If it takes 3 years to train a data scientist (I'm just making stuff up here), and there are 250 data scientists on the market, if you have an aggregate demand for 500 data scientists _today_ --- completely price unconditional --- you just cannot fill it. Price is almost irrelevant (on the macro level --- you as an individual can always outbid your competitors). There is a temporal shortage that cannot be filled.
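A toy sketch of that arithmetic, using the same made-up numbers (250 on the market, demand for 500, three years to train). The function name, the yearly cohort size of 100, and the assumptions that training starts in year 0 and nobody leaves the field are all hypothetical, chosen only to make the lag visible.

```python
def unfilled_positions(supply, demand, training_years, trainees_per_year, horizon):
    """Return the unfilled demand at the start of each year, for years 0..horizon.

    Assumes (hypothetically) that a new cohort starts training every year
    from year 0, graduates exactly `training_years` later, and nobody leaves.
    Wages do not appear anywhere: in the short run no price creates trained people.
    """
    gaps = []
    for year in range(horizon + 1):
        # Cohorts started in years 0..(year - training_years) have graduated.
        graduated_cohorts = max(0, year - training_years + 1)
        available = supply + graduated_cohorts * trainees_per_year
        gaps.append(max(0, demand - available))
    return gaps


gaps = unfilled_positions(supply=250, demand=500, training_years=3,
                          trainees_per_year=100, horizon=6)
print(gaps)  # [250, 250, 250, 150, 50, 0, 0]
```

The gap stays at 250 for the first three years no matter what anyone pays, and only closes once trained cohorts start arriving.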

I think this is the precise definition of what a shortage is that you are looking for. Shortages exist in the macro scale. Shortages do not exist for individual companies (they should just pay more, and if the benefits are not worth the cost of hiring, there isn't a shortage, they're just cheap).

Well, I applaud your effort but you've missed things.

The market doesn't need anything; human beings need and desire things and, in the capitalist model, the market is the means to balance those needs and desires.

If 50 entrepreneurs desire 500 data scientists for their enterprises and only 250 such scientists are on offer, they'll bid up the prices until some of them decide "I don't want them that much" and then we're done.
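The bidding story can be sketched as a simple ascending-price loop. Everything numeric here beyond the 50 entrepreneurs, 500 wanted and 250 on offer is an assumption for illustration: each entrepreneur wants 10 hires, and their maximum salaries are spread evenly from $150k to $395k.

```python
def clearing_salary(max_bids, wanted_each, supply, start, step):
    """Raise the salary by `step` until demand no longer exceeds supply.

    max_bids: highest salary each employer will pay (hypothetical values).
    wanted_each: hires each employer wants while the salary is within their limit.
    Returns the salary at which bidding stops and the demand at that salary.
    """
    salary = start
    while True:
        # Employers whose limit is below the current salary drop out.
        demand = sum(wanted_each for bid in max_bids if bid >= salary)
        if demand <= supply:
            return salary, demand
        salary += step


# 50 employers with limits of $150k, $155k, ..., $395k (made-up numbers).
bids = [150_000 + 5_000 * i for i in range(50)]
salary, demand = clearing_salary(bids, wanted_each=10, supply=250,
                                 start=150_000, step=5_000)
print(salary, demand)  # 275000 250
```

With these numbers, half the entrepreneurs decide "I don't want them that much" by $275k, and the remaining 25 absorb all 250 scientists.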

Of course, if our 50 entrepreneurs all have vast, vast wealth and a very strong desire for those scientists, we may see them creating bootcamps for quick data scientist development or whatever. Then they might indeed wind up with another 250 data scientists and again, we're done with no mystery.

Of course, you could argue that markets aren't as efficient as some claim, but that's just about irrelevant to our reasoning here since we're mostly reasoning by definitions and extremes, so the OP basically holds true: you say you want it, but you've shown you don't want it that much, and so you're really blowing smoke...

By that definition, the only absolute shortage is in things that aren't available at any price.

There can still be a relative shortage. There's a shortage of gold-per-pound relative to rocks-per-pound, for example.