by tomjen3 5165 days ago
Actually that is silly -- McKinsey should know that there is not, and never will be, a talent shortage. There will only be a shortage of talent at a particular wage rate.

If the companies paid newly graduated 'data-scientists' (what other kind of scientists are there? The tea-leaf reading kind?) 200k/year then they would have a lot more. It is pretty simple economics.

4 comments

Companies already do pay close to $200k/year for entry level data scientists.

> (what other kind of scientists are there? The tea-leaf reading kind?)

"Data scientist" refers to the guy who can set up a Hadoop cluster, do statistics on TBs worth of data, derive useful conclusions and speed it up by tweaking the low-level data formats or micro-optimizing the calculation.

The issue is rarely paying these guys an extra $20k, it's simply finding them.

Setting up some lasers and a photonic crystal, imaging the output, making a graph in Excel or MATLAB and drawing conclusions is a different skillset. Someone who can do the latter is a scientist who uses data, but he is not a data scientist.

The problem as I see it is that most companies are looking for the all-in-one perfect candidate. There is indeed a shortage of such people.

Say you need someone who knows a lot about Hadoop and Amazon EC and is also intimately familiar with most learning algorithms and has a PhD. You are having trouble finding the guy. You start crying about "the big data talent shortage".

And here is the problem. Most PhDs have no experience with Hadoop or Amazon EC. Some of them might know Java well enough.

Now, consider a smart guy with a PhD who knows Java and has done something parallel with it, working on real "dirty" data. He can pick up Hadoop in no time from your software engineers. He will learn to tweak and optimize in time - it is domain specific and cannot be learned off the job.

Will he be hired? Probably not. But people will keep crying about shortage.

How hard can it be, though? Like taking a normal CS person and making them versatile with Hadoop and so on? Could it be done for $20K?

Making a CS person versatile with Hadoop is not that hard. Making a CS person versatile with statistics is much harder.

See Zed Shaw's seminal article "Programmers Need To Learn Statistics Or I Will Kill Them All".

http://www.zedshaw.com/essays/programmer_stats.html

Making a math/science person versatile in CS is somewhat easier, but even that can be tricky. Many of them are bored by file formats, architecture, etc., and simply don't have the engineering mindset.

> How hard can it be?

Very hard. You run into all types of candidates who just aren't there yet: people working on research that's irrelevant to real world applications, people who have done data analysis/BI work that brand themselves as "data scientists," those who have the pedigree but cannot process and explore real-world data, those who have good analytical chops but not the distributed or advanced modeling experience, etc.

I've witnessed it first-hand, and it's tough to find the right person.

If it is that hard, the bar is probably set too high. Most of the skills are learned on the job after all. Most smart PhDs who can program well and have sound knowledge of statistics can learn to do this stuff.

Given enough time, anyone smart enough to finish a PhD can acquire a set of skills. :)

But it's more than just solid statistics. We're talking about having enough mathematical fluency to develop models rigorously (not just "oh, we'll minimize MSE!!"), test those models, then implement those models--possibly using a distributed algorithm.

From what I hear, these skills take years to develop. Choosing to groom the wrong person is an extremely costly mistake, so making the choice is difficult.

All mathematics consists of rigorous models. But choosing and tweaking a model is more of an art. Most data scientists apply existing models to new data, they do not develop new ones.

I am sure it takes much less than "years" for any smart PhD in applied mathematics to learn most of data analysis tricks. It is not theoretical physics after all.

> do statistics on TBs worth of data, derive useful conclusions

That's gonna be the hard part. Most CS people I've met flee from math and, more generically, theory.

They do? Which companies? How do I find them? :p

Build a demo project showing you can do data analysis and they will find you.

Who is paying $200k for entry besides maybe Google?

Startups.*

*Equity value, may vary unpredictably.

And it's entry post-PhD, not entry from college.

So $120k and (very expensive) lottery tickets is what you're saying =P

$120K as a startup employee? Damn, I live in the wrong country.

Bags of money are already being waved around; that is not the problem. Wages are already moving north of $200k for these positions because you can't find people with the basic skills for any amount of money.

Being a "data scientist" as currently defined in practice requires someone to be a polymath with skills that are individually high value and not commonly found together. Roughly speaking, you need some aptitude and experience in the following areas:

- mathematics, particularly statistics, computational geometry, machine learning, and probability theory

- parallel algorithm design, something for which most software engineers have no skill

- database ETL processes, formerly a highly specialized discipline only found in the database administration world

You can learn the mathematics in school or with some study. Most software engineers never develop a knack for parallel algorithm design even when they try; e.g., virtually all software engineers who claim to know parallel algorithms can't explain why hash joins do not parallelize well. Lastly, ETL is something that isn't normally found mixed with the other two but which usually requires some significant experience to do correctly. Even if you are a master of mathematics and parallel algorithms, ETL skills are something you usually learn by apprenticing with someone who is an ETL master for a couple of years.

Finding people that even have basic levels of skill at all three of these things is very difficult even if you loosen the criteria significantly. Unlike some other tech job fads, you can't mint a crop of data scientists in a year.

When I look at the junior level data scientists we trained internally with great basic skills out of school, it has taken years to develop them. This level of effort and length of time is the real bottleneck.
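To make the overlap between statistics and parallel algorithm design concrete, here is a minimal sketch (not anyone's production code) of computing a mean and variance over data split across partitions. Each partition is summarized independently, and the partial summaries are merged with the standard pairwise update for combining means and sums of squared deviations. The names `Summary`, `summarize`, and `combine` are made up for illustration.

```python
import math
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Summary:
    """Count, mean, and sum of squared deviations for one data partition."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def add(self, x: float) -> "Summary":
        # Welford's single-pass update for one new value.
        n = self.n + 1
        delta = x - self.mean
        mean = self.mean + delta / n
        return Summary(n, mean, self.m2 + delta * (x - mean))

    def merge(self, other: "Summary") -> "Summary":
        # Combine two partial summaries without revisiting the raw data.
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Summary(n, mean, m2)

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two values.
        return self.m2 / (self.n - 1) if self.n > 1 else math.nan


def summarize(partition):
    # The "map" step: each worker summarizes only its own partition.
    return reduce(Summary.add, partition, Summary())


def combine(summaries):
    # The "reduce" step: merge is associative up to floating-point rounding,
    # so partial summaries can be grouped and merged in any order.
    return reduce(Summary.merge, summaries, Summary())
```

The point of the design is that only three numbers per partition ever cross the network, which is the kind of restructuring a naive two-pass variance formula does not give you for free.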

Computational geometry??? That's a new one for me. Do you mean only linear/convex programming?

Incidentally, I would really like to hear about the kind of Real Work that data scientists end up doing with TBs of data, because I'm always fuzzy on the details. MCMC? Variational methods? SVMs? Or is it more oriented towards frequentist statistical methods, applied at "web-scale"?

I mean actual computational geometry. Reality is significantly non-Euclidean in complicated ways that have to be accounted for if precision matters.

Spatio-temporal analytics or the processing of sensing data frequently requires this. For a simple example, the surface of the Earth is approximately an oblate spheroidal surface, not even a 2-sphere. You can use Euclidean approximations for many cartographic purposes but for analytics this can introduce large errors in the analysis. Understanding how to compute non-Euclidean geometry models is surprisingly useful.
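A rough sketch of how large the Euclidean error can get, under a loud simplifying assumption: the code below treats the Earth as a perfect sphere of mean radius 6371 km (the haversine formula), which is itself only an approximation of the oblate spheroid described above. The function names are made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical simplification, not the spheroid


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance on a sphere; inputs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def naive_planar_km(lat1, lon1, lat2, lon2):
    """Treats degrees of latitude and longitude as equal-sized flat x/y units."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    return km_per_degree * math.hypot(lat2 - lat1, lon2 - lon1)
```

For two points at 60 degrees north that are 10 degrees of longitude apart, `great_circle_km(60, 0, 60, 10)` comes out at about 555 km while `naive_planar_km(60, 0, 60, 10)` gives about 1112 km, so the flat shortcut is off by roughly a factor of two. Along the equator the two agree, which is why the error is easy to miss in testing.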

Isn't that point of view also colored by ideology? Even supposing there are enough people capable of becoming that kind of "talent", what if those talents are also sought after in other kinds of jobs?

Granted, if it were really urgent, perhaps companies would start looking in the most remote places for talent, so with a population of 6 billion perhaps there really would be enough who could be trained. How many of those 6 billion are "free" in a sense, as in not needed for the maintenance of human life (farming, medicine, building shelter and so on)?

But do economics really work that way? Could we extrapolate that logic to conclude that there is no problem in the world at all? All it takes is enough money to solve every problem - alas, the money doesn't seem to be there, or allocating it properly is apparently hard. (Hm, some of those talents might be able to help, for a true bootstrap solution).

It's not that you can solve every problem, but supply and demand are very real. If the wage rates for Big Data get high enough more and more people will try out the field, which will generate a larger supply.

The reason companies don't just throw out huge salaries, though, has to do with the demand side. The salary a company is willing to pay is related to the marginal advantage it can gain from hiring someone with that skillset. If, for example, a company will gain say 200k per year in total advantage, that places a hard cap on how much it would be willing to pay in salary.

So if the advantage is very high, companies will pay more. If the supply increases sufficiently, wage rates will drop because there is oversupply. If the supply doesn't increase enough, wages will increase more; however, each company will drop out at its own value point. This provides the natural limit to where most salaries cap out.
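The drop-out argument can be put into a toy model. This is a sketch under strong assumptions that are mine, not the thread's: every firm wants exactly one hire, all candidates are interchangeable, and every candidate accepts any wage at or above a common `reservation_wage`. The function name and the numbers in the usage note are invented for illustration.

```python
def clearing_wage(firm_values, num_candidates, reservation_wage):
    """Toy market: each firm wants one hire and drops out above its own value cap."""
    # Firms whose value cap is below the wage floor never bid at all.
    bids = sorted((v for v in firm_values if v >= reservation_wage), reverse=True)
    if len(bids) <= num_candidates:
        # Oversupply: no firm has to outbid another, so the wage sits at the floor.
        return reservation_wage
    # Shortage: the wage rises until only num_candidates firms remain,
    # i.e. until the best-placed excluded firm hits its cap and drops out.
    return bids[num_candidates]
```

With value caps of 250k, 200k, 150k, 120k and 90k and a floor of 80k, two available candidates clear at 150k, four clear at 90k, and five or more clear at the 80k floor. The wage is set by where firms drop out, not by how loudly they complain about a shortage.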

If those talents are also sought after for other jobs then the price will go up until one of the jobs will be done by some other method or some other person. I do have trouble imagining that anybody who is working on a farm would be a good data-scientist, but then I know very, very little about farming.

Economics is not tainted or colored by ideology, it is a science. It is the study of how best to allocate limited resources that have multiple conflicting uses.

In this world there is nothing that is free; everything comes with some price. As long as there is a human want that is not fulfilled, there are no surplus humans.

That isn't necessarily bad though. You can chart a society's progress by how few people are required to provide food to the rest. Once most Americans worked in agriculture; now only a few do. That is a good thing, because the rest of us can then do something else and satisfy some other human want.

And the remaining farmers are better off too, since they don't have to work as hard and have things like TVs and computers.

> Economics is not tainted or colored by ideology,

You must be joking. Economics is the most ideological of the sciences, because so much of it cannot be tested in nature or a lab; it can only be tested on an impractically large scale and with many confounding factors.

You're absolutely right. That said, I don't buy that big data is as revolutionary as the Internet. While in theory every single business can collect data and optimize based on what they see, this is way too complex for most businesses to deal with. While big data has certainly been critical for the business model of ad-based startups, I don't see it being used in other industries. People keep alluding to data-driven medicine and genetic analyses. These are some of the most complex information analysis domains, and yet I don't see benefit commensurate with the big data hype. I'd love to hear counterexamples though!

This technology will change the world more than the Internet or any other technology in human history.

You're right that adoption is very slow. I'm convinced that businesses could save trillions of dollars by applying existing weak AI to their problems. Why aren't they doing it? For one thing there's a huge gulf between the average business person's understanding of what is possible and what actually is. On the other hand the people who understand the technology don't have domain experience in various businesses. You can't develop solutions if you don't know what the problems are and it's very hard to guess at what economically relevant problems exist in fields you've never worked in.

There are probably other barriers as well. Domain experts are unlikely to champion technologies that may, well, replace them. Bayesian networks were developed in academia 20 years ago that outperformed doctors at medical diagnosis. Why aren't they being applied? There are probably many reasons but I suspect conscious or unconscious resistance on the part of the medical community plays a significant role.

As for a talent shortage, I don't buy it. I'm exactly the sort of person this article talks about, with a strong mathematical background, excellent implementation skills and real world experience in developing and applying machine learning algorithms that have made millions of dollars for my former employers. I have had a website and a LinkedIn profile for over a year that make this fairly clear. How many consulting inquiries have I had? Exactly zero.

Whether or not you have these skills: potential employers need to SEE them. There are lots of pretenders out there, and employers are appropriately wary.

Are you showing employers results on your web page that a worse ML practitioner can't match?

Putting results from a Kaggle competition on my LinkedIn page landed me my current job (and I am still contacted by potential employers every couple of weeks).

The employers of the world aren't stupid, but they aren't omniscient either. So you need to make it easy for them to see that you have the skills you claim.

Did you win the competition? If not, what was your rank, if you don't mind sharing? I'd be really interested to know how much impact this could have.

I was #4 at the time my current employer contacted me. I've continued in the competition with a small team of their employees, and we are currently #2.

I don't know how sensitive the number of job contacts is to your ranking. My hunch is that a lower ranking would still establish credibility.

I couldn't reply to Estragon's comment below, so I'm replying here.

I'm convinced of it primarily because I'm convinced that the majority of activities that human workers perform do not require general intelligence (i.e. strong AI). This includes most manual tasks: cleaning, cooking, customer service in restaurants, etc., but also many tasks performed by office or even "knowledge" workers.

A product my previous company developed replaced over 20K workers over a period of years. Few people even know it exists.

Once you are aware of this and follow the news you see it happening in virtually all domains: e-discovery systems replacing attorney hours, automated news story generation, etc. One area that has been lagging is robotics but this will start to develop very quickly, especially with the new DARPA Robotics Challenge.

Now to be clear many of the "Big Data" applications people are talking about may not fall into this human labor replacement category, but the underlying technologies are essentially the same.

> I have had a website and a LinkedIn profile for over a year that make this fairly clear

It doesn't work like that. Your LinkedIn profile might easily land you any job in software development, but not consulting.

In my opinion, if you want to do consulting for big corp., you should figure out what it takes to do it: an attractive website and presentation, a few buzzwords, client testimonials, business cards, and the other blablabla. Yes, it's irrelevant (and shitty) to what you are actually doing, but that's actually the world of consulting.

I think you can go further than that. To get people asking for your time as a consultant you have to demonstrate experience and get close to vendors who already support clients you are interested in. For example, targeting a niche "big data" problem with a particular tool, and then developing a relationship with the community supporting that tool. That gives you access to the people who are looking for consulting.

Big Corp - "We can't find people who can do X!!!!"

Person who can do X - "I've been saying over here I can do X."

Big Corp - "Oh. We don't look over there, it's not how it's done."

I think I'm seeing part of the issue here.

That's definitely not my world and one reason I left big-corp in the first place.

EDIT ADDED: It seems like a really broken market if buyer decisions are completely orthogonal to the product being purchased.

I hope you are right. If you are, you have identified a potentially very, very lucrative option for a start-up. Broken markets can provide you a lot of money when you fix them.

This market is already (at least partially) covered by small consulting shops which provide a sales front for competent freelancers who don't feel like doing the whole corporate networking and sales ritual.

  I'm convinced that businesses could save trillions of
  dollars by applying existing weak AI to their problems.
Why are you convinced of this? Do you have any case studies with obvious broad applications?

  I have had a website and a LinkedIn profile for over a year that make
  this fairly clear. How many consulting inquiries have I had ? Exactly
  zero.
Neither LinkedIn nor your ISP is responsible for marketing you. Have you been presenting your software at restaurant conferences? Do you have any case studies where it saved someone x% of their food costs?

Slightly off-topic, but your website breaks after visiting the RDMS page, as all the other links seem to be relative, so they attempt to go to pages such as /products/people.html

Sorry about that, thanks for letting me know. It should be fixed now.