HN Mirror


by yummyfajitas 5164 days ago

Companies already do pay close to $200k/year for entry level data scientists.

(what other kind of scientists are there? The tea-leaf reading kind?)

"Data scientist" refers to the guy who can set up a Hadoop cluster, do statistics on TBs worth of data, derive useful conclusions, and speed it all up by tweaking the low-level data formats or micro-optimizing the calculation.

The issue is rarely paying these guys an extra $20k, it's simply finding them.

Setting up some lasers and a photonic crystal, imaging the output, making a graph in Excel or MATLAB and drawing conclusions is a different skillset. Someone who can do the latter is a scientist who uses data, but he is not a data scientist.

4 comments

bearmf 5164 days ago

The problem as I see it is that most companies are looking for the all-in-one perfect candidate. There is indeed a shortage of such people.

Say you need someone who knows a lot about Hadoop and Amazon EC2 and is also intimately familiar with most learning algorithms and has a PhD. You are having trouble finding the guy. You start crying about "the big data talent shortage".

And here is the problem. Most PhDs have no experience with Hadoop or Amazon EC2. Some of them might know Java well enough.

Now, consider a smart guy with a PhD who knows Java and has done something parallel with it, working on real "dirty" data. He can pick up Hadoop in no time from your software engineers. He will learn to tweak and optimize in time - it is domain specific and cannot be learned off the job.

Will he be hired? Probably not. But people will keep crying about shortage.

Tichy 5164 days ago

How hard can it be, though? Like taking a normal CS person and making them versatile with Hadoop and so on? Could it be done for $20K?

yummyfajitas 5164 days ago

Making a CS person versatile with Hadoop is not that hard. Making a CS person versatile with statistics is much harder.

See Zed Shaw's seminal article "Programmers Need To Learn Statistics Or I Will Kill Them All".

http://www.zedshaw.com/essays/programmer_stats.html

Making a math/science person versatile in CS is somewhat easier, but even that can be tricky. Many of them are bored by file formats, architecture, etc., and simply don't have the engineering mindset.

achompas 5164 days ago

> How hard can it be?

Very hard. You run into all types of candidates who just aren't there yet: people working on research that's irrelevant to real-world applications, people who have done data analysis/BI work and brand themselves as "data scientists," those who have the pedigree but cannot process and explore real-world data, those who have good analytical chops but not the distributed or advanced modeling experience, etc.

I've witnessed it first-hand, and it's tough to find the right person.

bearmf 5164 days ago

If it is that hard, the bar is probably set too high. Most of the skills are learned on the job after all. Most smart PhDs who can program well and have sound knowledge of statistics can learn to do this stuff.

achompas 5164 days ago

Given enough time, anyone smart enough to finish a PhD can acquire a set of skills. :)

But it's more than just solid statistics. We're talking about having enough mathematical fluency to develop models rigorously (not just "oh, we'll minimize MSE!!"), test those models, then implement those models--possibly using a distributed algorithm.

From what I hear, these skills take years to develop. Choosing to groom the wrong person is an extremely costly mistake, so making the choice is difficult.

bearmf 5164 days ago

All mathematics consists of rigorous models. But choosing and tweaking a model is more of an art. Most data scientists apply existing models to new data, they do not develop new ones.

I am sure it takes much less than "years" for any smart PhD in applied mathematics to learn most of data analysis tricks. It is not theoretical physics after all.

achompas 5164 days ago

> Most data scientists apply existing models to new data, they do not develop new ones.

I meant "develop" in the software sense. Data scientists use off-the-shelf libraries during initial research, but those libraries usually lack an important feature preventing them from going into production (typically, no support for concurrency).

> I am sure it takes much less than "years" ... to learn most of data analysis tricks.

I used to be cynical about "data science," too. After four months of working on a data science team, though, I'm a believer.

A data scientist is really a "full-stack data developer." He or she needs the ability to work with advanced models, use them to analyze large amounts of data, and modify those models to work concurrently or in a distributed system if desired (and it's often desired). It's more than just "analysis tricks."

pnathan 5164 days ago

> do statistics on TBs worth of data, derive useful conclusions

That's gonna be the hard part. Most CS people I've met flee from math and, more generically, theory.

groth 5164 days ago

They do? Which companies? How do I find them? :p

yummyfajitas 5164 days ago

Build a demo project showing you can do data analysis and they will find you.

earl 5164 days ago

Who is paying $200k for entry level besides maybe Google?

anothermachine 5164 days ago

Startups.*

*Equity value, may vary unpredictably.

And it's entry post-PhD, not entry from college.

earl 5164 days ago

So $120k and (very expensive) lottery tickets is what you're saying =P

reinhardt 5164 days ago

$120K as a startup employee? Damn, I live in the wrong country.