by jandrewrogers 5171 days ago
Bags of money are already being waved around; that is not the problem. Wages are already moving north of $200k for these positions because you can't find people with the basic skills for any amount of money.

Being a "data scientist" as currently defined in practice requires someone to be a polymath with skills that are individually high value and not commonly found together. Roughly speaking, you need some aptitude and experience in the following areas:

- mathematics, particularly statistics, computational geometry, machine learning, and probability theory

- parallel algorithm design, something for which most software engineers have no skill

- database ETL processes, formerly a highly specialized discipline only found in the database administration world

You can learn the mathematics in school or with some study. Most software engineers never develop a knack for parallel algorithm design even when they try; for example, virtually all software engineers who claim to know parallel algorithms can't explain why hash joins do not parallelize well. Lastly, ETL isn't normally found mixed with the other two, and it usually requires significant experience to do correctly. Even if you are a master of mathematics and parallel algorithms, ETL skills are something you usually learn by apprenticing with an ETL master for a couple of years.
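The comment doesn't spell out the hash join point. One commonly cited difficulty (an assumption here, not necessarily the one meant above) is key skew: a partitioned hash join routes each row to a worker by its join key, so a hot key piles most of the work onto one worker. A minimal Python sketch, where `partition_sizes` is a hypothetical helper and integer modulo stands in for a real hash function:

```python
def partition_sizes(keys, num_workers):
    """Count the rows each worker receives when rows are routed by join key."""
    sizes = [0] * num_workers
    for key in keys:
        # Integer modulo is a stand-in for a real hash function (assumption).
        sizes[key % num_workers] += 1
    return sizes


# Uniform keys spread evenly across the workers.
uniform_keys = list(range(1000))

# Skewed keys: 700 rows share key 7, plus 300 rows with distinct keys.
skewed_keys = [7] * 700 + list(range(300))

uniform_load = partition_sizes(uniform_keys, 4)
skewed_load = partition_sizes(skewed_keys, 4)

print("uniform:", uniform_load)
print("skewed: ", skewed_load)
```

With the skewed input, every row carrying the hot key lands on the same worker, so the join finishes only when that worker does, no matter how many other workers sit idle.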

Finding people who have even basic levels of skill at all three of these things is very difficult, even if you loosen the criteria significantly. Unlike some other tech job fads, you can't mint a crop of data scientists in a year.

The junior-level data scientists we trained internally came out of school with great basic skills, and it has still taken years to develop them. This level of effort and length of time is the real bottleneck.

1 comments

Computational geometry??? That's a new one for me. Do you mean only linear/convex programming?

Incidentally, I would really like to hear about the kind of Real Work that data scientists end up doing with TBs of data, because I'm always fuzzy on the details. MCMC? Variational methods? SVMs? Or is it more oriented towards frequentist statistical methods, applied at "web-scale"?

I mean actual computational geometry. Reality is significantly non-Euclidean in complicated ways that have to be accounted for if precision matters.

Spatio-temporal analytics or the processing of sensing data frequently requires this. For a simple example, the surface of the Earth is approximately an oblate spheroidal surface, not even a 2-sphere. You can use Euclidean approximations for many cartographic purposes but for analytics this can introduce large errors in the analysis. Understanding how to compute non-Euclidean geometry models is surprisingly useful.
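To make the size of that error concrete, here is a rough Python sketch comparing a flat "degrees as x/y" distance with a great-circle distance. Note the assumptions: it uses the haversine formula on a sphere with a mean Earth radius, which is itself only an approximation of the oblate spheroid mentioned above, and the function names are made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean radius; assumes a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def naive_planar_km(lat1, lon1, lat2, lon2):
    """Treat latitude/longitude as flat x/y, one degree = the same length everywhere."""
    km_per_degree = math.pi * EARTH_RADIUS_KM / 180
    return km_per_degree * math.hypot(lat2 - lat1, lon2 - lon1)


# One degree of longitude, measured at the equator and at 60 degrees north.
for lat in (0.0, 60.0):
    flat = naive_planar_km(lat, 0.0, lat, 1.0)
    curved = haversine_km(lat, 0.0, lat, 1.0)
    print(f"lat {lat:>4}: planar {flat:.1f} km, great-circle {curved:.1f} km")
```

At the equator the two agree, but at 60 degrees north the flat calculation comes out roughly twice the great-circle distance, because meridians converge toward the poles. A spheroidal model would shift the numbers slightly again.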