| Bags of money are already being waved around; that is not the problem. Wages are already moving north of $200k for these positions because you can't find people with the basic skills for any amount of money. Being a "data scientist" as currently defined in practice requires someone to be a polymath with skills that are individually high value and not commonly found together. Roughly speaking, you need some aptitude and experience in the following areas: (1) mathematics, particularly statistics, computational geometry, machine learning, and probability theory; (2) parallel algorithm design, something for which most software engineers have no skill; (3) database ETL processes, formerly a highly specialized discipline found only in the database administration world. You can learn the mathematics in school or with some study. Most software engineers never develop a knack for parallel algorithm design even when they try; for example, virtually all software engineers who claim to know parallel algorithms can't explain why hash joins do not parallelize well. Lastly, ETL isn't normally found mixed with the other two, and it usually requires significant experience to do correctly. Even if you are a master of mathematics and parallel algorithms, ETL skills are something you usually learn by apprenticing with an ETL master for a couple of years. Finding people who have even basic levels of skill at all three of these things is very difficult, even if you loosen the criteria significantly. Unlike some other tech job fads, you can't mint a crop of data scientists in a year. When I look at the junior-level data scientists we trained internally, who came out of school with great basic skills, it has taken years to develop them. This level of effort and length of time is the real bottleneck. |
Incidentally, I would really like to hear about the kind of Real Work that data scientists end up doing with TBs of data, because I'm always fuzzy on the details. MCMC? Variational methods? SVMs? Or is it more oriented towards frequentist statistical methods, applied at "web-scale"?