Hacker News new | ask | show | jobs
by moandcompany 3576 days ago
I am a data engineer working on a machine learning team with models actively used as part of our product(s).

From my experiences working in various contexts (applied machine learning, analytics, policy research, academics, etc...), there are several of factors that contribute to this shortage: (1) "data engineering" often requires a lot of breadth and knowledge, (2) "data engineering" is often (derisively and naively) referred to as the "janitorial work" of data science, (3) the spectrum of roles and requirements within the "data engineering" domain, in terms of job descriptions, can range from database systems administration, to ETL, to data warehousing, curation of data services / APIs, business intelligence, to the design/deployment/operation of pipelines and distributed data processing and storage systems (these aren't mutually exclusive, but often job descriptions fall into one of these stovepipes).

Some of my quick thoughts and anecdata:

Companies have made large investments in creating 'data science' teams, and many of those companies have trouble realizing value from those investments.

A part of this stems from investments and teams with no tangible vision of how that team will generate value. And there are several other contributing factors…

"Dirty work." People haven't learned how to, and more often don't want to do it. There's a vast number of tutorials and boot camps out there that teach newcomers how to "learn data science" with clean datasets -- this is ideal for learning those basics, but the real world usually does not have clean or ideal datasets -- the dataset may not even exist -- and there are a number of non-ideal constraints.

There are people that wish to call themselves “data scientists” that “don’t want to write code” and would “prefer to do the analysis and storytelling”

Engineering as the application of science with real world constraints: there are a number of factors that we take into account, often acquired through painful experience, that aren’t part of these tutorials, bootcamps, or academic environments.

Many “data scientists” I’ve met have a hard time adapting to and working with these constraints (e.g. we believe that the application of data science would solve/address __ problem, but: how do we know and show that it works and is useful? what are the dependencies, and costs of developing and applying that solution? is it a one-time solution, or is it going to be a recurring application? does the solution require people? who will use it? what are the assumptions or expectations of those operators and users? is it suitable? is it maintainable? is it sustainable? how long will it take? what are the risks involved and how do we manage them? is it re-usable, and can we amortize its costs over time? is it worth doing? This is part of a methodology that comes from experience, versus what is taught in data science)

Larger teams with more people/financial/political resources can specialize and take advantage of these divisions of labor, which helps recognize the process aspects of applying data science and address some of the above

Short story: if you view data engineering as "janitorial work" you're missing the big picture

Anyone else notice that the attributes of a 'unicorn' data scientist include the traits of a 'data engineer?'

2 comments

How does one get started with this? I suppose a lot of people who hang out at HN are competent devs good in programming and databases, but probably beginners in math, ML, AI etc. How does such a person get started and find a job in this field?
in my mind the problem is really simple: most executives aren't smart enough to understand how any of this shit works, or build a compelling business case around it. they just know they need a 'big data' team, so it just dies on the vine.

someone with enough smarts to build/lead a team, sell to executive management, and have an actual business application is just too rare compared to the prevalence of the engineering talent.