by mrharrison 3568 days ago
We should rename this job position to Data Sanity Engineers.

I have been thrown these projects at work before, where I'm the frontend engineer and I need to make some cool D3 visualization, but lo and behold the data is shit, and I have to help the backend team make the data usable. It's a mind-numbing job that nobody wants, because it sounds like a one-month task to get a good REST API up and working, but it usually takes three months, because you have to go back and forth making sure the data is right, and there are always 10 tricky edge cases that you have to work some magic on. Not only that, but you need to have smart people cleaning the data, so that you don't make some big mistake down the line or end up with a super slow REST API, and have to add another couple of weeks or a month to rework the data again. So that one month becomes three months, and most likely a year, because somebody will say "that looks great, but can we also add this," and it goes on and on. It's literally a mind-numbing job that almost nobody wants. I have found that products like Tableau are the best for this: you still have to clean the data, but they help speed up the process.

Data cleaning is a super golden problem to solve.

4 comments

As a counterpoint, some people (me) really enjoy working with data, from cleaning, munging, creating, sorting, pipelining, etc., and find front-end visualization production excessively boring and mind-numbing.

Give me emacs and a command line, and I have all the truth I need, which is far more honest, in my mind, than anything that can be created with D3 or Tableau. Beauty is in the eye of the beholder, and it doesn't really do anyone a service to look down on the work others find enjoyable. If doing D3 makes you happy, that is awesome, and I can only congratulate you for your passion and your ability to look forward to work I don't "get," and I wish the feeling were mutual.

So I guess you are a data engineer? What makes it fun for you? How do you work with your customers to give them what they need in a timely manner? I would be interested to know what stack you use to go from dirty data to customer consumption.
Closer to an aspiring data engineer, though I've done my fair share of ETL, cleaning, database building / rebuilding, admin. Prior jobs have been database engineer, probably closer to DBA.

I just enjoy working with raw data and raw code more than I enjoy writing something that launches a graphic. I enjoy writing a script that finds a bad piece of data, or a script that fixes up everything, or taking something that was once unable to run at all and converting it to something that runs in 500ms. Perhaps it is that journey of constant discovery, and seeing that every situation is a unique little puzzle. It is seeing the world as it is with no one reinterpreting what the data means for me. I can explore it and discover what it really means. It is hollow truth, a mess of ideas converted to sets of ideas layered on sets of ideas, and when it is finally drawn down, converted, and passing all tests, it is self-evident and self-reflecting, and true. Hard to explain, but I suppose I like all the things people hate about it.

The tools matter about as much as it matters what CSS framework you are using. You have the ability to logic through UI and UX, whereas I do not. I have zero hope of ever doing well at what you do, since I simply don't have the foundation, but if it matters, I know most jobs I've applied to and worked at tend to be more ad hoc, using PL, Python, Ruby, etc.

I'm not comparing frontend to backend. I also think data is fun and I don't mean to belittle the job, but in a real-world scenario it's detail-intensive, underappreciated, full of edge cases, and extremely complex if you plan to make it scalable and fast. So if you are an aspiring data engineer, be aware of these pitfalls, because the first couple of times you do it you will think it's fun to try something new and create some fun, useful analytics, but customers will often complain about how long it takes and want more. It starts to wear away at one's drive and passion for data. It's not the data aspect, it's the job/deadline aspect.
You're getting very close to the root cause - customers and even colleagues don't really care about the work that goes into the data. They care about the end deliverable, because that's what creates value for them, and fairly so. That gets at why data engineering as a discipline isn't (IMHO) very well respected.

I know this isn't reddit, so I'll point you to reddit. Check out /r/datascience, where those folks talk about what it takes to be a data scientist. Some folks are honest about data engineering, but most handwave past it, or talk about it like it's beneath them rather than a complementary and equally important discipline. Their role would not be possible without solid data engineering. Good luck doing "data science" or "analytics" or "machine learning" or every other buzzword without clean data, and for us data engineers, good luck ever demonstrating value without the analytics folks working with us.

There's nothing aspiring about what you wrote. I think you're fine calling yourself a data engineer if those are the types of challenges you've been solving.

Don't sell yourself short or select yourself out of an opportunity (within reason). That's someone else's job!

    sed -i 's/emacs/sublime-text/g' what_u_said.txt
more like Ctrl-H, tab, 'emacs', tab, 'sublime-text', tab, enter, esc, Ctrl-S
You are right, that is more coherent.
Not only that but you need to have smart people cleaning the data,

Which are difficult to find when you think of them as "janitors", and treat them accordingly.

Data Sanitation Engineers
I do it for a living. It seems underappreciated in the industry.
I agree. I enjoyed doing it the first couple of times, but people would often complain that I wasn't done sooner and didn't appreciate the level of complexity that went into doing it. Once the appreciation was gone, I believe that's when it turned into a mind-numbing task for me. I don't mean to belittle the job; I think I have just become sour on it because of the lack of appreciation.
A big part of my role is getting out there in front of business partners to keep the things that we do well front of mind. If you manage this work in the traditional sense, you'll be invisible when things go well and shat-upon as soon as anything goes wrong. At my current organization, I've really had to work at this. Here's a story:

Once upon a time I managed (and, frankly, also wrote a lot of the code for) a project integrating half a dozen sources each managing a block of our business (billing, coverage, claims). The data was awful coming in and we managed to get a bunch of business processes changed in addition to some pretty heavy cleansing steps that we wrote. In any case, this big fragmented mess of monthly and weekly stacked data became my integrated, clean warehouse. For the first time ever at this organization, I had coverage and claims records tying up at a rate of 100% without any manual intervention. We did this so that we could implement a modern finance ops process on top (being intentionally vague) that would allow us to manage this block more efficiently, save time, and even let us better invest - it was a 2 year project including my data work. A handful of actuaries and analysts got promoted out of this as it was a BFD to the company. Yet, at the end of the year, when I got my review I got our equivalent of the average rating, 3 of 5, etc, and like a 3% raise, and a shitty budget for my people too. From then on, I spent almost as much time out there promoting our team's work as we did doing the work. We did considerably better the next year, and that's been the way I've operated ever since. I market the work.

This kind of work requires a manager who will actively market it within the organization.
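
For anyone wondering what "tying up at a rate of 100%" might look like as a check, here is a minimal sketch in Python. It is not the code from the project described above. The field names (`claim_id`, `policy_id`), the sample records, and both helper functions are hypothetical, made up purely for illustration; a real warehouse would run this kind of check in SQL or an ETL tool against far messier keys.

```python
def tie_out_rate(claims, coverage_ids):
    """Return the fraction of claims whose policy_id exists in coverage.

    `claims` is a list of dicts and `coverage_ids` is a set of policy ids.
    Both shapes are assumptions made for this sketch.
    """
    if not claims:
        return 1.0  # nothing to reconcile, so nothing is unmatched
    matched = sum(1 for claim in claims if claim["policy_id"] in coverage_ids)
    return matched / len(claims)


def unmatched_claims(claims, coverage_ids):
    """Return the claims that have no matching coverage record."""
    return [claim for claim in claims if claim["policy_id"] not in coverage_ids]


# Hypothetical sample data: one claim points at a policy with no coverage row.
coverage_ids = {"P-001", "P-002", "P-003"}
claims = [
    {"claim_id": 1, "policy_id": "P-001"},
    {"claim_id": 2, "policy_id": "P-003"},
    {"claim_id": 3, "policy_id": "P-999"},
]

rate = tie_out_rate(claims, coverage_ids)
orphans = unmatched_claims(claims, coverage_ids)
print(f"tie-out rate: {rate:.1%}, unmatched claims: {len(orphans)}")
```

The list of unmatched records is usually the more useful output, since each orphan is one of the edge cases somebody has to chase down with the source system's owners.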