by llambda 1416 days ago
To address a point the author makes: I’m entirely unconvinced the “shift left” mentality of data democracy (aka business operators should write SQL) is actually shifting left or a worthy path to pursue for most businesses. More recently this 2010s fad seems to be dying, and in its place we’re seeing centralized data efforts that produce data products.

One of the most significant pitfalls of data is failing to interrogate the value it provides and assuming that if you give everyone access all the time the magic will happen. The truth is that value does not simply materialize, just as value does not magically spring from a computer because a human powers it on (okay sure, you may have already automated the value but that’s actually the point I’m about to make). In both cases it requires an experienced practitioner who collaborates with a larger team to intersect their work with the business needs.

Data is tricky, all the more so because it’s often seen as a panacea by business leaders who aren’t connected with the work of extracting that value.


With all credit due to Google's excellent and under-appreciated paper Machine Learning: The High Interest Credit Card of Technical Debt [1], I submit that Big Data is the high interest home equity line of credit of business operations debt.

It's not that big data tools aren't useful. It's that, when you just start amassing huge piles of data without a clear up-front plan for how it will be used, and assume that a whole bunch of people who have never heard of sampling bias or multiple comparisons bias or Coase's Law [2] can figure out what to do with it later, you're setting yourself up for a Bad Time.

  1: https://research.google/pubs/pub43146/ 
  2: "If you torture the data long enough, it will confess."
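
To put a rough number on the multiple comparisons problem: assuming the tests are independent and each uses a 5% significance threshold (both are my assumptions for illustration, not something from the paper), the chance of at least one spurious "finding" is 1 - (1 - alpha)^m for m comparisons. A minimal sketch; the function name is made up:

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests.

    Assumes every null hypothesis is true and the tests are independent,
    which real dashboards rarely satisfy; treat this as a rough guide only.
    """
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    # Chance that all tests correctly stay quiet is (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        print(f"{m:>3} comparisons -> {family_wise_error_rate(m):.1%}")
```

Under those assumptions, slicing a pile of noise 20 different ways gives you roughly a 64% chance of "discovering" something.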
I'd say that Big Data is the Collateralized Debt Obligations of business operations. It looks fabulous from afar but it can blow things up quickly if there's no understanding of the internals.
Yet we abide by data-oriented conclusions outside of software engineering all the time, from academic papers to the FDA to crime statistics.
I won't say any of those are perfect. But there's at least a little more effort toward responsible data analysis in academia. The FDA brings an interesting example to mind. Take a look at how, on paper, drugs suddenly magically became less effective when the FDA started requiring clinical trial pre-registration in 2007.

It's also worth noting that, over the past few decades, most academic fields have been getting increasingly skeptical of the value of correlative research on pre-existing data sets. Even among people who have been extensively trained in how to do it properly. And yet, the vast majority of big data business plans I've seen in practice boil down to "collect a huge data set and then let people do correlative research on it."

Agreed, I want more scrutiny than some entity flashing “Here is the data”. It can easily be exploited behind the veneer of data-based credibility.
> I submit that Big Data is the high interest home equity line of credit of business operations debt.

I like this but it's kinda like the payday loan of business operations.

>that if you give everyone access all the time the magic will happen

There's much ongoing discussion about this in the data world, often revolving around "self-service analytics".

Unless you're talking about "our analysts don't have to clean data all the time", which, for a large enough organization makes sense, "self-service" for non-technical folks is futile and pointless. They need specific answers to specific questions, not the ability to infinitely explore the data. Organizations should desire that kind of focus, not prevent it.

The idea was that they were going to hire an army of data scientists and become Google...magically.

Reality smacked that shit down hard. I left data engineering because the projects were all over the place, wildly undisciplined and unfocused.

You were lucky to have source control let alone an understanding from the business that these projects were in fact software development.

I switched back to software engineering because at least there is a faint realization that we are...building software.

I might go back when the dust clears.

"Why do we need to hire programmers...I thought we needed data engineers?"

"Because the data pipelines are all built with thousands of lines of code. Java, Python, Fortran, you name it...and your job post only mentioned SQL and data modelling"

I could go on forever.

This is the constant argument I have with people about data products.

You don't need to expose more dimensions or get the users more access to the raw data. You need to understand what their business is and what their business problems are and help them answer those specific questions quickly and succinctly.

Yes, there are certainly times where people use huge amounts of raw data to uncover the answer to a question they didn't know they had. But it's rare, it's expensive to support, and most businesses aren't going to be able to do anything with it anyway (a whole org built to do X isn't suddenly going to shift to do Y because you discovered some insight in a random report).

I've seen data errors because of joins and aggregations. Data democratization can be a net negative, especially if people don't question the graphs they see.
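
A typical case is join fan-out: joining a table to one with several matching rows per key, then summing, counts the same amount more than once. Here is a small sketch using Python's built-in sqlite3; the tables, columns, and numbers are invented for illustration:

```python
import sqlite3


def revenue_totals() -> tuple[int, int]:
    """Return (naive_total, correct_total) in cents for a toy orders table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount_cents INTEGER);
        CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
        -- Order 1 shipped in two parcels, order 2 in one.
        INSERT INTO orders VALUES (1, 10000), (2, 5000);
        INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
        """
    )
    # Naive: the join repeats order 1 once per shipment, so its amount is summed twice.
    naive = conn.execute(
        "SELECT SUM(o.amount_cents) FROM orders o "
        "JOIN shipments s ON s.order_id = o.order_id"
    ).fetchone()[0]
    # Correct: aggregate at the grain of the orders table, before any join.
    correct = conn.execute("SELECT SUM(amount_cents) FROM orders").fetchone()[0]
    conn.close()
    return naive, correct


if __name__ == "__main__":
    naive, correct = revenue_totals()
    print(f"naive join total: {naive}, actual total: {correct}")
```

Both queries run without error and both charts look plausible, which is why nobody questions the inflated one.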
Do you know what the author means by 'left' here? Probably not moving bits around in a way that's equivalent to multiplying by powers of 2?