Hacker News
Cargo cult data science (blog.richardweiss.org)
144 points by riri-au 3242 days ago
10 comments

When the author says:

> However, that assumes that someone presenting an analytical presentation will be viewed more favourably than someone presenting something softer. Basically, I had assumed a data-driven culture exists, when in reality businesses are struggling to create that culture in the first place.

I think this understanding of the situation is in itself part of the problem. It assumes that someone coming in with an analytical presentation necessarily should be viewed more favorably than someone presenting something softer.

Coming from someone who's been working as a data scientist for several years, data-driven decision making has its limits. One very important one is a strong, strong bias towards myopic metrics (e.g. "engagement" over "lifetime value", "traffic volume" over "reputation in the market"), on the basis that they:

* Are more easily measurable

* Provide more data to work with

* Provide a stronger signal/noise ratio

* Provide much faster feedback

An organization which _always_ values data-driven decision making over expertise-driven decision making is always going to fall prey to this myopia. Fighting cargo-cult data science and building a sustainable analytical culture also means understanding the limits of data-driven decision making and that it does not replace, but supplements, "softer" expertise-driven culture.

> An organization which _always_ values data-driven decision making over expertise-driven decision making is always going to fall prey to this myopia.

This is a huge and under-appreciated concern. It's disturbing how often success is measured by optimizing a single metric, and how resistant people can be to recognizing issues with this approach.

A goal like "improve clickthrough rates" is easy to measure, but without some human insight it's all too easy to achieve it at the cost of overall success. Did you decrease time-on-page? Maybe your visitors feel misled. Did you decrease conversion rate? Maybe your new visitors don't actually want your product. And so on, indefinitely, including lots of side effects you might not have convenient statistics for.

I have a depressing sensation that at least half of corporate data science consists of abusing Goodhart's Law - finding a useful metric and then naively optimizing for it until it's no longer representative of business success.
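One way to make the clickthrough example concrete is a "guardrail" check: a change counts as a win only if the target metric improves and none of the side metrics gets noticeably worse. This is a minimal sketch, not anyone's production method. The `MetricChange` class, the `evaluate_change` helper, the metric names, the 2% threshold, and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MetricChange:
    name: str
    before: float
    after: float
    higher_is_better: bool = True

    @property
    def relative_change(self) -> float:
        # Relative change, signed so that positive always means "improved".
        raw = (self.after - self.before) / self.before
        return raw if self.higher_is_better else -raw


def evaluate_change(primary, guardrails, max_regression=0.02):
    """Accept a change only if the primary metric improved and no guardrail
    metric regressed by more than max_regression (relative)."""
    regressions = [g.name for g in guardrails
                   if g.relative_change < -max_regression]
    accepted = primary.relative_change > 0 and not regressions
    return accepted, regressions


# Hypothetical numbers: clickthrough went up, but time-on-page and
# conversion rate went down, and bounce rate barely moved.
primary = MetricChange("clickthrough_rate", before=0.040, after=0.050)
guardrails = [
    MetricChange("time_on_page_seconds", before=60.0, after=45.0),
    MetricChange("conversion_rate", before=0.020, after=0.018),
    MetricChange("bounce_rate", before=0.50, after=0.505, higher_is_better=False),
]

accepted, regressions = evaluate_change(primary, guardrails)
print("accepted:", accepted)
print("regressed guardrails:", regressions)
```

A check like this only covers the side effects someone thought to list as guardrails, so it narrows the Goodhart problem rather than removing it.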

> It's disturbing how often success is measured by optimizing a single metric, and how resistant people can be to recognizing issues with this approach.

This is very true. The zeal of data-driven approaches sometimes reminds me of "craniometry", where people tried to gauge intelligence by measuring the shape of one's skull.

The trade-off of using quantitative methods is always that you might lose too much meaning. The good thing about data-driven approaches is that they are transparent and enable objective decision making, but people need to pay close attention and be alert that whatever it is they are measuring still has some qualitative justification.

You preempted my followup article. I'm not sure where the balance between these two is, but I'm sure that many places get it wrong.
I'm definitely excited for that followup, then. I think this is an incredibly common issue in both directions, and a surprising number of companies seem to make both errors at once.
This is an interesting point.

I'm curious if one of the 'myths' of a data driven company is that you can instantly begin making decisions fed by 'real-time data' and learn after-the-fact from feedback loops. But for many legacy businesses the data pipeline for their important KPIs still moves slowly.

And then the data that does come in quickly becomes over-valued because everyone was sold the idea of instant gratification. So there is pressure to react to things quickly like meaningless web traffic metrics or local sales data - which may fluctuate heavily on a daily basis - instead of waiting for relevant patterns to emerge over longer periods.

Statistical significance and error rates are then overlooked in the name of a cargo cult data culture.
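As a rough illustration of why reacting to a single day of data is risky, here is a small sketch of a two-sided two-proportion z-test using only the standard library. The conversion counts are made up, and the pooled normal approximation is an assumption that is only reasonable for fairly large samples.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.
    Returns (z, p_value) using the pooled normal approximation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical: one day of traffic, 2.0% vs 2.6% conversion.
z_day, p_day = two_proportion_z_test(20, 1000, 26, 1000)

# Hypothetical: the same rates sustained over four weeks of traffic.
z_month, p_month = two_proportion_z_test(560, 28000, 728, 28000)

print(f"one day:    z={z_day:.2f}, p={p_day:.3g}")
print(f"four weeks: z={z_month:.2f}, p={p_month:.3g}")
```

With these made-up numbers the one-day difference is well within noise, while the same difference sustained over four weeks is not, which is the "wait for patterns to emerge" point in numeric form.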

This is why business books can be dangerous or even destructive, as business advice from one person's experience is sold as generic design patterns that apply to every business - which isn't the case. This is why understanding the business inside-and-out is the most important attribute, then having MBA-esque skills/toolset is useful second. So you take the reality of the business into full consideration and apply tools to it, rather than seek out tools and pigeonhole your business into them.

On that point, Superforecasting by Tetlock is an excellent book on softer analysis, which will make perfect sense to quant readers.
Oh, not familiar, looks good! Ordered.
That's a really good point, and I've seen that myopia cause problems as well. Like you say, it's about understanding the limits and applicability of data science, and how it interacts with experience and qualitative strategy.
One key fact that I wish every product manager internalized is that data science is a technology. And like all technologies, it may or may not be applicable to a particular problem. And it may or may not be the best use of an organization's time to invest in that technology vs. other options.

On the marketing side, just like a marketer will never market a database upgrade to users, she shouldn't consider marketing data science / ML directly to users. Users DO NOT care about using a data science enabled feature. They want value and progress in their lives, and sometimes you may create more value by removing a form field than by investing in and delivering a data science project.

So use good business sense to balance investment vs. reward when evaluating data science projects. I wrote about this here: https://growth.wingify.com/what-you-need-to-know-before-you-...

Wow, this article is exactly what is happening in my project including the "data lake" part. The even more infuriating thing is when everything turns into a bizarro world. Our boss made us spend an hour talking about the difference between "data lake" and "data ocean". The only thing one could do is facepalm.
You read a well-reasoned article and have first-hand experience, but frankly these are just two data points. You lack the data to justify the effort for a facepalm ;-)

IMHO, a lot of these unbounded data projects are the result not just of cargo cult but of satisfying a deeper need, i.e. management avoiding decision making. It is much easier to go for broad data collection than to make a directional decision, build a targeted model, and make real-world changes that lead to meaningful fact finding. Dreaming of data oceans is less risky than navigating a puddle, but the latter actually moves you forward.

> management avoiding decision making

A solid point, but I'd also add justifying decision-making.

The stated reason for projects like this is usually to guide better decision making, which is only possible if the project succeeds. But as long as the project produces some kind of comprehensible output, it can be used as an excuse for making new decisions or changing old ones.

Dilbert used to have a lot of strips about managers using re-orgs to bury their bad decisions. From some of the horror stories I've heard lately, data science and analytics have taken over that role at many companies, helping to cover up power grabs and backtracking under the guise of "listening to the data".

This article seems to speak unfavourably towards building a basic BI infrastructure. I don't know why. There is immense value for having a single trusted source of truth where the most basic business questions can be answered ad hoc, with a suite of simple visualizations and KPIs that cover the most important facts of how the business is doing. Data like this serves as a crucial element of communication across different departments in the company.

One use of a strong BI infrastructure, that is under-appreciated, is as a sort of test suite for the business. If an important metric changes, it's extremely costly if it is not discovered very quickly. This also can lower the cost of business risks. In other words there can be much value in visible data that points to nothing new. BI isn't just a method to look for business improvements, though it certainly can be that, as well.
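The "test suite for the business" idea can be sketched as an assertion that runs against each day's numbers. This is only an illustration under simple assumptions: the `check_metric` helper, the `daily_orders` metric, the 3-standard-deviation threshold, and the figures are all hypothetical, and a real metric with trends or weekly seasonality would need a better baseline than a trailing mean.

```python
import statistics


def check_metric(name, history, today, max_z=3.0):
    """Return (ok, message). ok is False when today's value is more than
    max_z sample standard deviations away from the trailing mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        ok = today == mean
    else:
        ok = abs(today - mean) / stdev <= max_z
    status = "OK" if ok else "ALERT"
    return ok, f"[{status}] {name}: today={today}, trailing mean={mean:.1f}"


# Hypothetical trailing week of daily order counts.
history = [1020, 980, 1010, 995, 1005, 990, 1000]

for today in (1003, 640):
    ok, message = check_metric("daily_orders", history, today)
    print(message)
```

Most days a check like this prints nothing interesting, which matches the point that visible data pointing to nothing new still has value.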

BI and data warehousing infrastructure doesn't replace more targeted and specific data science projects, it complements them.

I guess I was a bit harsh, fair call.

I'll respond this way: basic BI infrastructure is the bedrock of an organization's data capabilities. The best way for an organization to build their initial BI/DS capability is to address some business problem, by building a warehouse. That's in contrast to building a warehouse and then looking for problems to solve with it.

So you're right, thanks for taking time to comment :)

In my experience, there're (at least) two ways of using data in an organization:

1. Top down. Where you start with a problem/decision, and use data to inform it. "What phone plan should we offer our customers?" per the article is top-down, and data science can help inform the answer.

2. Bottom up. Where you start with a bunch of data, and try to brainstorm, "OK what cool things can we do with this?". I worked for an IoT company that collected a bunch of sensor data and we'd run into this all the time. We'd take our best shot and report back to clients, who'd say, "Cool, but what do I do with this?". Not saying you can NEVER come up with something useful, but it's a lot harder.

> However, that assumes that someone presenting an analytical presentation will be viewed more favourably

Well, it certainly isn't helped by data scientists claiming to be better than ANY programmer and ANY statistician. Who could possibly live up to their own hype?

A DS and ML winter will follow just as it did for AI.

Wat? I don't think any DS is claiming to be better than any programmer and statistician. I think the anecdote you refer to is, a DS is better at programming than a statistician and is better at statistics than a programmer. This viewpoint holds up in my experience.
I thought my ire at the term "data science" would have worn out by now, but it hasn't. To me it is an utterly meaningless term whose adoption in itself speaks volumes about the dynamics behind it.

As someone who has been doing "data science," including the programming, to me watching this trend has seemed mostly to be about hype and non-STEM-types, especially in business management and other similar areas, picking up on the importance of quantification.

I can think of two things that seem like legitimately very novel trends in my career in this area: deep learning, whose frameworks were largely abandoned in the preceding decades, and management of very large datasets. The first surprised me, the second I was talking about for years before it happened. The first seems so specialized to me, and to come after the "data science" trend, that the "data science" label seems unnecessary; the second is now usually discussed in terms of "data engineering" which I'm totally cool with.

There's a tendency to somehow suggest that the data science label is justified because statistics is all theoretical and not enough about real-world data, but that's always seemed to me to be a strawman that people erected to justify business hype labels to further their career. What it boils down to is playing off of business management's confusion that "statistics"=census numbers, counts, etc. It ignores the decades of computational statistics that was developing, and the fact that statisticians are forced to deal with data as part of the field.

I wish I could find more of the papers I've read that illustrate the frustrations of statisticians and other scientists with data science. This will probably suffice, although there's more cogent, heartfelt examples: http://magazine.amstat.org/blog/2015/11/01/statnews2015/

It's difficult to describe, but for me personally it goes something like this: for years, you use R, C/C++/Python, Lisp, etc. to solve really difficult stats problems, trying to be careful so as not to do something irresponsible. You've done work on supercomputers, laptops, you name it. Then, all of a sudden, there's an explosion of blogs, etc. talking about R, Mahalanobis distances, and optimization routines as if they were discovered yesterday, by this brand new field of "data science" that's revolutionizing the world. All of a sudden you're treated as unqualified because you don't know Cassandra or Spark, even though you're familiar with a lot of the underlying concepts because you've had to manage large datasets, and because you don't have a comp sci degree.

I don't mean ill will toward the practitioners, but it's difficult to convey what it's like to watch your field get repackaged and resold because of other peoples' misunderstandings about what it's about.

That's fine. How do you describe a software engineer? Someone who codes? Makes APIs and tools? Handles security? Handles servers? Implements UI/UX?

So do you equally think labelling of software engineer is meaningless because it's broad?

Data science envelops many many many different sub-fields and specializations, many of them not involving any science at all, but some of them do involve science (understanding structure through observation and experimentation).

Maybe you don't like us being called "Scientists"? I can go to a journal, read research articles, and point out ones with horrible statistical analysis. Are those authors more of a scientist than I am, because they are arbitrarily in "academia"?

Finally, a dirty little secret is that the more data you have the less statistics you need. I bet even Google knows this, and their data dept. is probably the best academic statistics dept. I've ever met.

Yes. The blogpost is about the organizational difficulties in unlocking the value of technically sound "data science" projects, but these in turn are the tip of an iceberg of "omg watson" on the executive side and "machine learning does well on $archetypal_dataset, it can do anything!" on the techie side.

A while ago there was a Kaggle project to solve certain conjectures on prime number theory. Seriously?

> A while ago there was a Kaggle project to solve certain conjectures on prime number theory. Seriously?

I've seen a surprising number of Kaggle projects setting (or claiming to achieve) objectives that look impossible - things like extracting complex insights from such short signals that they apparently violate the pigeonhole principle.

The worst demonstration was looking at the results of a college class with "do a Kaggle project" as the final task. It was painfully obvious that all of the 'best' results were either extreme overfitting or fake data science (that is, using a strong algorithm to start and getting no gains from training).

Which means that many of the soon-to-graduate students had concluded that good data science meant getting strong results, not producing reliable and novel insights. It felt a bit like a software-centered version of what social psychology has been suffering from.
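A quick sanity check that catches a lot of this is to score every model on held-out data and next to a trivial baseline. The sketch below uses synthetic data where the target is pure noise, so there is nothing real to learn. The data, the 1-nearest-neighbour "model", and the mean baseline are all assumptions chosen to make the failure mode obvious, not a reconstruction of the class projects described above.

```python
import random


def mse(model, data):
    # Mean squared error of a model (a function of x) over (x, y) pairs.
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)
# Synthetic data: y is noise around 5.0, so x carries no signal at all.
data = [(rng.random(), 5.0 + rng.gauss(0, 1)) for _ in range(2000)]
train, test = data[:1000], data[1000:]


def nearest_neighbour(x):
    # "Strong" model that simply memorizes the training set.
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


train_mean = sum(y for _, y in train) / len(train)


def baseline(x):
    # Trivial model: always predict the training mean.
    return train_mean


for name, model in [("1-NN", nearest_neighbour), ("mean baseline", baseline)]:
    print(f"{name}: train MSE={mse(model, train):.3f}, "
          f"test MSE={mse(model, test):.3f}")
```

The memorizing model looks perfect on the data it was trained on and does worse than the trivial baseline on held-out data, which is the gap between "strong results" and reliable ones.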

Got a link to that Kaggle competition?
Here's the link. It was a playground competition (i.e., no rewards) - "This competition challenges you create a machine learning algorithm capable of guessing the next number in an integer sequence. While this sounds like pattern recognition in its most basic form, a quick look at the data will convince you this is anything but basic!"

https://www.kaggle.com/c/integer-sequence-learning

> Well, it certainly isn't helped by data scientists claiming to be better than ANY programmer and ANY statistician.

I don't really think that's much of a thing? I've been working in the field for several years now, and I'd say the majority of my effort when communicating with stakeholders goes into _qualifying_ our capabilities and managing the expectations they're already coming in with.

It's quite a famous quote, I think it was the chief data scientist at LinkedIn who coined it originally.

There is real value in being a "statistical programmer" but that value can't presently be seen past the smoke and mirrors.

I think you're thinking of Josh Wills, Cloudera at the time, now Slack:

https://twitter.com/josh_wills/status/198093512149958656?lan...

> Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

A lot less braggadocious than what you're suggesting, it's just talking about it as a jack-of-all-trades type of job.

"Average" would have been more realistic, and as I say there is value in the role. If DS keeps promising and failing to deliver miracles, it will never be more than a fad. Someone with a job title of "applied mathematician" already does what DS claims to do... the title of "statistical programmer" or "statistical engineer" is a better one than "data scientist".
Yeah, I agree completely, I just don't think that the hype of the role is coming from practitioners.

I know this isn't really your point, but I've met Josh Wills, for example, and listened to many of his talks. I don't think I've met a more realistic guy (among actual practitioners) when it comes to the expectations and reality of doing corporate data science. The hype, I'd say, is just an emergent phenomenon of the tech reporting cycle. Nobody's out there _trying_ to inflate expectations, except a few consultants and "thought leaders" maybe.

"Experiements in data science"

Is the blog's title misspelled?

no?
"Experiments" doesn't have four Es. Unless it's meant to be some sort of pun on "experience".
Yes, I was just having a bit of fun. People cannot even point out spelling errors without hedging their statements. (right?)
I was unsure because the blog is two years old. Two years with a typo in the title is quite a bit.

Maybe it was some kind of subtle joke that I do not get.

I’d say this piece applies outside of data science, too. It’s a nice reminder that technology can lead to culture change, but cannot drive it.
Technology is the defining change of our lifetimes.
You might be surprised at how normal it is to just use technology to do the same thing faster. I'll wager most business people think of computers as glorified typewriters that can also send "memos".
An organization executive is not a stakeholder. At best she is a leader, a formulator, an innovator. At worst she is a parasite. But not a stakeholder.
In the sense that the executive has the power to enable your project for six months and risks losing $500k of stock options if she is wrong I think she is a stakeholder.

I wish I had some steak like that!

This is all very old stuff.

One earlier version was for AI expert systems.

Then there was object request broker architecture.

Such considerations were ubiquitous for the biggie operations research (OR) with optimization, simulation, etc. OR was so big that it was required in B-school programs.

Similarly for management science.

The lessons for how to make applications, as in this OP, were all there in the past. Indeed, operations research (OR) and management science (MS) merged to become OR/MS with a journal Interfaces that talked a lot about the points in the OP.

I went through a lot of that history and discovered lessons much like those in the OP.

> Fundamentally, to be a data driven company, data needs to be part of the internal dialogue spoken by all members.

Okay, let's stop right there! Who the heck, why, where, when did anyone ever say, argue, justify that any company should be "a data driven company"? Maybe a "market driven company", but data driven?

Really, on what kind of company one should have, there is very wide agreement, from a home based business to Wall Street, and that is a money making company!

What turns on the CEO and the BoD is making money!

But not nearly all projects, data science, ..., Taylor's time and motion studies, are directly connected with making money. E.g., when I wrote software to schedule the fleet at FedEx, the main goal was just a schedule, printed out, on paper, with departure times, flight times, arrival times, etc., that would pass expert review as "flyable". Actually, saving money, i.e., optimization, was of much less interest.

> So, to avoid a cargo cult of data, organizations should stop chasing technology and start working with experienced technologists who can apply technology to solve organizational problems.

Yup.

> Executives, to understand how their project relates to company goals, and how success would be reported.

Really, reasonably well experienced problem sponsor executives will ask "Why should I do that?" and need a good answer or won't do it. Sure, one reason to do the project may be just to be playing with the latest buzz words, but most organizations have highly sensitive BS detectors that will be triggered by buzz words.

> With their bosses demanding analytical results, managers will demand analytical results from their peers, and so on, down throughout the subgroup.

Why would bosses be "demanding analytical results"? How many bosses understand good analytical results versus a lot of BS, have an accurate view of the potential of analytical results, could explain why it might be good for results to be analytical, know how to do projects that yield solid analytical results, or see how analytical results could help their careers or the goals of the company? Answer: Only a small fraction. E.g., only recently has Wall Street taken analytical results seriously for trading instead of intuitive, judgment stock picking.

> My reasoning was simple: anyone with data science on their side would be able to prove that their efforts worked better than their peers.

Then? How about the peers feel threatened and mount a gossip and sabotage campaign against the data scientist and their work? The management chain can also feel threatened.

> Basically, I had assumed a data-driven culture exists, when in reality businesses are struggling to create that culture in the first place.

They are not even "struggling to create that culture". It is a fertile, gullible imagination that believes that many organizations believe that they want "a data-driven culture".

> Data science is best viewed as a form of company culture, rather than a set of technologies.

No. Data science is best viewed as a technique, a box of tools, that sometimes can, likely together with other tools and techniques, yield some valuable results.

> I argue that it’s best to spread a data-driven culture from the top of an organization down, by requiring that reports be analytical.

Neither the spreading nor the requiring will work. Only a tiny fraction of the people in the organizations have significant ability with data science, and they will NOT make any such spreading or requiring of something they don't understand possible in the organization.

> Solutions that help measure and improve the performance of a part of the company (“we’ll help you measure marketing ROI”, or “we will introduce predictive maintenance”), will spread and become enduring organizational strengths.

Not really. For "enduring organizational strengths" look to, say, high quality reasoning, writing, and presentations, powerful innovation, high determination, careful attention to the markets and the customers.

For "Solutions that help measure and improve the performance of a part of the company", that will be down somewhere near a good company Web site, good telephone courtesy, keeping lunch breaks under an hour, stopping pilfering, having good computer network management, having good computer security.

Sometimes data science, or just call it applied mathematics, and the rest of math, can mean super big bucks for a company:

Supposedly a big example is the trading software of James Simons's Renaissance Technologies.

IIRC once the CEO of American Airlines said that their subsidiary Sabre for reservations and scheduling was so important he'd sell off all the planes and just keep Sabre.

Likely the old linear programming application of the diet problem is still used effectively (i.e., save big bucks) in feed mixing for livestock, cat food, dog food, etc.
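For readers who have not seen the diet problem, here is a toy two-ingredient version: choose amounts of two feeds to meet minimum nutrient requirements at the lowest cost. Everything in it is hypothetical (the ingredient names, nutrient contents, requirements, and prices), and the solver simply checks every corner of the feasible region, which only works for tiny problems with a known optimum. A real feed-mixing model would use a proper LP solver.

```python
from itertools import combinations


def solve_two_variable_lp(cost, constraints, tol=1e-9):
    """Minimize cost[0]*x + cost[1]*y subject to a*x + b*y >= c for every
    (a, b, c) in constraints, by checking each vertex of the feasible region.
    Returns (value, x, y), or None if no feasible vertex is found."""
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:
            continue  # parallel boundary lines: no unique intersection
        # Cramer's rule for the intersection of the two boundary lines.
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y >= c - tol for a, b, c in constraints):
            value = cost[0] * x + cost[1] * y
            if best is None or value < best[0]:
                best = (value, x, y)
    return best


# Hypothetical feed mix: x = kg of grain, y = kg of meal.
cost = (2, 3)              # price per kg of grain, meal
constraints = [
    (1, 3, 9),             # protein units: 1*x + 3*y >= 9
    (2, 1, 8),             # energy units:  2*x + 1*y >= 8
    (1, 0, 0),             # x >= 0
    (0, 1, 0),             # y >= 0
]

value, grain_kg, meal_kg = solve_two_variable_lp(cost, constraints)
print(f"cheapest mix: {grain_kg:.1f} kg grain, {meal_kg:.1f} kg meal, cost {value:.1f}")
```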

Linear and non-linear programming are likely still pillars of, worth big bucks for, operating an oil refinery.

There may be some big bucks from applying math to ad targeting on Web sites.

For large projects, the old linear programming application of "program (or project) evaluation and review technique, commonly abbreviated PERT, .... PERT was developed primarily to simplify the planning and scheduling of large and complex projects. It was developed for the U.S. Navy ..." Closely related is the "critical path method (CPM)".

https://en.wikipedia.org/wiki/Program_evaluation_and_review_...
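To give a feel for what PERT/CPM computes, here is a small sketch. It uses the standard PERT three-point estimate, (optimistic + 4 * most likely + pessimistic) / 6, for each task's expected duration, and then finds the critical path as the longest chain of dependent tasks. Note that it does this with a plain longest-path traversal of the dependency graph rather than a linear programming formulation. The task names and estimates are hypothetical, and the task graph is assumed to have no cycles.

```python
def pert_expected(optimistic, most_likely, pessimistic):
    # Standard PERT three-point estimate of a task's expected duration.
    return (optimistic + 4 * most_likely + pessimistic) / 6


def critical_path(tasks):
    """tasks maps name -> (duration, [prerequisite names]).
    Returns (project_duration, critical_path_as_list)."""
    finish = {}  # earliest finish time per task
    via = {}     # the prerequisite that determines each task's start

    def earliest_finish(name):
        if name not in finish:
            duration, prereqs = tasks[name]
            start, via[name] = 0, None
            for p in prereqs:
                f = earliest_finish(p)
                if f > start:
                    start, via[name] = f, p
            finish[name] = start + duration
        return finish[name]

    # The project ends when the latest-finishing task ends.
    last = max(tasks, key=earliest_finish)
    path = []
    while last is not None:
        path.append(last)
        last = via[last]
    return finish[path[0]], path[::-1]


# Hypothetical project with (optimistic, most likely, pessimistic) estimates.
tasks = {
    "design": (pert_expected(2, 4, 6), []),
    "build": (pert_expected(4, 6, 14), ["design"]),
    "write_docs": (pert_expected(1, 2, 3), ["design"]),
    "test": (pert_expected(2, 3, 4), ["build"]),
    "release": (pert_expected(1, 1, 1), ["test", "write_docs"]),
}

duration, path = critical_path(tasks)
print("expected duration:", duration)
print("critical path:", " -> ".join(path))
```

Tasks off the critical path (here, writing the docs) have slack, so delaying them a little does not delay the project, while any delay on the critical path does.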