by gaius 3244 days ago
However, that assumes that someone presenting an analytical presentation will be viewed more favourably

Well, it certainly isn't helped by data scientists claiming to be better than ANY programmer and ANY statistician. Who could possibly live up to their own hype?

A DS and ML winter will follow just as it did for AI.


Wat? I don't think any DS is claiming to be better than any programmer and statistician. I think the saying you're referring to is that a DS is better at programming than a statistician and better at statistics than a programmer. This viewpoint holds up in my experience.

I thought my ire at the term "data science" would have worn out by now, but it hasn't. To me it is an utterly meaningless term whose adoption in itself speaks volumes about the dynamics behind it.

As someone who has been doing "data science," including the programming, I've watched this trend, and it has seemed to be mostly about hype and non-STEM types, especially in business management and similar areas, picking up on the importance of quantification.

I can think of two things that seem like legitimately very novel trends in my career in this area: deep learning, whose frameworks were largely abandoned in the preceding decades, and management of very large datasets. The first surprised me, the second I was talking about for years before it happened. The first seems so specialized to me, and to come after the "data science" trend, that the "data science" label seems unnecessary; the second is now usually discussed in terms of "data engineering" which I'm totally cool with.

There's a tendency to somehow suggest that the data science label is justified because statistics is all theoretical and not enough about real-world data, but that's always seemed to me to be a strawman that people erected to justify business hype labels to further their careers. What it boils down to is playing off of business management's confusion that "statistics"=census numbers, counts, etc. It ignores the decades of computational statistics that was developing, and the fact that statisticians are forced to deal with data as part of the field.

I wish I could find more of the papers I've read that illustrate the frustrations of statisticians and other scientists with data science. This will probably suffice, although there are more cogent, heartfelt examples: http://magazine.amstat.org/blog/2015/11/01/statnews2015/

It's difficult to describe, but for me personally it goes something like this: for years, you use R, C/C++/Python, Lisp, etc. to solve really difficult stats problems, trying to be careful not to do something irresponsible. You've done work on supercomputers, laptops, you name it. Then, all of a sudden, there's an explosion of blogs, etc. talking about R, Mahalanobis distances, and optimization routines as if they were discovered yesterday by this brand new field of "data science" that's revolutionizing the world. All of a sudden you're treated as unqualified because you don't know Cassandra or Spark and don't have a comp sci degree, even though you're familiar with a lot of the underlying concepts because you've had to manage large datasets.

I don't mean ill will toward the practitioners, but it's difficult to convey what it's like to watch your field get repackaged and resold because of other people's misunderstandings about what it's about.

That's fine. How do you describe a software engineer? Someone who codes? Makes APIs and tools? Handles security? Handles servers? Implements UI/UX?

So do you equally think labelling of software engineer is meaningless because it's broad?

Data science envelops many many many different sub-fields and specializations, many of them not involving any science at all, but some of them do involve science (understanding structure through observation and experimentation).

Maybe you don't like us being called "Scientists"? I can go to a journal, read research articles, and point out ones with horrible statistical analysis. Are those authors more of a scientist than I am, because they are arbitrarily in "academia"?

Finally, a dirty little secret is that the more data you have the less statistics you need. I bet even Google knows this, and their data dept. is probably the best academic statistics dept. I've ever met.

Yes. The blogpost is about the organizational difficulties in unlocking the value of technically sound "data science" projects, but these in turn are the tip of an iceberg of "omg watson" on the executive side and "machine learning does well on $archetypal_dataset, it can do anything!" on the techie side.

A while ago there was a Kaggle project to solve certain conjectures on prime number theory. Seriously?

> A while ago there was a Kaggle project to solve certain conjectures on prime number theory. Seriously?

I've seen a surprising number of Kaggle projects setting (or claiming to achieve) objectives that look impossible - things like extracting complex insights from such short signals that they apparently violate the pigeonhole principle.

The worst demonstration was looking at the results of a college class with "do a Kaggle project" as the final task. It was painfully obvious that all of the 'best' results were either extreme overfitting or fake data science (that is, using a strong algorithm to start and getting no gains from training).

Which means that many of the soon-to-graduate students had concluded that good data science meant getting strong results, not producing reliable and novel insights. It felt a bit like a software-centered version of what social psychology has been suffering from.

Got a link to that Kaggle competition?

Here's the link. It was a playground competition (i.e., no rewards) - "This competition challenges you create a machine learning algorithm capable of guessing the next number in an integer sequence. While this sounds like pattern recognition in its most basic form, a quick look at the data will convince you this is anything but basic!"

https://www.kaggle.com/c/integer-sequence-learning
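
For what it's worth, here is a minimal sketch of why "guess the next number" is underdetermined in general. It assumes nothing about the competition's actual data or scoring; the prefix and both rules (`next_by_doubling`, `next_by_differences`) are hypothetical illustrations. Two perfectly reasonable rules agree on every observed term and still disagree on the next one.

```python
def next_by_doubling(prefix):
    """Assume each term is twice the previous one."""
    return prefix[-1] * 2


def next_by_differences(prefix):
    """Assume the terms lie on the lowest-degree polynomial through them.

    Builds the finite-difference table and extends it by one step.
    """
    rows = [list(prefix)]
    while len(rows[-1]) > 1:
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    # Extending the table by one column amounts to summing
    # the last entry of every difference row.
    return sum(row[-1] for row in rows)


prefix = [1, 2, 4, 8]
print("doubling rule:   ", next_by_doubling(prefix))     # 16
print("polynomial rule: ", next_by_differences(prefix))  # 15
```

Both rules reproduce 1, 2, 4, 8 exactly, so no amount of fitting on the prefix alone can pick between 16 and 15; the choice has to come from a prior about what kinds of sequences are likely.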

> Well, it certainly isn't helped by data scientists claiming to be better than ANY programmer and ANY statistician.

I don't really think that's much of a thing? I've been working in the field for several years now, and I'd say the majority of my effort when communicating with stakeholders goes into _qualifying_ our capabilities and managing the expectations they're already coming in with.

It's quite a famous quote, I think it was the chief data scientist at LinkedIn who coined it originally.

There is real value in being a "statistical programmer" but that value can't presently be seen past the smoke and mirrors.

I think you're thinking of Josh Wills, Cloudera at the time, now Slack:

https://twitter.com/josh_wills/status/198093512149958656?lan...

> Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

A lot less braggadocious than what you're suggesting; it's just describing a jack-of-all-trades type of job.

"Average" would have been more realistic, and as I say there is value in the role. If DS keeps promising and failing to deliver miracles, it will never be more than a fad. Someone with a job title of "applied mathematician" already does what DS claims to do... the title of "statistical programmer" or "statistical engineer" is a better one than "data scientist".

Yeah, I agree completely, I just don't think that the hype of the role is coming from practitioners.

I know this isn't really your point, but I've met Josh Wills, for example, and listened to many of his talks. I don't think I've met a more realistic guy (among actual practitioners) when it comes to the expectations and reality of doing corporate data science. The hype, I'd say, is just an emergent phenomenon of the tech reporting cycle. Nobody's out there _trying_ to inflate expectations, except a few consultants and "thought leaders" maybe.