Hacker News
by babs474 4389 days ago
In practice I find the bigger problem comes from analysts/actuaries/statisticians who have a disdain for programming, which is sometimes viewed as a task for mere technicians.

Typically your Excel model/analysis has not solved even half the problem of a data science system. It needs to be repeatable, it needs to be open to change (source control!), and it needs to integrate with the wider system.

These things need to be considered up front. There are plenty of reasonable software tools for this. Yes, Hadoop shouldn't be your first step, but taking 5 minutes to put something on a server in EC2 (omg, the cloud) is not unreasonable.

There is a swallowing abyss between Excel and production. That is where data science projects die, and it's a shame.


I've never met a statistician who either uses Excel or has a "disdain for programming". R and MATLAB are basic tools of the trade.

I talk to a lot of people who've had trouble with "data scientists" who are strong in statistics and know some MATLAB or R or something like that, but know nothing about the craftsmanship of programming.

By that I mean skills like using version control, writing software that is maintainable, working with a team that uses project management software, things like that.

A common kind of workflow is that a data scientist develops an algorithm and makes tweaks to it, and that this gets baked into a production system.

If the data scientist throws something over the wall and it takes the developers a few weeks to get it ready for real use, the "real time" productivity of the team is going to be awful. The closer we come to the data scientist checking the changes in and that's that, the more valuable the data scientist is.

This is absolutely a fair comment: they are coders but not software engineers, and it's the same problem that has permeated bioinformatics for the last decade or so. (As an aside, it's fun hearing grand claims about data science revolutionising medicine in 10 years [0], when the same claims were made about bioinformatics 10 years ago.)

[0] https://twitter.com/HanChenNZ/status/473825783874859008

R and MATLAB are better, but those tools also have issues integrating into production, depending on what you are doing. It's not so much about the exact tool you use as about having a little forethought about how your creation is going to interact with a production system.

A lot of people feel programming is undervalued in academia. For instance, Hadley Wickham, creator of ggplot2, probably hasn't gotten the recognition he deserves. With a prevailing attitude such as that, is it any wonder academic code has such a poor reputation?

Wickham notes that he thinks the tide is changing. I agree that it is, as part of the data science phenomenon. As part of the change you are going to see a few more MacBooks, some cloud servers, maybe a guy with glasses talking about version control and software design. It is not all garbage, and I hope you keep an open mind.

Q: Do you feel that the academic culture has caught up with and supports non-traditional academic contributions (e.g. R packages instead of papers)?

A: It’s hard to tell. I think it’s getting better, but it’s still hard to get recognition that software development is an intellectual activity in the same way that developing a new mathematical theorem is. [1]

[1] http://simplystatistics.org/2012/05/11/ha/

Integrating in production is a huge biggie. I hope to be spending a lot of my time this summer sharing with / educating folks about some tech I've built to make putting interesting analytics into production easier.

I use MATLAB.

Then I use Excel Link to send everything in MATLAB to Excel.