by mrwebmaster 2640 days ago
There are some discussions on whether data scientists are going to be replaced by automatic tools in the near future.

Can this be considered an example of a tool that partially replaces the work done by a data scientist? At least it can save a lot of time.

5 comments

Whenever anyone asks this, I always wonder whether I live in a bubble or they do.

Creating simple predictive models where your problem is already easily narrowed down to a "given x predict y" definition is pretty trivial. Having it automated is nice, but not exactly a hard thing to do.

Genuine question: how many people have jobs where those kinds of problems form any significant part of their workload?

I also often see a response to this sentiment along the lines of, "Yeah, but there's also data cleaning..." etc. My reaction to this is mixed. I mean, sure, there is also data cleaning involved, but is this really where people spend most of their time?

My team spends most of our time doing the following:

1. Formulating problems. Figuring out the various different ways that a real-world problem can be expressed mathematically and feasibly attacked computationally.

2. Engineering software to implement the solutions to these problems, sometimes using some of the (amazing) frameworks out there for ML or probabilistic programming, but often having to develop our own approaches from scratch.

3. Doing all the management, stakeholder relationship stuff, business cases, etc. that make your work relevant and possible.

4. Getting data. Always an issue.

I'm genuinely curious here: are we total snowflakes, while most data scientists spend their time cleaning data and building "given x predict y" models?

How many business analysts/low level coders have jobs because they just implement the same repeated CRUD screens/wireframes or maintain WordPress themes? Not the same as data science, but close.

I think it's possible that many people's "cleaning data" has some overlap with your "Getting data".

I know for me I've had things like a bunch of scanned images of tables as "data". Turning that into something useful took a lot of time.

Whether this is "getting data" or "cleaning data" depends on perspectives and definitions.

Predictive modeling will be nearly automated (except in cases where manual feature engineering helps).

Data science will focus more attention on solution finding, data gathering, cleaning, ETL, and business.

I worked at a startup whose first service was "upload CSV, and we automatically generate interesting charts". That's not trivial, but it's not exactly rocket science, either. The main trouble we had is that in order for this to work well (and it makes for a terrific demo), you need to start with a great CSV file. The average CSV file you find on the web isn't.

CSV is about the barest amount of specification in a file format. It's common to run across files which are in some weirdo encoding you can't easily detect, or are a mix of multiple encodings, or a mix of line endings, or which should be treated as case-insensitive (or only for some columns), or which have weird number formatting (or units, and not the same units in every row), or typos and spelling errors, or it came from an OCR'd PDF and there's "page 2" right in the middle of it, or they tried to combine multiple files together so there are multiple headers scattered throughout the file (or none at all), or the top has different columns from the bottom, or it uses quoting differently (obviously not per the RFC), or it's assumed that "nil"/"NULL"/""/"-"/"0" are the same, or ...

In short, data (which hasn't been cleaned by hand) sucks, and CSV doubly so. If you want to put your AI/ML smarts to work, write a program to take a shitty CSV file (or even better, a shitty PDF file!) and generate good clean data, plus a description of its schema. That would be an amazing tool.

So far, OpenRefine is the nicest tool for this that I've seen. Figure out how to make it fully automatic, and everybody with piles of raw data (governments) will beat a path to your door.
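
To make a few of the CSV problems described above concrete, here is a minimal sketch in Python using only the standard library. It is not how OpenRefine or any startup mentioned here works; the function name `clean_csv_text` and the `NULL_TOKENS` set are made up for illustration, and the rules (what counts as a missing value, dropping rows whose width does not match the header) are assumptions you would have to tune per file.

```python
import csv
import io

# Assumption: tokens treated as "missing". "0" is deliberately left out,
# because treating zero as missing silently destroys real values.
NULL_TOKENS = {"", "nil", "null", "-", "n/a"}

def clean_csv_text(raw: str) -> list[dict]:
    """Best-effort cleanup of messy CSV text (hypothetical helper)."""
    # Normalize mixed line endings (\r\n, bare \r) to \n.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Parse, skipping rows that are empty or all whitespace.
    reader = csv.reader(io.StringIO(text))
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if not rows:
        return []

    # Assumption: the first non-empty row is the header.
    header = [h.strip() for h in rows[0]]
    header_key = [h.lower() for h in header]

    cleaned = []
    for row in rows[1:]:
        cells = [c.strip() for c in row]
        # Skip headers repeated mid-file (e.g. from concatenated files).
        if [c.lower() for c in cells] == header_key:
            continue
        # Skip rows with the wrong width (e.g. a stray "page 2" line).
        if len(cells) != len(header):
            continue
        # Map the assumed null tokens to None, keep everything else as text.
        cleaned.append(
            {h: (None if c.lower() in NULL_TOKENS else c)
             for h, c in zip(header, cells)}
        )
    return cleaned
```

Even this toy version has to guess: a real tool would want to report the rows it dropped rather than discard them silently, and it does nothing about encodings, units, or typos.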

What was the name of that chart startup?

I have been thinking about making an OpenRefine-type tool for Python. Every time I do data cleaning in Python, it feels so repetitive.

In my mind, that is a hard question to assess. Hopefully, more tools will exist for automating data cleaning and ETL, such as handling third-party data schemas and integrations, data errors, and so forth. I don't think this means data science jobs will be eliminated. Rather, one data scientist should be able to handle integrating more data sources and investigating a larger number of models. New and novel data will continuously present itself though, and if the total volume of data continues to grow at its current rate, these tools might just allow a data scientist to keep pace with the growth of data. Hopefully though, progress allows everyone to focus on higher-level tasks rather than cleaning data and building pipelines.

Modeling is just a small part of data science (the percentage of time I've spent modeling as a data scientist is in the single digits).

Automating modeling is a bit easier than automating the other parts, though.