by laingc 2640 days ago
Whenever anyone asks this, I always wonder whether I live in a bubble or they do.

Creating simple predictive models where your problem is already easily narrowed down to a "given x predict y" definition is pretty trivial. Automating that is nice, but it's not exactly a hard thing to do.

Genuine question: how many people have jobs where those kinds of problems form any significant part of their workload?

I also often see a response to this sentiment along the lines of, "Yeah, but there's also data cleaning..." etc. My reaction to this is mixed. I mean, sure, there is also data cleaning involved, but is this really where people spend most of their time?

My team spends most of our time doing the following:

1. Formulating problems. Figuring out the various ways that a real-world problem can be expressed mathematically and feasibly attacked computationally.

2. Engineering software to implement the solutions to these problems, sometimes using some of the (amazing) frameworks out there for ML or probabilistic programming, but often having to develop our own approaches from scratch.

3. Doing all the management, stakeholder relationship stuff, business cases, etc. that make your work relevant and possible.

4. Getting data. Always an issue.

I'm very genuine in my curiosity here: are we total snowflakes, and do most data scientists spend their time cleaning data and building "given x predict y" models?

2 comments

How many business analysts/low-level coders have jobs because they just implement the same repeated CRUD screens/wireframes or maintain WordPress themes? Not the same as data science, but close.

I think it's possible that many people's "cleaning data" has some overlap with your "Getting data".

I know for me I've had things like a bunch of scanned images of tables as "data". Turning that into something useful took a lot of time.

Whether this is "getting data" or "cleaning data" depends on perspectives and definitions.