by proverbialbunny 2259 days ago
>You may spend a minimal amount of time doing the “fun” parts that data scientists think of: complex statistics, machine learning and experimentation with tangible results.

I don't get why people consider building a model to be the "fun" part. That's mostly feeding data in, watching a loading screen, and then observing the output.

That's not fun, that's boring. The fun part is looking at the data and gleaning all these potential patterns from it, seeing what potential is there and what could be. Likewise, learning the business side and seeing what is possible that no one has considered is great fun too.

My favorite part is feature engineering. Pre-processing and cleaning is fun too, but morphing the data into formats that extract a diamond from coal is a lot of fun, and what data science is all about. Clicking go on some ML algo is just icing on the cake, seeing it reveal bits maybe even I overlooked in the data.

If you like ML, why not be an MLE? That's what MLEs do, and it's a more desirable job. DS is all about the research, discovering and learning new information, and making the impossible possible.

4 comments

The standard whatever.fit(X, y) isn't very appealing but there are much more bespoke models that require creative engagement with stats/CS knowledge, e.g. Bayesian hierarchical models or deep learning models that are more complicated than what can be copy/pasted from Medium.
I've done a lot of ensemble and stacked ensemble learning. I've also used BERT and a couple of other advanced ML models, but usually I resort to advanced feature engineering first if I can. So I get what you mean, but it's still not as fun to me as figuring out patterns in data.
It's sort of two-sided, I think. It can be fun to figure out _meaningful_ patterns in data. I don't really find it fun to figure out that "so and so didn't use software that understood NA values back in nineteen tickety two, so some NA values are NA because they're newer, and some NA values are 0 because 0 is just like NULL in somebody's head, and some NA values are -999 because that was a thing they did in the Before Times."
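For anyone who hasn't had the pleasure, here is a rough sketch of what that cleanup tends to look like. Everything in it is made up for illustration: the `-999` sentinel, the `zero_means_missing` switch, and the `normalize_missing` helper are assumptions, not anybody's real schema. In practice, whether 0 means "missing" usually differs per column and per era of the data, which is exactly the tedious part.

```python
import math

# Hypothetical legacy sentinel codes; the real ones depend on the dataset.
LEGACY_SENTINELS = {-999}

# Text spellings of "missing" that this sketch assumes may show up.
MISSING_STRINGS = {"", "NA", "N/A", "NULL"}


def normalize_missing(value, zero_means_missing=False):
    """Return None for anything that stands in for a missing value."""
    if value is None:
        return None
    if isinstance(value, str):
        # Treat common text spellings of "missing" as missing.
        return None if value.strip().upper() in MISSING_STRINGS else value
    if isinstance(value, float) and math.isnan(value):
        return None
    if value in LEGACY_SENTINELS:
        return None
    # Only treat 0 as missing when told to; often 0 is a real measurement.
    if zero_means_missing and value == 0:
        return None
    return value


readings = [12.5, -999, 0, "NA", None, float("nan"), 7]
cleaned = [normalize_missing(v, zero_means_missing=True) for v in readings]
print(cleaned)  # [12.5, None, None, None, None, None, 7]
```

The flag is there because the same column can need different rules depending on when the rows were recorded, so you end up calling this with different settings per slice of the data.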
MLE is a fairly new title that, as best I can tell, exists primarily in those few places that have a mature enough workflow to have people who can actually dedicate their time to the ML part and have other roles take care of the rest.

Everywhere else, there is only DS, and it involves everything.

To answer your first question though, the training and testing of these models is fun because it feels like a puzzle game: did all my understanding and preparation of the data (and the business) pay off, and does the model do its job as expected? Is there something I’m missing? What’s the simplest model + configuration I can use that produces acceptable results, and what does that say about the problem space? Can I combine models in some way to get the results? Is nothing working because it’s an ultimately fruitless exercise and our hypothesis is wrong? Or is there something we’re missing that is in turn the reason the model is missing something? Etc etc.

Then as the output you get something that ingests some data and then makes a decision with it! That’s cool to me.

I get where you're coming from. I guess with the problem domain I'm in and my experience level, I tend to get what I expect from a model, and if I don't, I'm more like, "wtf?", which isn't anywhere near as fun a way to do that part of the process.

Also, I know what is possible and impossible before I start writing code (if you don't count EDA code). There are exceptions, like it should be possible but it turns out the data is bad, but it didn't look bad from the EDA. Thankfully I've never had that. I always perform a Feasibility Assessment before anything else.

Not to imply what you're doing is somehow incorrect. Problems can vary quite a bit and I recognize that. For example, there have been times where I've had to mine to see if anything is there, doing ML over it to validate a hypothesis, then using that information to create a new hypothesis, rinse and repeat. That's scary, because I could turn up nothing. I admit I haven't done a lot of mining, though. Usually my problems are much more obvious from the get-go, or much more research intensive.

One time I did three months of reading papers on arxiv.org just to figure out if something was feasible and how to best do it. Though that was definitely not a standard problem.

> That's not fun, that's boring. The fun part is looking at the data and gleaning all these potential patterns from it, seeing what potential is there and what could be

Exactly! This is the reason why I love my job. It gets even better when you uncover a non-intuitive insight.

Can you please elaborate on the feature engineering part a little bit?