Hacker News
by sidlls 2422 days ago
"Obfuscation" and "delusions of grandeur" are practically synonyms for ML and Data "Science" in this industry. I've been around for a while and I've never quite seen something as over-hyped and hyper-glamorized as these two specializations.

Calm down. Machine learning is a part of software engineering, like multiprocessing, computer graphics or network protocols. It is here to stay. It is part of a palette of algorithms with which one can build software.
Your comment is absolutely correct but further points out just how far data science has strayed from any meaningful work. The issue is that a huge number of "data scientists" have limited programming ability and nearly zero engineering sense.

A perfect example of this is the trend in most places I've seen where data scientists strive to increase the complexity of their model (so they can prove how "smart" they are). A huge part of a software engineering education (whether in the classroom or in a dev shop) is learning that complexity is the enemy. No engineer would choose a 3-layer MLP over a simple linear regression for an imperceptible improvement in performance.
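To make the "simple linear regression" baseline concrete, here is a minimal sketch in plain Python. The data points are made up for illustration, and `fit_line` and `mse` are hypothetical helper names, not anything from the thread. The idea is that a baseline like this takes a few lines to fit and gives an error figure that any more complex model would have to clearly beat.

```python
from statistics import mean

def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    return mean((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data: roughly y = 2x + 1 with a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0]

slope, intercept = fit_line(xs, ys)
baseline_error = mse(xs, ys, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} mse={baseline_error:.4f}")
```

In practice the error would be measured on held-out data rather than the training points; the sketch skips that to stay short.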

The additional irony of all this is that a decade+ ago a software engineer who had strong quantitative and numeric programming skills was rare and an elite find. You would have thought that the data science boom would have dramatically increased the number of these people but I find them even rarer.

What are these data scientists? Most statisticians I know would just use the linear regression unless they needed a neural network for marketing purposes. Statisticians will spend years studying linear regression and its variations in graduate school. I thought it was the CS guys who would be more fascinated with neural networks.
Well, there are some fairly distinct camps forming in data science. You are correct that those coming from a statistics background would generally prefer simpler, more parsimonious models. There is a not-insignificant group that seems to be coming into the field via other channels (CS, boot camps, self-teaching, etc.) who view the field of statistics as a bit of a dinosaur and the statistician mindset as backwards. To them, simpler models aren't a good thing, they are a bad thing. Any amount of increased complexity is worth even a small amount of improvement in performance.

I think some of this is exacerbated by modern pillars of machine learning and data science. Competition sites like Kaggle are entirely based on maximizing test set accuracy, and so winning submissions these days are huge morasses of ensemble methods that are trained for days and weeks on GPUs, but in the end they are often only marginally better than some of the fairly basic standard approaches. And when companies like Google are building their bots for Go or Starcraft, they are using cutting edge techniques. When people see that and get inspired to get into data science, that's what they want to do, even though the majority of problems are more rooted in data quality, thoughtful understanding of the problem, and more rudimentary methods.

It's also the result of some of the rhetoric of important figures in the field. Yann LeCun has pushed back strongly in the past on criticisms of modern-day machine learning's occasional lack of concern with introspection and model understanding. Judea Pearl, a Turing award winner for his work in machine learning, devotes large portions of his pop-sci The Book of Why to attacking the field of statistics on the whole, as well as engaging in multiple attacks on historical influencers in the field with such ferocity it borders on character assassination. He has even rebuffed modern critics, such as the very widely respected Andrew Gelman, by saying they are "lacking courage" by failing to accept his "revolutionary" causal inference methods over the traditional ones used in statistics.

The attitude is driven a lot by the people and institutions at the top, and as someone in the field, I unfortunately encounter this kind of thinking way too often.

Thanks for sharing your expertise. It was very interesting to hear your perspective.
Yes! It's one tool in the software engineering toolbox! It's a great tool for some problems!

Due to the hype it becomes a goal in some organizations however. "We need to do machine learning because we have big data" or some such. Doesn't matter if the problem could've been solved in 5% of the time and cost with 20 lines of code, thou shalt use machine learning.

It doesn't help that data scientists (creating and training the ML model) and software developers (creating and maintaining the software) usually come from different backgrounds, requiring a "data engineer" as an additional intermediary.

It's always a problem with hype. Blockchain (or Merkle trees) has the same problem but worse, because the problems it solves well are rarer and narrower.

To me, it seems to be larger than one tool. I think of it as a color in a palette, with which one can paint software. Octarine.

To put this statement into context, I'm speaking as someone who has been writing code in C since the era of the PC XT. Perhaps NIPS 2010 was a rite of passage to ML for me. There is a screen full of industry-grade C++ and PyTorch in front of me right now...

ML can be useful, but it is getting too much attention. Far more hype than the value it actually provides in many domains, IMHO.

Yes, I know that there are folks that deal with vast amounts of data with inscrutable relationships where you need fancy algorithms to make progress. But seriously, most problems just don't need it, and many folks would be better off with mastering basic statistics and data analysis.

It's fascinating how far you can get with basic stuff. My favorite? Statistics for Experimenters, by George E. Box. It's like a secret weapon! https://www.amazon.com/Statistics-Experimenters-Design-Innov...

Heh, given that I am starting to see more and more companies that offer ML engineers $2-6k/month (before tax), it's starting to resemble the gaming industry in all its negative characteristics instead.
I cannot tell from your comment whether 2-6k/month before tax should be considered a lot or a little. I think in the major tech centers that 2-6k/month is quite low for anyone with significant experience (>5 yrs). Do you disagree?
They used the word negative.
Since he compared to gaming I think he's saying it's low
Does "blockchain" get an honourable mention?
Definitely.
Really?

Were you around during the dotcom era?

Although I'm not old enough, I've heard that OR in the 80s was the same crap.

Nobody talks about operations research today. But techniques that fell under that umbrella, like ARIMA and linear programming, are still widely used and aren’t going anywhere. (And it’s not without some irony that automated bulk time series forecasting is now sold as AI.)
It's funny but at my last company, one of our systems used some linear programming to generate a model of physical processes.

The problem could have been tackled with greater accuracy using machine learning, but it would have taken a long time for the system to generate enough data points for a sound model and would have required more storage space. This was also complicated by the fact that the model had to be regenerated whenever the physical system being modeled was changed.

The linear programming solution was a lot cheaper and was "close enough" to serve as a useful approximation.

Linear and quadratic programming are amazing and totally underappreciated. Often they are the fastest way to get useful answers for problems (the solvers got really good over the past few decades).
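For anyone who hasn't seen one, a linear program is just "maximize a linear objective subject to linear inequalities". Below is a minimal sketch in plain Python for the two-variable case. It brute-forces the corner points of the feasible region, which is an illustration only and not how production solvers (simplex, interior point) work. The function name `solve_lp_2d` and the toy numbers are assumptions chosen for the example, and the approach assumes the feasible region is bounded.

```python
from itertools import combinations

def solve_lp_2d(c, constraints, eps=1e-9):
    """Maximize c[0]*x + c[1]*y subject to a*x + b*y <= r for each (a, b, r).

    Assumes a bounded feasible region: the optimum then lies at a vertex,
    so we intersect every pair of constraint boundaries and keep the best
    feasible intersection. Returns (value, (x, y)), or None if no
    feasible vertex exists.
    """
    best = None
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < eps:
            continue  # parallel boundaries have no single intersection point
        # Cramer's rule for the 2x2 system of the two boundary lines.
        x = (r1 * b2 - r2 * b1) / det
        y = (a1 * r2 - a2 * r1) / det
        if all(a * x + b * y <= r + eps for a, b, r in constraints):
            value = c[0] * x + c[1] * y
            if best is None or value > best[0]:
                best = (value, (x, y))
    return best

# Hypothetical toy problem: maximize 3x + 5y subject to
#   x <= 4, 2y <= 12, 3x + 2y <= 18, x >= 0, y >= 0.
constraints = [
    (1, 0, 4),
    (0, 2, 12),
    (3, 2, 18),
    (-1, 0, 0),  # x >= 0 written as -x <= 0
    (0, -1, 0),  # y >= 0 written as -y <= 0
]
best = solve_lp_2d((3, 5), constraints)
print(best)
```

For anything beyond a toy like this, a real solver is the right tool; the sketch only shows what kind of problem is being solved.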
What's OR?

https://en.wikipedia.org/wiki/Operations_research

Basically a mathematical approach to problems of logistics and scheduling, first developed in WW2. Very powerful in the domains for which it was developed but less generally applicable than enthusiasts hoped, leading to the usual “hype cycle”.

If you have a problem OR could solve, or just want to fool around with it, PuLP is very easy to use: https://pythonhosted.org/PuLP/ Of course, the ease of use means that it is a commodity skill now.

There is also Google OR tools.

https://developers.google.com/optimization