by Daishiman 4067 days ago
This is suspiciously close to "data science" and "machine learning" experts.

Can't we just be honest and say that most of these are applied statistics jobs with a specialty in large volumes of data? Or is "statistics" just not fashionable enough nowadays?

11 comments

Perhaps more precisely, they're "statistical engineering" jobs. A machine learning PhD can derive an algorithm and provide you with a reassuring bound or guarantee regarding performance in terms of runtime, convergence, etc. They have to be able to understand not just the volume of the data but also how to trade off accuracy for speed, and myriad other constraints.

IMO, the "data science" label is too broad to properly differentiate statistical engineers. A fine definition for a data scientist is someone who runs experiments on user/company data and can assess the results. It's important work, but you don't need a PhD in stats or ML to do basic hypothesis testing.

You could simply call them "machine learning" experts, but that could be a bit too academic. People who are focused narrowly on theory or niche areas may be experts in ML, but they may also never do anything outside of running MATLAB simulations. It's unlikely that those people will make very good statistical engineers since they may never have had to think about the challenges involved in scaling algorithms.

There is quite a big difference between statistics and machine learning. A lot of the most successful machine learning algorithms do not have a statistical grounding or did not when they were invented. E.g. neural networks, SVM, low rank matrix approximation, k-means, decision trees/forests. Statistics is one of the tools in the machine learning toolbox.
SVMs were invented by a couple of statisticians/mathematicians in the 60s. k-means also harkens back to the 60s, by mathematicians and control theorists. Decision trees and random forests were invented by a famous statistician, with the latter related to bootstrapping, a statistical technique. PCA and factor analysis, forms of or closely related to low rank matrix approximation, were pioneered in the early 1900s, by some of the most famous statisticians ever.
Something that was invented by a statistician is not necessarily statistics, and that certainly applies even more to something invented by a mathematician. I guess with a broad enough notion of statistics some of these would fall in the field of statistics, but if something does not use at least one probability distribution it's probably far-fetched to classify it as statistics.

It would be a lot more fair to classify machine learning as a subfield of convex optimization. Yet even that classification does not quite fit, so it makes most sense to just accept that it's a separate field which uses techniques from statistics, convex optimization, computer science, and more.

But, to look at one example: neural networks. Neural networks may have been inspired by attempts to recreate the structure of the biological nervous system, but the way in which they are used commonly, e.g. "learning" via back-propagation, is really just a statistical regression for a gigantic equation with many free variables.
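A minimal sketch of that regression claim, under heavy simplifying assumptions: a single linear "neuron" with no hidden layers, trained by gradient descent on squared error, lands on the same slope and intercept as the textbook least-squares formula. The function names and the synthetic data are made up for illustration, and a real multi-layer network is nonlinear and has no closed-form solution to compare against.

```python
import random

def fit_neuron(xs, ys, lr=0.1, epochs=2000):
    """Train y ~ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def fit_least_squares(xs, ys):
    """Closed-form simple linear regression (ordinary least squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic data (an assumption for the demo): y = 3x + 0.5 plus noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [3.0 * x + 0.5 + rng.gauss(0, 0.1) for x in xs]

w_gd, b_gd = fit_neuron(xs, ys)
w_ls, b_ls = fit_least_squares(xs, ys)
print(f"gradient descent: w={w_gd:.4f}, b={b_gd:.4f}")
print(f"least squares:    w={w_ls:.4f}, b={b_ls:.4f}")
```

Both fits minimize the same squared-error objective, so they agree up to numerical precision; the only difference is how the minimum is found.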

My preferred term is "predictive analytics," which I feel kind of straddles statistics and machine learning, and also serves as a nod to a common difference -- "statistical" methods often yield understanding, while "machine learning" methods are often opaque to human insight but yield predictions.

I feel annoyed with the opaqueness of ML algorithms like neural networks. I hope ML doesn't unwittingly define itself as a field where machines learn, but humans may not learn. I'm referring to predicaments like the story about 42 from The Hitchhiker's Guide to the Galaxy.
That's definitely an interesting problem. Just note that in many cases we're not even interested in learning the tasks. For example, you don't need any person to actually know that consumers aged 25-29 years old prefer a certain product 10% more than consumers aged 21-25, and so on.

But humans are still the ones responsible for important high level decisions, so it still makes sense to maximize information transparency to enable good decisions in those contexts.

A neural network that gives a prediction 'X is most likely' and could answer the question "Why?" with 'Because Y' would be amazing.

Just because you're using statistics doesn't mean that you are a statistician. For example, most of particle physics is based on statistics, but this is not enough to motivate them to rename the field. Whole fields of engineering use statistics every day; machine learning is just another.
These subjects (plural) are all plagued by the same problem: definitions of terms. One man's intelligence is another man's dire stupidity, and so on and on it goes, chasing its tail.

The most value I got from this article was in the realization that, every few years or so, the academic globes align well enough (some paper du jour becomes well-read I suppose) that .. for a brief instant .. terms are defined well enough, and gain enough agreement, that progress is made .. which progress attracts more eyeballs, who tend to want to break off a chunk for themselves, and the terms begin to differ again and we have a whole new 'sub-sub-sub-' variety of the subject.

So it's all about globes aligning, basically. I will now go off and implement an AI technique based entirely on the description of globes, alignment, and little chunks breaking off every now and then .. see you at the top of the AI heap in a year or ten.

Your comment might mistake "data science" as being more than a neologism for describing something that hasn't yet taken enough concrete shapes to be clearly defined.

I think many would agree that "machine learning" and/or "deep learning" are at least cornerstones for "artificial intelligence". After all, nobody singularly defines intelligence.

I imagine the "AI experts" will be paid significantly more than data scientists and machine learning experts, just like "software engineers" are paid more than "software developers" and "programmers".
I often heard the Big Data guys hype that there's no sampling in Big Data, you have the whole data, so it's not exactly statistics.
I've heard this too and it's a great way to demonstrate you don't really know what statistics is :)

Statistics is not (just) opinion polling, there's a lot more to it than estimating observable properties of a population.

If you're trying to make decisions, predictions or estimates which involve any uncertainty at all (and in my experience big data almost always does), then it's definitely within the purview of statistics even if you have data for the whole population.

Sources of uncertainty include trying to say anything at all about the future (do you have data on the future population? no didn't think so...), trying to make predictions which generalise to new data in general, trying to uncover underlying trends or patterns behind the data you see which aren't directly or fully observed.

Often people expect big data to be able to answer big numbers of questions, estimate big numbers of quantities, or fit big, powerful predictive models with lots of parameters. In these cases statistics can be particularly important to avoid reporting false positives and to make sure you can quantify how certain you are about your results and your predictions. (Amongst other reasons).
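To make the false-positive point concrete, here is a small simulation sketch. Every dataset below is pure noise, so any "significant" result is a false positive by construction. The test count, sample size, and the choice of a two-sided z-test with a Bonferroni correction are all assumptions picked for illustration, not anything prescribed above.

```python
import math
import random

def two_sided_p_value(sample):
    """z-test of 'mean == 0' for data with known unit variance."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    # P(|Z| > |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(42)
num_tests = 1000   # number of questions asked of the data
alpha = 0.05       # per-test significance level

# Every null hypothesis is true here: each sample is standard normal noise.
p_values = [
    two_sided_p_value([rng.gauss(0, 1) for _ in range(50)])
    for _ in range(num_tests)
]

uncorrected = sum(p < alpha for p in p_values)
# Bonferroni: divide the threshold by the number of tests performed.
bonferroni = sum(p < alpha / num_tests for p in p_values)

print(f"'discoveries' at p < {alpha}: {uncorrected} of {num_tests}")
print(f"'discoveries' after Bonferroni correction: {bonferroni}")
```

With no real effects anywhere, roughly 5% of the uncorrected tests still come out "significant", which is exactly the trap of asking a big number of questions. Bonferroni is the bluntest possible fix; less conservative procedures exist.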

Not to mention: having all the data, and comprehending all the rows on an individual level, are two very different things. Doubly so if the data is irregular (I'm currently doing fuzzy matching on really mangled street address data. ICK).

Once you hit millions of rows, it's not humanly possible to survey the data. All you can do is make assertions about the data's structure / buckets it will fall into. You then try to disprove that assertion, or establish an error bound on it. You will never see all the data, only the results of assumptions you've made about it.

The refined pieces of information that people can look at to make decisions are called "statistics".
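One way to sketch the "make an assertion, then put an error bound on it" workflow: check the assertion on a random sample and report a confidence interval for the violation rate. Everything here is hypothetical, including the `looks_like_zip` check, the synthetic column, and the 2% corruption rate; the Wilson score interval is just one reasonable choice of bound.

```python
import math
import random

def wilson_interval(failures, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def looks_like_zip(value):
    """Hypothetical assertion: the field is a 5-digit ZIP code."""
    return len(value) == 5 and value.isdigit()

rng = random.Random(1)
# Synthetic stand-in for a large, partly mangled column (about 2% bad rows).
rows = ["12345" if rng.random() > 0.02 else "1234X" for _ in range(200_000)]

# Inspect a random sample instead of every row.
sample = rng.sample(rows, 2000)
failures = sum(not looks_like_zip(v) for v in sample)
low, high = wilson_interval(failures, len(sample))

print(f"violations in sample: {failures} of {len(sample)}")
print(f"estimated violation rate: between {low:.2%} and {high:.2%}")
```

The interval is the "statistic" a person can actually look at: nobody reads 200,000 rows, but anyone can read "between roughly 1% and 3% of rows break the assumption".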
Presumably you want to draw an inference of some sort from the data. Otherwise what's the point of even looking at it?
From my distant memory, if your sample size is the pollution it's still statistics.
Population I of course meant to say !
you mean Pig Data?
What's become very clear in the past ~5 years is that we're seeing the emergence of a new field, very distinct from statistics. The closest equivalent of the new machine learning field is electrical engineering, which has now heavily shrunk. Indeed, many former EEs have made a natural transition into this new field.

The new machine learning is about building layers of components on top of each other, very much like circuits seen in EE. The "circuit" components being used are no longer well defined mathematical pieces built from the bottom up using ideal assumptions, but less well understood, somewhat black-box newer components that were built from the top down. Far more like a type of engineering than a type of statistics.

If you haven't been seeing all the latest arXiv papers, you're really missing out. It's evolved to look sharply different than statistics now.

As mentioned in the article, AI is the broader field that encompasses Machine Learning, and to a large extent also Data science (and Computer vision, NLP, Pattern recognition, etc.). And while data science might utilize a lot of statistical techniques, it is a huge stretch to consider the whole AI field to be 'statistics'.

In general, AI borrows many more techniques from mathematics than it does from statistics. However, the field of AI has been quite established since the 1960s, and many techniques have been developed within that field as AI techniques. It's more about being accurate than about being fashionable, as AI simply isn't 'just' statistics.

I believe the OP's point is that the demand is for applied statisticians and not for AI experts (in the sense exactly as defined by you).
The article is about tech firms and universities stocking up on research centers of AI experts, with the claim in its title that there is a high demand for those AI experts.

There might also be a demand for applied statisticians, but that doesn't make AI experts statisticians. I understand the confusion, as the term AI is often misused, but when you see the names mentioned in the article it's clear they're talking about actual AI researchers.

I think there are two levels. On the one hand, many firms need big data experts who can reason statistically and apply machine learning techniques to their domain. This started out 15 years ago with things like predicting shopping cart basket items.

On the other hand, the big tech companies are investing heavily in Deep Learning for things like NLP, Speech, Vision, Siri, and wherever else these neural net approaches may work.

> Can't we just be honest and say that most of these are applied statistics jobs with a specialty in large volumes of data?

But isn't this the approach Nature herself is taking?

The knowledge engine you carry in your head spends years just "learning" the world - which means, it absorbs huge amounts of input, sorting the good stuff from the bad. It "knows" what works simply because that stuff happens more often; it "knows" what doesn't work because that stuff doesn't happen very often.

And sure there are higher layers of integration there, but the whole process is strongly supported by a statistical approach.

Fully agree. I mostly lost interest in AI because modern AI goes toward statistics. Unfortunately, symbolic AI doesn't get much traction nowadays, and that is understandable.
Did you take notice of https://news.ycombinator.com/item?id=9432601 a while ago?