Hacker News
by paganel 3259 days ago
> The quality of the algo and I assume the deep learning model lies in the quality (breadth and depth) of the data, and how honest with himself the person choose to model it.

I've only dabbled with machine learning here and there for the past 10 years or so, but if there's one thing I've learned so far is that the data behind your ML code (and the way it is structured) is responsible for almost all the success or failure of any given ML algorithm. I have a younger colleague at work who I've started tutoring, and he seems really interested in doing ML work (maybe because of all of the recent hype).

I've tried to emphasize to him several times that ML algorithms come and go and that he should focus a lot of his time on the data itself (from where he intends to collect it? how is it structured? is it reliable? is it "enough"? etc), but it looks that my data-related advice falls on deaf ears every time, he's only interested in me pointing to him the latest cool ML algorithm. I guess he'll live and learn, so to speak.

6 comments

> I've learned so far is that the data behind your ML code (and the way it is structured) is responsible for almost all the success or failure of any given ML algorithm

Data is indeed a necessary condition but certainly not a sufficient one. You need a good marriage between feature engineering and data to have a good success rate. Learning curves [0] are a good way to tell whether your ML algorithm needs more data or better feature engineering.

[0] http://mlwiki.org/index.php/Learning_Curves
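
To make the learning-curve idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the synthetic ground truth (y = 2x + 1 plus noise), the one-variable least-squares model, and the helper names `fit_line`, `mse`, `make_data` and `learning_curve`. It only shows the mechanics: train on growing subsets and compare training error with error on held-out data.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on 1-D inputs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x if var_x > 0 else 0.0
    b = mean_y - a * mean_x
    return a, b


def mse(model, xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, rng, noise=1.0):
    # Hypothetical ground truth, chosen only for this demo.
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, noise) for x in xs]
    return xs, ys


def learning_curve(train_sizes, seed=0):
    """Return (n, train_mse, validation_mse) for each training size n."""
    rng = random.Random(seed)
    train_x, train_y = make_data(max(train_sizes), rng)
    val_x, val_y = make_data(200, rng)  # held-out data, never used for fitting
    rows = []
    for n in train_sizes:
        model = fit_line(train_x[:n], train_y[:n])
        rows.append((n,
                     mse(model, train_x[:n], train_y[:n]),
                     mse(model, val_x, val_y)))
    return rows


if __name__ == "__main__":
    for n, train_err, val_err in learning_curve([5, 10, 20, 50, 100, 200]):
        print(f"n={n:4d}  train MSE={train_err:.3f}  validation MSE={val_err:.3f}")
```

The usual reading: if the two errors have converged and both are still too high, more data probably won't help and the features or model need work; if validation error is still well above training error, more data may close the gap.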

Much of the programming with ML has moved towards cleaning, extrapolating and generating the data.

But this type of programming is, miraculously, bug-free. We never hear of data conversion gone wrong, data corrupted, or data mining without conclusive results here. Obviously such bugs lack the glamour of security bugs.

It's also very difficult to catch these errors. Your trained model just doesn't work as well as it could, but how would you be able to tell?
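
One partial answer is to audit the data before training rather than waiting for the model to tell you. A rough sketch, assuming records arrive as dicts with hashable values and that you can state a plausible range per field; the `audit_rows` helper, the field names and the ranges are all made up for illustration:

```python
import math
from collections import Counter


def audit_rows(rows, ranges):
    """Count common silent data problems in a list of dict records.

    `ranges` maps a field name to its plausible (low, high) interval.
    """
    report = Counter()
    seen = set()
    for row in rows:
        # Exact duplicates often come from a join or export gone wrong.
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicate_rows"] += 1
        seen.add(key)
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                report[f"{field}:missing"] += 1
            elif not (low <= value <= high):
                # Catches sentinels like -1 and unit mix-ups (m vs cm).
                report[f"{field}:out_of_range"] += 1
    return dict(report)


if __name__ == "__main__":
    # Hypothetical records with deliberately planted problems.
    rows = [
        {"age": 34, "height_cm": 172.0},
        {"age": 34, "height_cm": 172.0},           # exact duplicate
        {"age": -1, "height_cm": 180.0},           # sentinel value
        {"age": 51, "height_cm": 1.75},            # metres, not centimetres
        {"age": None, "height_cm": float("nan")},  # missing values
    ]
    print(audit_rows(rows, {"age": (0, 120), "height_cm": (50, 250)}))
```

Checks like these won't catch everything (a subtly wrong label looks perfectly valid), but they are cheap and they surface the conversion and corruption bugs that otherwise only show up as a slightly worse model.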

> focus a lot of his time on the data itself... from where he intends to collect it? how is it structured? is it reliable? is it "enough"?

What are the best books on this subject? I suppose it's a very broad topic and thus more difficult to talk about than a single "neural network" algorithm.

I'm interested in which part of that you feel needs to be explained in more depth. I'm not sure reading several books is necessary for explaining data collection and data munging... to me it's definitely something best learned by doing.

(I work in data analysis/stats.)

Lots of things are best learned by doing. I just noticed there are dozens of books about machine learning algorithms but none on how to gather data. Of course, both those things can be learned independently, but I think there's room for at least a few books about data gathering considering it's so important for good machine learning results.

Here at Manning (we're publishing Francois' book) we have something in our early access program on this now - https://www.manning.com/books/the-art-of-data-usability

This is the domain of statistics, isn't it?

Agreed. AFAIK, only statistics has addressed the question of information sufficiency in data and the discriminative power of a method. Personally, I think the former is an enormously important subject that isn't addressed well in most ML texts. How much data is necessary to answer a given question in practice? How do you know if your data or method are "good enough"?

From what I've seen, statistics addresses these questions better than CS-taught ML does. CS-based ML is no different from algorithm analysis; it suffers from sensitivity to limits inherent in the data. But ML courses often don't address these limits very rigorously. Yet knowing those limits is all-important when effectively mining information at a professional level.

If you can't tell the decision maker what you know and what you don't, your inference/prediction really isn't useful. From what I've seen, statistics addresses this best.

Thanks for sharing your experience. I'm happy that my previous exposure to trading algorithms at least helped me understand more of what the experts here are talking about. I believe the output model is only as good as the data (at least for the deep learning branch of ML). If the dataset does not cover data points which exist in a wider space but in the same problem domain, or which don't yet have a precedent, then we really can't simply assume that it is the algo/model that needs tweaking when shit hits the fan.

This is incredibly true: even with crappy old algorithms you can do A LOT if you have great data.

Recent experience with a company that is building some models based on... a few guys recording a few hours of audio and annotating it. I still can't get over the fact that otherwise smart people think this is going to work at all.

> but it looks that my data-related advice falls on deaf ears every time, he's only interested in me pointing to him the latest cool ML algorithm.

So, it seems their learning/planning algorithm fails, even when it is given the right data. That's unfortunate.

Sorry, I can't help but notice that you aren't happy with their brain's algorithm while talking about the importance of data. I'm not saying that data doesn't matter or anything. Just a random observation.

Could actually be their data, right? Imagine if you had only had experience with software engineering. The only data you use when engineering software are the data you learn when using the product or writing tests; it's all the algorithms behind it that are important. So to them, they just don't have data on situations where the data are important.

Wow that's confusing wording. I hope it makes sense.

It does, but the algorithm doesn't seem to be state-of-the-art; it's more like current ML algorithms, which need lots of data to work successfully in each new domain. Well, there's a lot of room for improvement, at least.

The data processing inequality says processing data does not increase its information content.
But processing does increase the "obviousness" of the information content.

E.g. projecting the data onto independent dimensions doesn't change the information it contains, but it highlights that those dimensions are indeed independent. Decomposing a multimodal distribution into a mixture of unimodal distributions gives more insight than just viewing it as a bunch of data mushed together. And so on.
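
For anyone who wants to see the inequality numerically, here is a small sketch on a toy discrete channel. The channel itself (X uniform on four symbols, Y equal to X with probability 0.7 and shifted by one otherwise) and the helper names are assumptions picked for the demo. A lossy function of Y can only keep or lose information about X, while an invertible relabelling keeps all of it.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)


def push_forward(joint, f):
    """Joint distribution of (X, f(Y)) given the joint of (X, Y)."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(x, f(y))] += p
    return dict(out)


def example_joint():
    # Hypothetical noisy channel: Y = X with prob. 0.7, else (X + 1) mod 4.
    joint = {}
    for x in range(4):
        joint[(x, x)] = 0.25 * 0.7
        joint[(x, (x + 1) % 4)] = 0.25 * 0.3
    return joint


if __name__ == "__main__":
    joint = example_joint()
    i_xy = mutual_information(joint)
    # Lossy processing: merge pairs of symbols.
    i_lossy = mutual_information(push_forward(joint, lambda y: y // 2))
    # Invertible processing: a permutation of the four symbols.
    i_relabel = mutual_information(push_forward(joint, lambda y: (3 * y) % 4))
    print(f"I(X;Y)          = {i_xy:.4f} bits")
    print(f"I(X;lossy(Y))   = {i_lossy:.4f} bits")
    print(f"I(X;relabel(Y)) = {i_relabel:.4f} bits")
```

Note that mutual information is blind to the relabelling, which is exactly the point: a representation can be far easier or harder to read without this number moving at all.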

I think there should be a branch of information theory that quantifies the obviousness of information and how it is changed by various data processing methods.