by Buttons840 3259 days ago
> focus a lot of his time on the data itself... from where he intends to collect it? how is it structured? is it reliable? is it "enough"?

What are the best books on this subject? I suppose it's a very broad topic and thus more difficult to talk about than a single "neural network" algorithm.

3 comments

Which part of that do you feel needs to be explained in more depth? I'm not sure reading several books is necessary to learn data collection and data munging; to me it's definitely something best learned by doing.

work in data analysis/stats

Lots of things are best learned by doing. I just noticed there are dozens of books about machine learning algorithms but none on how to gather data. Of course, both those things can be learned independently, but I think there's room for at least a few books about data gathering considering it's so important for good machine learning results.
Here at Manning (we're publishing Francois's book) we have something in our early access program on this now: https://www.manning.com/books/the-art-of-data-usability
This is the domain of statistics, isn't it?
Agreed. AFAIK, only statistics has addressed the question of info sufficiency in data and discriminative power of method. Personally, I think the former is an enormously important subject that isn't addressed well in most ML texts. How much data is necessary to answer a given question in practice? How do you know if your data or method are "good enough"?

From what I've seen, statistics addresses these questions better than CS-taught ML does. CS-based ML is taught much like algorithm analysis, yet its results are sensitive to limits inherent in the data, and ML courses often don't address those limits very rigorously. Knowing those limits is all-important when mining information effectively at a professional level.

If you can't tell the decision maker what you know and what you don't, your inference/prediction really isn't useful. From what I've seen, statistics addresses this best.
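
To make the "how much data is enough" question concrete, here is a minimal sketch of one textbook case: estimating a proportion. It assumes simple random sampling and the normal approximation, and the function names and example numbers are made up for illustration, not taken from any of the books mentioned above.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence=0.95, p=0.5):
    """Smallest n whose normal-approximation interval has half-width <= margin.

    p=0.5 is the worst case (largest variance), used when nothing is known
    about the true proportion in advance.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return math.ceil(z * z * p * (1 - p) / margin ** 2)


def proportion_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clamp to [0, 1], since a proportion cannot fall outside that range.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# How many observations for a +/- 3 point margin at 95% confidence?
n = sample_size_for_proportion(margin=0.03)
print(n)  # 1068

# Hypothetical result: 540 "yes" answers out of 1000.
low, high = proportion_interval(successes=540, n=1000)
print(round(low, 3), round(high, 3))
```

The interval is the part you hand to the decision maker: it states both the estimate and how far off it could plausibly be. The Wald interval is a poor choice for small samples or proportions near 0 or 1, so treat this as a starting point only.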