by redditmigrant 3031 days ago
As someone who is trying to learn ML, all the courses available are hugely helpful. One thing I wish I had easy access to is the process that someone goes through while trying to build a model on a real dataset.

Specifically, the following questions are the ones I struggle with:

1. How did you figure out what features would be useful?

2. How did you figure out what algorithm(s) are appropriate?

3. How and why did you massage the data in a specific way?

4 comments

> How did you figure out what features would be useful?

There are various feature engineering and feature extraction techniques. Filter methods, wrapper methods, and embedded methods. Principal component analysis, autoencoding, variance analysis, linear discriminant analysis, Gini index, genetic algorithms, etc -- the feature selection process will depend on the dataset, the problem domain, the analysis algorithm you ultimately use, etc.
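To make "filter method" concrete, here is a minimal sketch (my own illustration, not any library's API): score each feature on its own against the target, without training a model, then rank. The feature names and numbers are made up, and absolute Pearson correlation is just one possible score; it only captures linear relationships.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target):
    """Filter method: score every feature independently of any model."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: one informative feature, one constant, one noisy.
columns = {
    "square_feet": [50, 70, 80, 100, 120, 150],
    "constant_flag": [1, 1, 1, 1, 1, 1],
    "day_of_month": [3, 28, 11, 19, 7, 22],
}
price = [150, 200, 240, 310, 350, 460]

ranking = rank_features(columns, price)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Wrapper and embedded methods differ in that they involve the model itself: wrappers retrain on candidate feature subsets, embedded methods read importance out of a fitted model.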

> How did you figure out what algorithm(s) are appropriate?

Also depends on the problem domain. Discrete or continuous data? Categorical features, numeric features, features as bitmasks. Do you need a probabilistic outcome? Etc.

Generally you start with the easiest algorithms in your toolbox to see how viable they are. For a classification task I'll almost always start with a naive Bayes classifier (if the data allows) and/or a random forest and see how they perform. If the problem domain is highly non-linear you might start with a support vector machine or another kernel method. A neural network is a last resort for me, as I find most classification problems can be solved to a high accuracy much more simply.
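As a rough illustration of "start simple", here is a sketch that compares a majority-class baseline with a small Gaussian naive Bayes written from scratch. Everything here is an assumption for illustration: the function names are hypothetical, the data is invented, and in practice you would likely reach for a library implementation instead.

```python
import math
from collections import Counter, defaultdict

def majority_baseline(train_labels):
    """Predict the most common training label for every input."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda row: most_common

def fit_gaussian_nb(rows, labels):
    """Fit per-class priors plus per-feature mean and variance."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small floor avoids division by zero
        model[label] = (len(members) / len(rows), stats)
    return model

def predict_gaussian_nb(model, row):
    """Return the label with the highest log posterior for one row."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

def accuracy(predict, rows, labels):
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)

# Invented, slightly imbalanced toy data with two numeric features.
train_rows = [(1.0, 1.2), (1.5, 0.8), (0.9, 1.1), (1.2, 1.6),
              (5.0, 5.2), (5.5, 4.8), (4.7, 5.1)]
train_labels = ["a", "a", "a", "a", "b", "b", "b"]
test_rows = [(1.1, 1.0), (1.4, 1.3), (0.8, 1.5), (5.2, 5.0), (4.9, 4.6)]
test_labels = ["a", "a", "a", "b", "b"]

baseline_acc = accuracy(majority_baseline(train_labels), test_rows, test_labels)
nb_model = fit_gaussian_nb(train_rows, train_labels)
nb_acc = accuracy(lambda row: predict_gaussian_nb(nb_model, row), test_rows, test_labels)
print(f"baseline={baseline_acc:.2f} naive_bayes={nb_acc:.2f}")
```

The baseline number is what makes the second number meaningful: if a simple model barely beats "always guess the majority class", that is a hint to look at the features before reaching for something heavier.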

> How and why did you massage the data in a specific way?

This relates back to #1 -- you should only massage data based on what your feature engineering tells you to do. Sometimes you might want to remove outliers or clean up the training data, but only if the outliers really should be removed from consideration entirely.
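One way to act on this cautiously is to flag outliers for inspection rather than drop them automatically. A minimal sketch using the common 1.5 x IQR rule (my choice of rule, not something prescribed above); the response-time values are invented.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (index, value) pairs outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]

# Hypothetical measurements with one suspicious reading.
response_ms = [120, 135, 128, 140, 131, 125, 138, 2900, 133, 127]

flagged = iqr_outliers(response_ms)
for index, value in flagged:
    # Decide per case: a sensor glitch can go, a real rare event should stay.
    print(f"row {index}: {value} looks unusual, inspect before removing")
```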

Thanks for the response!

> There are various feature engineering and feature extraction techniques. Filter methods, wrapper methods, and embedded methods. Principal component analysis, autoencoding, variance analysis, linear discriminant analysis, Gini index, genetic algorithms, etc -- the feature selection process will depend on the dataset, the problem domain, the analysis algorithm you ultimately use, etc.

Obviously that's a big toolbox and I'm sure it takes time to develop an intuitive understanding for all these techniques. What I hope for is some sort of guidebook on what to look for when I stumble across problems. So let's say you try out an algorithm and your accuracy (or whatever evaluation criteria you might have) is low. How do you figure out if that's due to the algorithm, or is it due to (or due to the lack of) feature selection?

An analogy that might be useful is, when I see my database queries are slow, I can use EXPLAIN to guide what knobs to tune. Obviously it requires understanding what indexes are, what a full table scan is etc. etc. but the EXPLAIN plan provides a guidebook of sorts.

Every problem is different, so the only advice I can give is: research research research! Do the hard work up-front; figure out how to describe your problem in a mathematical sense, and identify the right tools to use for the shape of your input, output and problem dimensions. What's the distribution of each dimension? Are the relationships linear, nonlinear, clustered, dispersed, logarithmic, etc.? Once you know those things, you're able to narrow in on the right tools and analyses to use.
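One cheap check for "linear vs. nonlinear" is to compare Pearson correlation (linear) with Spearman rank correlation (monotonic). This is only a sketch under my own assumptions: the helper names are hypothetical, the data is synthetic, and a gap between the two numbers is a hint rather than proof.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1..n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Synthetic example: y grows exponentially with x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.exp(v) for v in x]

linear_r = pearson(x, y)                        # noticeably below 1
rank_r = spearman(x, y)                         # 1.0: perfectly monotonic
log_r = pearson(x, [math.log(v) for v in y])    # near 1 after a log transform

print(f"pearson={linear_r:.3f} spearman={rank_r:.3f} pearson_after_log={log_r:.3f}")
```

When the rank correlation is much higher than the linear one, the relationship is probably monotonic but curved, which suggests trying a transform (here, a log) or a model that can handle nonlinearity.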

If you are willing to do the work, Frank Harrell's Regression Modeling Strategies is a pretty good introduction to a lot of this.

It's written for a very different set of problems than typical ML, but it has lots of really good advice for practical problems in data analysis and prediction (which is another term for ML).

Mostly people learn this stuff by experience. Find a dataset, choose a predictor, filter, clean and massage your data till you get better metrics/understanding (preferably both). Rinse, repeat on many different datasets and problems, and you'll know how to do this.

Georgia Tech has a graduate course on Machine Learning, CS-7641. There are four major projects in that course where the students must analyze (and re-analyze) a chosen dataset. Here is an example of the code one student used: https://github.com/JonathanTay/CS-7641-assignment-1 Unfortunately all the plotting code was intentionally removed. Sometimes the project reports make it online (http://www.dudonwai.com/docs/gt-omscs-cs7641-a3.pdf?pdf=gt-o...). Having spent several months of my life on the assignments, I'd say that the only way to learn it is to try a whole bunch of different things and try to figure out why some work and why some don't. Sometimes you learn from the failures, sometimes from the unexpected successes.

Take a look at some of the highly rated kernels on Kaggle - they’re often well annotated with the types of things you’re looking for, including actual experimentation to test ideas.

Edit: fix autocorrect