by bkanber 3031 days ago
> How did you figure out what features would be useful?

There are various feature engineering and feature extraction techniques. Filter methods, wrapper methods, and embedded methods. Principal component analysis, autoencoding, variance analysis, linear discriminant analysis, Gini index, genetic algorithms, etc -- the feature selection process will depend on the dataset, the problem domain, the analysis algorithm you ultimately use, etc.
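As a rough illustration of the simplest kind of filter method, here is a minimal variance-analysis sketch. The function name `variance_filter`, the threshold, and the toy data are all made up for the example; in practice you would pick the threshold based on the scale of your features.

```python
from statistics import pvariance

def variance_filter(rows, threshold=0.0):
    """Return indices of columns whose population variance exceeds threshold.

    Hypothetical helper for illustration: a column that barely varies
    cannot help separate the samples, so it is a candidate for removal.
    """
    columns = list(zip(*rows))  # transpose: rows -> columns
    return [i for i, col in enumerate(columns) if pvariance(col) > threshold]

# Toy data (made up): the middle column is constant.
rows = [
    [1.0, 5.0, 0.1],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.2],
    [4.0, 5.0, 0.8],
]

kept = variance_filter(rows)
reduced = [[row[i] for i in kept] for row in rows]
print(kept)     # indices of the columns that survive the filter
print(reduced)  # the dataset with the constant column dropped
```

This only looks at each feature in isolation. Wrapper and embedded methods instead judge features by how much they help a specific model, which is why the choice depends on the algorithm you end up using.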

> How did you figure out what algorithm(s) are appropriate?

Also depends on the problem domain. Discrete or continuous data? Categorical features, numeric features, features as bitmasks. Do you need a probabilistic outcome? Etc.

Generally you start with the easiest algorithms in your toolbox to see how viable they are. For a classification task I'll almost always start with a naive Bayes classifier (if the data allows) and/or a random forest and see how they perform. If the problem domain is highly non-linear you might start with a support vector machine or another kernel method. A neural network is a last resort for me, as I find most classification problems can be solved to a high accuracy much more simply.
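To make "start with the easiest algorithm" concrete, here is a minimal sketch of a Gaussian naive Bayes classifier compared against a majority-class baseline. It assumes numeric features and uses only the standard library. The function names and the toy data are invented for the example, and a real project would more likely use an existing library implementation.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pvariance

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate a class prior plus per-feature mean and variance per class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        cols = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(X),
            "means": [mean(c) for c in cols],
            # eps keeps the variance strictly positive for constant features
            "vars": [pvariance(c) + eps for c in cols],
        }
    return model

def predict_proba(model, row):
    """Return normalized class probabilities for one sample."""
    log_scores = {}
    for label, p in model.items():
        score = math.log(p["prior"])
        for x, m, v in zip(row, p["means"], p["vars"]):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        log_scores[label] = score
    top = max(log_scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(s - top) for k, s in log_scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def predict(model, row):
    probs = predict_proba(model, row)
    return max(probs, key=probs.get)

def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

# Toy data (made up): two well-separated clusters.
X_train = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y_train = ["a", "a", "a", "b", "b", "b"]
X_test = [[1.1, 2.0], [3.1, 4.0]]
y_test = ["a", "b"]

# Baseline: always predict the most common training label.
majority = Counter(y_train).most_common(1)[0][0]
baseline_acc = accuracy([majority] * len(y_test), y_test)

model = fit_gaussian_nb(X_train, y_train)
nb_acc = accuracy([predict(model, row) for row in X_test], y_test)
print("baseline:", baseline_acc, "naive Bayes:", nb_acc)
```

The point of the baseline is that a simple model only counts as viable if it clearly beats "always guess the majority class" on held-out data. `predict_proba` also covers the case where you need a probabilistic outcome rather than a hard label.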

> how and why did you massage the data in a specific way?

This relates back to #1 -- you should only massage data based on what your feature engineering tells you to do. Sometimes you might want to remove outliers or clean up the training data, but only if the outliers really should be removed from consideration entirely.
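As one hedged example of being careful with outliers, the sketch below flags candidates with Tukey's interquartile-range fences instead of deleting them outright. The rule, the multiplier `k=1.5`, and the sample values are assumptions for illustration; whether a flagged point really should be dropped is still a judgment about your problem domain.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Toy measurements (made up): one reading is far from the rest.
values = [10.0, 11.0, 9.5, 10.5, 10.2, 55.0, 9.8]

flagged = iqr_outliers(values)
print(flagged)  # indices to inspect by hand before deciding to drop them

# Only after deciding the flagged points are genuine errors:
cleaned = [v for i, v in enumerate(values) if i not in flagged]
```

Keeping the flagging step separate from the removal step makes it easy to look at what would be thrown away first.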

1 comments

Thanks for the response!

> There are various feature engineering and feature extraction techniques. Filter methods, wrapper methods, and embedded methods. Principal component analysis, autoencoding, variance analysis, linear discriminant analysis, Gini index, genetic algorithms, etc -- the feature selection process will depend on the dataset, the problem domain, the analysis algorithm you ultimately use, etc.

Obviously that's a big toolbox, and I'm sure it takes time to develop an intuitive understanding of all these techniques. What I'm hoping for is some sort of guidebook on what to look for when I run into problems. So let's say you try out an algorithm and your accuracy (or whatever evaluation criterion you might have) is low. How do you figure out whether that's due to the algorithm or due to feature selection (or the lack of it)?

An analogy that might be useful: when I see my database queries are slow, I can use EXPLAIN to guide which knobs to tune. Obviously it requires understanding what indexes are, what a full table scan is, etc., but the EXPLAIN plan provides a guidebook of sorts.

Every problem is different, so the only advice I can give is: research, research, research! Do the hard work up-front; figure out how to describe your problem in a mathematical sense, and identify the right tools to use for the shape of your input, output and problem dimensions. What's the distribution of each dimension? Are the relationships linear, nonlinear, clustered, dispersed, logarithmic, etc.? Once you know those things, you're able to narrow in on the right tools and analyses to use.
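One small, hedged example of checking whether a relationship is linear: compare the Pearson correlation (which measures linear association) with the Spearman rank correlation (which measures any monotonic association). A large gap between the two hints that a transform such as a logarithm may help. The helper functions and the synthetic data below are made up for illustration.

```python
import math
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Synthetic data (made up): y grows exponentially with x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [math.exp(x) for x in xs]

r_linear = pearson(xs, ys)
r_rank = spearman(xs, ys)
r_logged = pearson(xs, [math.log(y) for y in ys])
print(r_linear, r_rank, r_logged)
```

Here the rank correlation is essentially perfect while the linear one is noticeably lower, and taking the log of `ys` restores a near-perfect linear fit. That is the kind of fact about the shape of the data that narrows down which tools make sense.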