by redditmigrant
3031 days ago
As someone who is trying to learn ML, all the courses available are hugely helpful. One thing I wish I had easy access to is the process that someone goes through while trying to build a model on a real dataset. Specifically, the following questions are the ones I struggle with: 1. How did you figure out what features would be useful? 2. How did you figure out what algorithm(s) are appropriate? 3. how and why did you massage the data in a specific way?
> How did you figure out what features would be useful?
There are various feature engineering and feature extraction techniques: filter methods, wrapper methods, and embedded methods. Principal component analysis, autoencoders, variance analysis, linear discriminant analysis, Gini index, genetic algorithms, etc. Which feature selection process fits will depend on the dataset, the problem domain, the analysis algorithm you ultimately use, etc.
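To make the filter / wrapper / embedded distinction concrete, here is a minimal sketch using scikit-learn. The synthetic dataset, the choice of `k=3`, and the specific estimators (logistic regression for the wrapper, a random forest for the embedded method) are my assumptions for illustration, not a recommendation for any particular problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a real dataset: 10 features, 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)


def select_features(X, y, k=3):
    """Return the feature indices picked by one method from each family."""
    # Filter: score every feature on its own (ANOVA F-test) and keep the top k.
    filter_sel = SelectKBest(score_func=f_classif, k=k).fit(X, y)

    # Wrapper: repeatedly fit a model and eliminate the weakest feature.
    wrapper_sel = RFE(
        LogisticRegression(max_iter=1000), n_features_to_select=k
    ).fit(X, y)

    # Embedded: reuse the importances learned while training the model itself.
    embedded_sel = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        max_features=k,
        threshold=-np.inf,  # rank purely by importance and keep the top k
    ).fit(X, y)

    return {
        "filter": np.flatnonzero(filter_sel.get_support()),
        "wrapper": np.flatnonzero(wrapper_sel.get_support()),
        "embedded": np.flatnonzero(embedded_sel.get_support()),
    }


selected = select_features(X, y)
for name, idx in selected.items():
    print(f"{name:>8}: features {idx.tolist()}")
```

The three families will not necessarily agree on which features to keep, and where they disagree is usually a good place to look more closely at the data.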
> How did you figure out what algorithm(s) are appropriate?
Also depends on the problem domain. Discrete or continuous data? Categorical features, numeric features, features as bitmasks. Do you need a probabilistic outcome? Etc.
Generally you start with the easiest algorithms in your toolbox to see how viable they are. For a classification task I'll almost always start with a naive Bayes classifier (if the data allows) and/or a random forest and see how they perform. If the problem domain is highly non-linear you might start with a support vector machine or another kernel method. A neural network is a last resort for me, as I find most classification problems can be solved to a high accuracy much more simply.
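As a rough sketch of that "simple baselines first" workflow, something like the following compares naive Bayes and a random forest with cross-validation in scikit-learn. The synthetic data, the 5-fold split, and accuracy as the metric are assumptions on my part; swap in your own dataset and whatever metric matters for your problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic placeholder data; replace with your own feature matrix and labels.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=4, random_state=0
)

# The cheap baselines to try before reaching for anything heavier.
baselines = {
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}


def evaluate_baselines(models, X, y, folds=5):
    """Return {name: (mean accuracy, std)} from k-fold cross-validation."""
    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
        results[name] = (scores.mean(), scores.std())
    return results


results = evaluate_baselines(baselines, X, y)
for name, (mean, std) in results.items():
    print(f"{name:>14}: accuracy {mean:.3f} +/- {std:.3f}")
```

If one of these already gets you close to the accuracy you need, that is a strong hint that a more complex model is not worth the effort yet.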
> how and why did you massage the data in a specific way?
This relates back to #1 -- you should only massage data based on what your feature engineering tells you to do. Sometimes you might want to remove outliers or clean up the training data, but only if the outliers really should be removed from consideration entirely.
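One way to act on the outlier point is to flag suspicious values for review instead of silently dropping them. The sketch below uses the common 1.5 x IQR rule, which is my choice for illustration and not something prescribed above; the `readings` values are made up.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) limits using the k * IQR rule of thumb."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, k=1.5):
    """Separate values into (kept, flagged) without discarding anything."""
    low, high = iqr_bounds(values, k)
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if v < low or v > high]
    return kept, flagged


# Hypothetical measurements with one value that is clearly out of range.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 55.0]
kept, flagged = split_outliers(readings)
print("kept:   ", kept)
print("flagged:", flagged)
```

Keeping the flagged values in a separate list lets you decide case by case whether each one is a measurement error or a real, rare event the model should still see.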