by perturbation 2629 days ago

AutoML essentially means using heuristics or an optimization algorithm to select a model architecture and then train that model. Feature engineering / feature synthesis and interpretability remain open challenges.
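To make the "select and train" loop concrete, here is a minimal sketch, assuming a toy 1-D regression problem and a k-nearest-neighbour regressor whose only tunable setting is k. The data, the candidate list, and all function names are made up for illustration; real AutoML systems search far larger spaces with smarter strategies.

```python
import random

def knn_predict(train, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def validation_mse(train, valid, k):
    """Mean squared error of the k-NN regressor on the held-out set."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in valid) / len(valid)

def auto_select(train, valid, candidate_ks):
    """Score every candidate configuration and keep the one with the lowest validation error."""
    scores = {k: validation_mse(train, valid, k) for k in candidate_ks}
    best_k = min(scores, key=scores.get)
    return best_k, scores

# Synthetic data (an assumption for this sketch): y = x^2 plus Gaussian noise.
rng = random.Random(0)
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.5)) for x in range(100)]
rng.shuffle(data)
train, valid = data[:70], data[70:]

best_k, scores = auto_select(train, valid, candidate_ks=[1, 3, 5, 9, 15])
print("best k:", best_k)
```

The search here is an exhaustive loop over five candidates; swapping in random search or Bayesian optimization changes only `auto_select`.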

If I'm understanding your questions correctly, the main problems I see with this are:

- Using raw data instead of feature engineering (less of a problem given feature synthesis libraries like https://www.featuretools.com/ and other heuristic methods). I'd expect Google to do a good job of basic things like normalization of raw input features before training.

- Using features that it really shouldn't (if you just throw ML at your database for, say, loan applications, then sensitive / personally identifying information can and will be used as features)

- Lack of insight / understanding as to what is driving the model. This can be partially overcome with post-training methods like LIME, Shapley values, etc.
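A rough sketch of the last two points, under stated assumptions: the column names, the `SENSITIVE` set, and the fixed linear `model` are all hypothetical, and the explanation step uses permutation importance, a simpler post-training method than LIME or Shapley values, chosen here only because it fits in a few lines of plain Python.

```python
import random

# Hypothetical column names for a loan-application table.
SENSITIVE = {"name", "ssn", "zip_code"}

def select_features(row, sensitive=SENSITIVE):
    """Drop sensitive columns before the row can reach any model."""
    return {k: v for k, v in row.items() if k not in sensitive}

def model(features):
    """Toy stand-in for a trained model: a fixed linear score that ignores 'tenure'."""
    return 0.8 * features["income"] - 0.5 * features["debt"]

def mse(rows, targets):
    """Mean squared error of the toy model against the targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(rows, targets, column, seed=0):
    """Increase in error after shuffling one column's values across rows."""
    rng = random.Random(seed)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [{**r, column: v} for r, v in zip(rows, shuffled)]
    return mse(permuted, targets) - mse(rows, targets)

# Made-up rows for illustration only.
raw = [
    {"name": f"applicant{i}", "ssn": f"000-00-{i:04d}", "zip_code": "00000",
     "income": float(i + 1), "debt": float((3 * i) % 8), "tenure": float(i % 4)}
    for i in range(8)
]

rows = [select_features(r) for r in raw]
targets = [model(r) for r in rows]  # targets come from the toy model, so baseline error is 0

importances = {c: permutation_importance(rows, targets, c)
               for c in ("income", "debt", "tenure")}
print(importances)
```

A column the model never reads (`tenure` here) gets an importance of exactly zero, which is the kind of sanity check these post-training methods are good for.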

I wouldn't expect predictions to come from a set of discrete values: if you are, say, predicting housing values with a NN, the output should be continuous and depend on the input features.

mritchie712 2629 days ago

Another common error I see is timing (e.g. using data from the "future" to predict an event). To build on your loan example, if you inadvertently included the current FICO score of an applicant who applied 12 months ago, it will be unfairly correlated with the loan's current performance, because the score already reflects what happened after the application.
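One way to guard against this is a point-in-time lookup: for each training example, use only feature values observed on or before the prediction date. A minimal sketch, assuming a per-applicant score history sorted by date (the dates, scores, and function name are invented for illustration):

```python
from bisect import bisect_right
from datetime import date

# Hypothetical FICO history for one applicant: (observation date, score), sorted by date.
fico_history = [
    (date(2017, 1, 15), 700),
    (date(2017, 7, 1), 680),
    (date(2018, 2, 1), 610),  # observed long after the application
]

def score_as_of(history, cutoff):
    """Return the latest score observed on or before the cutoff, or None if there is none."""
    dates = [d for d, _ in history]
    i = bisect_right(dates, cutoff)
    return history[i - 1][1] if i else None

application_date = date(2017, 3, 1)

leaky_feature = fico_history[-1][1]                          # "current" score, includes the future
safe_feature = score_as_of(fico_history, application_date)   # what was known when they applied

print(leaky_feature, safe_feature)  # prints: 610 700
```

The leaky value already encodes how the loan turned out, so a model trained on it would look better in validation than it could ever be in production.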

kmax12 2627 days ago

This is very important! If you use Featuretools, we provide a mechanism to avoid this very problem. See how we handle time in our documentation here: https://docs.featuretools.com/automated_feature_engineering/...
