HN Mirror


by selectron 3617 days ago

1) It depends heavily on the model. Something like xgboost (gradient-boosted decision trees) handles irrelevant features fairly well, while other models (linear models, especially without lasso regularization) have much more trouble. In virtually all cases, adding noise will decrease model performance.

2) Same as 1): it depends on the model. With good hyperparameters, xgboost can handle correlated features well, while other models may struggle.

3) With a good model (again, like xgboost), feature engineering is usually the best use of your time. Removing "bad" labels and "noise" from the data is especially dangerous: unless you are extremely careful, you can make your model worse. If you can identify why a label is "bad", you can remove or correct it, but you need a reason to believe those bad labels won't also show up in your test dataset. Removing outliers can help your model, but it is risky. In contrast, smart feature engineering is low risk and can provide large gains if you see a pattern the model could not see. Feature selection can be important as well, and it is generally pretty quick assuming you have good hardware, so you might as well do it, especially if you already have some idea of which features are unlikely to be useful.
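
To make point 1) concrete, here is a rough sketch of how irrelevant features can hurt a linear model with no regularization. It assumes numpy and uses a plain least-squares fit as a stand-in for "linear model without lasso"; the sample sizes, the coefficient 3.0, the seed and the helper name `ols_test_mse` are arbitrary choices for illustration. It does not involve xgboost at all.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_test_mse(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares (no regularization) and return the test MSE."""
    # Prepend an intercept column, then solve the least-squares problem.
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_test @ coef - y_test) ** 2))

n_train, n_test, n_noise = 50, 1000, 40

# One informative feature; the target is linear in it plus unit-variance noise.
x_train = rng.normal(size=(n_train, 1))
x_test = rng.normal(size=(n_test, 1))
y_train = 3.0 * x_train[:, 0] + rng.normal(size=n_train)
y_test = 3.0 * x_test[:, 0] + rng.normal(size=n_test)

# Irrelevant features: pure noise, unrelated to the target.
noise_train = rng.normal(size=(n_train, n_noise))
noise_test = rng.normal(size=(n_test, n_noise))

mse_clean = ols_test_mse(x_train, y_train, x_test, y_test)
mse_noisy = ols_test_mse(
    np.hstack([x_train, noise_train]), y_train,
    np.hstack([x_test, noise_test]), y_test,
)

print(f"test MSE, informative feature only: {mse_clean:.2f}")
print(f"test MSE, plus {n_noise} noise features: {mse_noisy:.2f}")
```

With few training rows relative to the number of columns, the unregularized fit spends its capacity fitting the noise columns, so the second number should come out clearly worse. This says nothing about how a tree ensemble would behave on the same data.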
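
For point 2), one way a model can "struggle" with correlated features is unstable coefficients. The sketch below assumes numpy and an unregularized least-squares fit; the near-duplicate feature, the 0.01 noise scale and the helper names (`coef_spread`, `independent`, `correlated`) are made up for illustration. It compares how much the first coefficient moves across repeated samples when the second feature is independent and when it is almost a copy of the first.

```python
import numpy as np

rng = np.random.default_rng(1)

def coef_spread(make_features, n=200, n_repeats=200):
    """Std. dev. of the first OLS coefficient over repeated fresh samples."""
    coefs = []
    for _ in range(n_repeats):
        X = make_features(n)
        # The true model only uses the first feature.
        y = 2.0 * X[:, 0] + rng.normal(size=n)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        coefs.append(beta[0])
    return float(np.std(coefs))

def independent(n):
    # Two unrelated standard-normal features.
    return rng.normal(size=(n, 2))

def correlated(n):
    # The second feature is a near-duplicate of the first.
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)
    return np.column_stack([x1, x2])

spread_indep = coef_spread(independent)
spread_corr = coef_spread(correlated)

print(f"coefficient std, independent features: {spread_indep:.3f}")
print(f"coefficient std, correlated features:  {spread_corr:.3f}")
```

The predictions can still be fine in the correlated case; what suffers is the stability (and interpretability) of the individual coefficients, which is one reason regularization or dropping one of the duplicates is commonly suggested for linear models.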
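
For point 3), a toy illustration of "a pattern the model could not see". It assumes numpy and again uses a plain linear fit rather than xgboost, which makes the effect extreme: the target is a product of two features, which a linear model cannot represent from the raw columns. The data, the seed and the helper name `ols_test_mse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def ols_test_mse(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares with an intercept and return the test MSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_test @ coef - y_test) ** 2))

def make_data(n):
    # The target depends on the interaction x1 * x2, plus a little noise.
    X = rng.normal(size=(n, 2))
    y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=n)
    return X, y

def add_interaction(X):
    # Engineered feature: the product of the two raw columns.
    return np.column_stack([X, X[:, 0] * X[:, 1]])

X_train, y_train = make_data(500)
X_test, y_test = make_data(500)

mse_raw = ols_test_mse(X_train, y_train, X_test, y_test)
mse_engineered = ols_test_mse(
    add_interaction(X_train), y_train,
    add_interaction(X_test), y_test,
)

print(f"test MSE, raw features:             {mse_raw:.3f}")
print(f"test MSE, with engineered x1 * x2:  {mse_engineered:.3f}")
```

A tree ensemble would recover part of this interaction on its own, so the gap there would likely be smaller. The sketch only shows the mechanism: handing the model the pattern directly is cheap, and it does not involve deleting any rows or labels.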