by eden_h
2570 days ago
I tried CatBoost when it came out. It should be very popular, since handling categorical variables is where a lot of people seem to fall down with Random Forests. The 'typical' response is either to turn them into a numeric variable (1-3 for three categories) or to make an individual column for each category. The first approach makes sense for ordinals, but not so much for true categories, and the second makes it hard to group categories when two categories taken together have more predictive power than any single one. I know LightGBM did a lot of work here to optimise testing groups of categories, since testing every possible grouping in a large set is very expensive. When I tried CatBoost in R, I remember it downloading a large binary to work with, which put me off considerably, and predicting with it was pretty fragile, even for R. I trust Yandex about as much as I'd trust Google, but it seemed 'odd'.
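To make the three options above concrete, here is a minimal pure-Python sketch. The function names and the toy data are made up for illustration. The grouping function is a simplified stand-in, not LightGBM's actual code: it sorts categories by mean target and scans the k-1 splits along that order, which is the classic shortcut for squared-error regression. As far as I understand its documentation, LightGBM does something in the same spirit, sorting by accumulated gradient statistics.

```python
from collections import defaultdict


def integer_encode(values):
    """Map each category to an arbitrary integer (implies an order that may not exist)."""
    codes = {c: i + 1 for i, c in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def one_hot_encode(values):
    """One 0/1 column per category, columns in sorted category order."""
    cats = sorted(set(values))
    return [[1 if v == c else 0 for c in cats] for v in values]


def best_grouped_split(values, targets):
    """Sort categories by mean target, then scan the k-1 splits along that order.

    Returns (left_group, total_sse) for the split with the lowest sum of
    squared errors. Hypothetical helper, simplified for illustration.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    if len(sums) < 2:
        raise ValueError("need at least two categories to split")
    ordered = sorted(sums, key=lambda c: sums[c] / counts[c])

    def sse(group):
        ys = [t for v, t in zip(values, targets) if v in group]
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)

    best = None
    for i in range(1, len(ordered)):
        left, right = set(ordered[:i]), set(ordered[i:])
        score = sse(left) + sse(right)
        if best is None or score < best[1]:
            best = (left, score)
    return best


# Toy data (made up): "red" and "blue" behave alike, "green" is different.
colors = ["red", "green", "blue", "red", "green", "blue"]
prices = [10, 1, 9, 11, 2, 10]

print(integer_encode(colors))
print(one_hot_encode(colors)[0])
print(best_grouped_split(colors, prices))
```

With integer codes a tree can only split along the arbitrary blue < green < red order, and with one-hot columns it can only peel off one colour at a time. The sorted scan finds the "green vs. red and blue together" grouping directly while checking k-1 candidate splits rather than every possible subset.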
|
Spending too much time on categoricals is a waste: there are other things you can improve in your limited time, and even 'doing the right thing' can result in poorer performance on the hold-out set.
CatBoost is great, and it ensembles wonderfully with XGBoost. If you find it fragile, you probably have outliers that need to be dropped. Tree algorithms are really just fancy nearest neighbors, so an outlier can ruin predictions considerably.
In general, LightGBM trains fast and lets you try many things quickly, but it almost always underperforms CatBoost and XGBoost. CatBoost performs really well out of the box, and you can generally get results quicker than with XGBoost, but a well-tuned XGBoost is usually the best. Since XGBoost and CatBoost build trees differently and both perform really well, they make great friends in ensembles.
I have done pretty well on Kaggle even though I haven't invested much time: top 100 in the Zillow home price prediction competition.
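A rough sketch of why two differently built models can ensemble well. The prediction lists below are made-up numbers standing in for XGBoost and CatBoost outputs, not real results, and `rmse` and `blend` are hypothetical helpers. The point is only that when two models' errors don't line up, a simple weighted average can beat both.

```python
import math


def rmse(preds, actual):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(preds, actual)) / len(actual))


def blend(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two models' predictions."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(preds_a, preds_b)]


# Made-up hold-out targets and predictions, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 18.0, 32.0, 38.0]  # stand-in for one model
model_b = [8.0, 22.0, 31.0, 39.0]   # stand-in for a model that errs differently

blended = blend(model_a, model_b)
print(rmse(model_a, actual), rmse(model_b, actual), rmse(blended, actual))
```

In practice you would pick `weight_a` on a validation set rather than assume 0.5, and the gain shrinks as the two models' errors become more correlated.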