Hacker News
by eden_h 2570 days ago
I tried CatBoost when it came out. It should be very popular, as handling categorical variables is where a lot of people seem to fall down with Random Forests.

The 'typical' response is either to turn them into a single numeric variable, so 1-3 for 3 categories, or to make an individual column for each one. The first approach makes sense for ordinals, but not so much for true categories, and the latter makes it difficult to group categories when two categories together have more predictive capability than any single category. I know that LightGBM did a lot of work to optimise testing groups of categories, as testing every possible grouping in a large set is very expensive.
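To make the two encodings concrete, here is a minimal sketch in plain Python. The `colours` column and all variable names are made up for illustration; in practice a library encoder would normally do this.

```python
# Hypothetical toy column; the values are made up for illustration.
colours = ["red", "green", "blue", "green", "red"]

# Approach 1: integer codes 1..k, assigned in sorted order of category name.
categories = sorted(set(colours))  # ['blue', 'green', 'red']
code_of = {cat: i + 1 for i, cat in enumerate(categories)}
integer_encoded = [code_of[c] for c in colours]

# Approach 2: one 0/1 indicator column per category.
one_hot = {cat: [int(c == cat) for c in colours] for cat in categories}

print(integer_encoded)  # [3, 2, 1, 2, 3]
print(one_hot)
```

The integer codes impose an order (blue < green < red) that the data does not have, while the indicator columns avoid that at the cost of one column per category.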

When I tried CatBoost in R, I remember it downloading a large binary to work with, which put me off considerably, and predicting with it was pretty fragile, even for R. I trust Yandex about as much as I'd trust Google, but it seemed 'odd'.

2 comments

In Kaggle, I often turn categoricals into numerics and call it a day (even if it's not ordinal). I have even found that forcing ordinality (like software versions in the Microsoft malware competition) usually makes things worse in hold out.

Spending too much time on categoricals is a waste: there are other things you can improve in your limited time, and even 'doing the right thing' results in poorer performance in hold out.

Catboost is great, it ensembles wonderfully with xgboost. If you find it being fragile, you probably have outliers that need to be dropped. Tree algs are really just fancy nearest neighbors, so an outlier can ruin predictions considerably.

In general, lgbm trains fast and lets you try many things quickly, but almost always underperforms catboost and xgboost. Catboost performs really well out of the box and you can generally get results quicker than with xgboost, but a well tuned xgboost is usually the best. Since xgboost and catboost build trees differently and both perform really well, they make great friends in ensembles.
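The simplest form of ensembling two models is to average their predictions. The sketch below assumes a regression task and uses made-up prediction lists standing in for the outputs of two different models; the numbers only show the mechanism (errors that point in different directions partly cancel) and are not evidence about any real library.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length lists."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def blend(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two prediction lists."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(preds_a, preds_b)]

# Hypothetical targets and predictions from two hypothetical models.
y_true = [1.0, 2.0, 3.0, 4.0]
preds_a = [1.2, 1.8, 3.3, 3.9]
preds_b = [0.9, 2.3, 2.8, 4.2]

blended = blend(preds_a, preds_b)
print(mse(y_true, preds_a), mse(y_true, preds_b), mse(y_true, blended))
```

In this toy case the blend has a lower error than either input because the two sets of errors mostly have opposite signs; with highly correlated models the gain would be much smaller.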

I have done pretty well on Kaggle though I haven't invested much time: top 100 in the Zillow home price prediction competition.

I think it is actually preferable to start by converting categorical variables to numeric most of the time, even if they are not ordinal. The RF algo can separate off an individual class with 2 splits (e.g. <=7 then >=7) if a single class is very important. The "pool" of features for RF sampling also doesn't get diluted with many one-hot encoded columns that all come from the one feature.
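Both points can be checked with a few lines of plain Python. This is only a sketch: the codes 1..10 and the feature counts are made up, and a real tree would pick thresholds such as 6.5 and 7.5 rather than the integer comparisons used here.

```python
codes = list(range(1, 11))  # hypothetical integer codes for a 10-level category

# First split keeps x <= 7; the second, inside that branch, keeps x >= 7.
left_branch = [x for x in codes if x <= 7]
isolated = [x for x in left_branch if x >= 7]
print(isolated)  # [7]

# Feature pool size under each encoding, for a made-up table with
# 5 numeric features plus one 50-level categorical feature.
n_numeric, n_levels = 5, 50
pool_integer_coded = n_numeric + 1   # the categorical stays one column
pool_one_hot = n_numeric + n_levels  # the categorical becomes 50 columns
print(pool_integer_coded, pool_one_hot)  # 6 55
```

With one-hot encoding, 50 of the 55 columns available for sampling at each split come from a single original feature, which is the dilution described above.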

I am pretty sure I've seen this done successfully in kaggle a bunch before, but don't have any sources on hand for evidence that this method is "better". It does however make it much easier to just throw the data into the RF and check the feature importances to see which features are helping the most.

The only case it struggles with is when the grouping is difficult to achieve in a small number of splits, such as 1,3,5 against 2,4,6,7, especially since each of those splits has to show more predictive capability than any split on the other columns in order to be chosen.
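One rough way to see why the interleaved grouping is hard: count how many cut points along the integer code are needed to separate a group from the rest. The helper `thresholds_needed` below is a hypothetical illustration, not part of any library.

```python
def thresholds_needed(codes, target_group):
    """Count cut points between adjacent codes where group membership flips."""
    ordered = sorted(codes)
    flips = 0
    for a, b in zip(ordered, ordered[1:]):
        if (a in target_group) != (b in target_group):
            flips += 1
    return flips

codes = range(1, 8)  # categories coded 1..7
print(thresholds_needed(codes, {7}))        # 1
print(thresholds_needed(codes, {4}))        # 2
print(thresholds_needed(codes, {1, 3, 5}))  # 5
```

A single class in the middle of the range needs two cuts, but separating 1,3,5 from 2,4,6,7 needs five, and each of those cuts has to win against every other candidate split at its node.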