by turingbike 2573 days ago
I don't know why it isn't as popular, but CatBoost should be on the list too https://catboost.ai/
3 comments

I tried CatBoost when it came out. It should be very popular, since handling categorical variables is where a lot of people seem to fall down with Random Forests.

The 'typical' response is either to turn them into a numeric variable, so 1-3 for 3 categories, or to make an individual column for each one. The first approach makes sense for ordinals, but not so much for actual categories, and the latter makes it difficult to group categories when a group of two categories together has more predictive capability than any single category. I know that LightGBM did a lot of work on this to optimise testing groups of categories, as testing every possible group in a large set is very intensive.
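
To make the two encodings concrete, here is a minimal sketch in plain Python. The `colors` column and its values are made up for illustration; they are not from any dataset mentioned in this thread.

```python
# Hypothetical toy column; the values are illustrative only.
colors = ["red", "green", "blue", "green", "red"]

# Approach 1: integer codes. Cheap, but it implies an order
# (blue < green < red) that the categories do not actually have.
levels = sorted(set(colors))
code_of = {level: i + 1 for i, level in enumerate(levels)}
integer_coded = [code_of[c] for c in colors]

# Approach 2: one indicator column per category. No implied order,
# but a single split can only test one category at a time.
one_hot = [[int(c == level) for level in levels] for c in colors]

print(integer_coded)
print(one_hot)
```

With one-hot columns, grouping two categories together takes at least two splits on two different columns, which is the difficulty described above.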

When I tried CatBoost in R, I remember it downloading a large binary to work with, which put me off considerably, and predicting with it was pretty fragile, even for R. I trust Yandex about as much as I'd trust Google, but it seemed 'odd'.

In Kaggle, I often turn categoricals into numerics and call it a day (even if they're not ordinal). I have even found that forcing ordinality (like software versions in the Microsoft malware competition) usually makes things worse on the holdout set.

Spending too much time on categoricals is a waste; there are other things you can improve in your limited time, and even 'doing the right thing' results in poorer performance on the holdout set.

CatBoost is great, it ensembles wonderfully with XGBoost. If you find it fragile, you probably have outliers that need to be dropped - tree algs are really just fancy nearest neighbors, so an outlier can ruin predictions considerably.

In general, LightGBM trains fast and lets you try many things quickly, but almost always underperforms CatBoost and XGBoost. CatBoost performs really well out of the box and you can generally get results quicker than with XGBoost, but a well-tuned XGBoost is usually the best. Since XGBoost and CatBoost build trees differently and both perform really well, they make great friends in ensembles.

I have done pretty well on Kaggle though I haven't invested much time: top 100 in the Zillow home price prediction competition.

I think it is actually preferable to start by converting categorical variables to numeric most of the time, even if they are not ordinal. The RF algo can separate off an individual class with 2 splits (e.g. <=7 then >=7 to isolate class 7) if a single class is very important. The "pool" of features for RF sampling also doesn't get diluted with one-hot encoded classes from a single feature.
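
A tiny sketch of the two-split trick, assuming the categories have already been mapped to integer codes. The `codes` list and the `isolate` helper are hypothetical names used only for illustration.

```python
# Hypothetical integer-coded categorical column.
codes = [3, 7, 9, 7, 1, 8]

def isolate(code, value=7):
    """Mimic two stacked threshold splits that single out one class."""
    if code <= value:        # first split: code <= 7
        if code >= value:    # second split on that branch: code >= 7
            return True      # only code == 7 reaches this leaf
    return False

isolated = [isolate(c) for c in codes]
print(isolated)
```

Only the rows with code 7 land in the final leaf, so a single important class stays reachable without one-hot encoding.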

I am pretty sure I've seen this done successfully on Kaggle a bunch of times, but I don't have any sources on hand as evidence that this method is "better". It does, however, make it much easier to just throw the data into the RF and check the feature importances to see which features are helping the most.

The only case it struggles with is when the grouping is difficult to achieve in a small number of splits, such as 1,3,5 against 2,4,6,7, especially since each split needs to show more predictive capability than any of the other candidate columns.
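
As a rough way to see why the interleaved grouping is hard, the sketch below counts how many distinct thresholds on the integer code are needed to separate a group from the rest. The `thresholds_needed` helper is a hypothetical illustration, not part of any library, and the count is only a proxy for tree depth.

```python
def thresholds_needed(group, all_codes):
    """Count boundaries between adjacent codes where group membership flips."""
    ordered = sorted(all_codes)
    flips = 0
    for left, right in zip(ordered, ordered[1:]):
        if (left in group) != (right in group):
            flips += 1
    return flips

all_codes = range(1, 8)

# 1,3,5 against 2,4,6,7: membership flips at almost every boundary.
interleaved = thresholds_needed({1, 3, 5}, all_codes)

# 1,2,3 against 4,5,6,7: a single threshold is enough.
contiguous = thresholds_needed({1, 2, 3}, all_codes)

print(interleaved, contiguous)
```

The interleaved grouping needs five distinct thresholds where the contiguous one needs a single threshold, and each of those splits has to win against every other candidate feature.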
You are right, CatBoost is an amazing algorithm. However, you would be shocked at how many older senior data scientists have never heard of XGBoost or LightGBM. CatBoost is far too new for them.
Thank God I came across it on Kaggle. Definitely amazing, and it also gives easy Python export of the model so you can put it as an udf in databases!
Interesting -- could you elaborate on your use case here?
>> as an udf

What exactly is that?

Just a guess on my part, but UserDefinedFunction?
User defined field perhaps?
I looked into the site but I didn't see how CatBoost actually accomplishes its goals compared to other gradient boosting algorithms. Is there a summary somewhere?
Read "CatBoost: unbiased boosting with categorical features". https://arxiv.org/abs/1706.09516
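
For a rough idea of the paper's central trick for categorical features (ordered target statistics), here is a simplified sketch in plain Python. It uses a single random permutation, and the `ordered_target_stats` name, the `prior` value and the smoothing weight `a` are assumptions for illustration; it is not CatBoost's actual implementation.

```python
import random

def ordered_target_stats(categories, targets, prior, a=1.0, seed=0):
    """Encode each row using only targets of rows that precede it
    in a random permutation, so a row never sees its own label."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        s = sums.get(c, 0.0)
        n = counts.get(c, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded[i] = (s + a * prior) / (n + a)
        # Only now add this row's own target to the running totals.
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded

cats = ["a", "a", "b", "a"]
ys = [1, 0, 1, 1]
enc = ordered_target_stats(cats, ys, prior=0.5)
print(enc)
```

The first row visited in each category falls back to the prior, and later rows get a running, smoothed target mean that excludes their own label.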