	by turingbike 2573 days ago
	I don't know why it isn't as popular, but CatBoost should be on the list too https://catboost.ai/

3 comments

eden_h 2573 days ago

I tried Catboost when it came out. It should be very popular, as working with categories is where a lot of people seem to fall down in Random Forests.

The 'typical' response is either to turn them into a numeric variable, so 1-3 for 3 categories, or to make an individual column for each one. The first approach makes sense for ordinals, but not so much for actual categories, and the latter makes it difficult to group categories when two categories together have more predictive capability than any single one. I know that LightGBM did a lot of work here to optimise testing groups of variables, as testing every possible group in a large set is very intensive.
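
To make the two encodings concrete, here is a minimal sketch in plain Python. The column values are made up for illustration, and real pipelines would normally use a library encoder rather than hand-rolled dictionaries.

```python
# Hypothetical toy column; the category names are made up for illustration.
colors = ["red", "green", "blue", "green", "red"]

# Approach 1: integer codes. Sorting makes the mapping deterministic,
# but the resulting order (blue < green < red) is arbitrary, not ordinal.
categories = sorted(set(colors))
code_of = {cat: i + 1 for i, cat in enumerate(categories)}
integer_encoded = [code_of[c] for c in colors]

# Approach 2: one indicator column per category (one-hot).
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

print(integer_encoded)
for row in one_hot:
    print(row)
```

With the integer codes a tree sees one feature; with one-hot it sees one feature per category, so any grouping of categories has to be assembled across several columns.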

When I tried Catboost in R, I remember it downloading a large binary to work with, which put me off considerably, and predicting with it was pretty fragile, even for R. I trust Yandex about as much as I'd trust Google, but it seemed 'odd'.

autokad 2573 days ago

in kaggle, I often turn categoricals into numeric and call it a day (even if it's not ordinal). I have even found that forcing ordinality (like software versions in the microsoft malware competition) usually makes things worse in hold out.

spending too much time on categoricals is a waste of time, there are other things you can improve in your limited time, and even 'doing the right thing' results in poorer performance in hold out.

catboost is great, it ensembles wonderfully with xgboost. if you find it being fragile, you probably have outliers that need to be dropped - tree algs are really just fancy nearest neighbors so an outlier can ruin predictions considerably.

In general, lgbm trains fast and lets you try many things quickly, but almost always underperforms catboost and xgboost. catboost performs really well out of the box and you can generally get results quicker than with xgboost, but a well tuned xgboost is usually the best. since xgboost and catboost build trees differently and both perform really well, they make great friends in ensembles.
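
For anyone unsure what ensembling two models means in practice, the simplest version is a weighted average of their predictions. This is only a sketch: the `blend` helper and the numbers below are hypothetical stand-ins for real model outputs, and the weight would normally be picked on a hold-out set.

```python
def blend(predictions_a, predictions_b, weight_a=0.5):
    """Weighted average of two models' predictions for the same rows."""
    if len(predictions_a) != len(predictions_b):
        raise ValueError("both models must predict the same number of rows")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    return [
        weight_a * a + (1.0 - weight_a) * b
        for a, b in zip(predictions_a, predictions_b)
    ]


# Made-up numbers standing in for two models' hold-out predictions.
model_a_preds = [0.0, 1.0, 0.5]
model_b_preds = [1.0, 1.0, 0.0]
print(blend(model_a_preds, model_b_preds))
```

Averaging helps most when the two models make different mistakes, which is the point about the two libraries building trees differently.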

I have done pretty well on kaggle though I haven't invested much time, top 100 in zillow home price prediction

ScoutOrgo 2573 days ago

I think it is actually preferable to start by converting categorical variables to numeric most of the time, even if they are not ordinal. The RF algo can separate off an individual class with 2 splits (e.g. <=7 then >=7) if a single class is very important. The "pool" of features for RF sampling also doesn't get diluted with one-hot encoded classes from the one feature.

I am pretty sure I've seen this done successfully in kaggle a bunch before, but don't have any sources on hand for evidence that this method is "better". It does however make it much easier to just throw the data into the RF and check the feature importances to see which features are helping the most.
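
The two-split trick can be checked with a few lines of plain Python. This is just an illustration of the threshold logic, not how any RF library is implemented; the function name and the ten integer codes are made up.

```python
def isolate_with_two_splits(codes, target):
    """Keep the codes that pass `x <= target` and then `x >= target`."""
    after_first_split = [x for x in codes if x <= target]
    after_second_split = [x for x in after_first_split if x >= target]
    return after_second_split


codes = list(range(1, 11))  # hypothetical integer codes for 10 categories
print(isolate_with_two_splits(codes, 7))
```

Only the rows coded 7 survive both splits, so a single important class ends up in its own branch after two threshold tests on one feature.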

eden_h 2573 days ago

The only case it struggles with is when the grouping is difficult to achieve in a small number of splits, such as 1,3,5 against 2,4,6,7, especially when each split will need to show more predictive capability than any of the other column options.
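
A rough way to see the difference is to count how many threshold cut points a tree needs on the integer-coded feature to separate a group from the rest. The helper below is a hypothetical illustration, not part of any library.

```python
def thresholds_needed(codes, group):
    """Count cut points between adjacent codes where group membership flips."""
    ordered = sorted(codes)
    membership = [code in group for code in ordered]
    return sum(1 for a, b in zip(membership, membership[1:]) if a != b)


codes = [1, 2, 3, 4, 5, 6, 7]
print(thresholds_needed(codes, {7}))        # a single class at the end of the range
print(thresholds_needed(codes, {4}))        # a single class in the middle
print(thresholds_needed(codes, {1, 3, 5}))  # the interleaved grouping
```

A single class needs at most two cut points, while 1,3,5 against 2,4,6,7 needs five, and each of those splits has to win against every other candidate split on its own.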

pplonski86 2573 days ago

You are right, CatBoost is an amazing algorithm. However, you would be shocked by how many older senior data scientists have never heard of XGBoost or LightGBM. For them, CatBoost is far too new.

ramraj07 2573 days ago

Thank God I came across it in kaggle. Definitely amazing and it also gives easy python export of the model so you can put it as an udf in databases!

vladf 2568 days ago

Interesting -- could you elaborate on your use case here?

misterman0 2573 days ago

>> as an udf

What exactly is that?

kurthr 2573 days ago

Just a guess on my part, but UserDefinedFunction?

sheeshkebab 2573 days ago

User defined field perhaps?

cheez 2573 days ago

I looked into the site but I didn't see how CatBoost actually accomplishes its goals compared to other gradient boosting algorithms. Is there a summary somewhere?

sanxiyn 2573 days ago

Read "CatBoost: unbiased boosting with categorical features". https://arxiv.org/abs/1706.09516
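
As a very rough summary of one idea from that paper: categorical values are replaced by "ordered target statistics", where each row's category is encoded using only the targets of rows that come earlier in a random permutation, smoothed toward a prior. The sketch below is a simplified plain-Python illustration of that idea, not CatBoost's actual code; the function name and parameters are made up, and the real library uses several permutations and combines this with ordered boosting.

```python
import random


def ordered_target_statistics(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row's category from the targets of earlier rows only."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # random permutation of the rows

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        encoded[idx] = (s + prior_weight * prior) / (n + prior_weight)
        # Update the running statistics only after encoding this row,
        # so a row's own target never leaks into its own feature.
        sums[cat] = s + targets[idx]
        counts[cat] = n + 1
    return encoded


# Hypothetical toy data: two categories and a binary target.
cats = ["a", "b", "a", "a", "b"]
ys = [1.0, 0.0, 1.0, 0.0, 1.0]
print(ordered_target_statistics(cats, ys, prior=sum(ys) / len(ys)))
```

The first row seen for each category gets exactly the prior, and later rows get a smoothed average of the earlier targets for that category.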
