by eden_h 2577 days ago
This is a very thinly disguised advert for the author's product, and doesn't really advance on the benefits of either approach, as it doesn't go into any depth on why Random Forests/NNs are applicable to each type of data provided.

They're both generalised solvers, but default Random Forests aren't the most common Forest these days - LightGBM/XGBoost are both using Gradient Boosted Forests by default, which would be a much more interesting comparison to a NN's Gradient Boosting.

3 comments

I don't know why it isn't as popular, but CatBoost should be on the list too https://catboost.ai/
I tried Catboost when it came out. It should be very popular, as working with categories is where a lot of people seem to fall down in Random Forests.

The 'typical' response is either to turn them into a numeric variable, so 1-3 for 3 categories, or to make an individual column for each one. The first approach makes sense for ordinals, but not so much for actual categories, and the latter makes it difficult to group categories when a group of two categories together has more predictive capability than any single category. I know that LightGBM did a lot of work on this to optimise testing groups of variables, as testing every possible group in a large set is very intensive.
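As a rough illustration of the two encodings described above, here is a minimal pure-Python sketch. The `colors` column and the helper names `label_encode` and `one_hot_encode` are made up for this example; real projects would normally use a library encoder instead.

```python
# Toy nominal (unordered) categorical column; the values are made up.
colors = ["red", "green", "blue", "green", "red"]

def label_encode(values):
    """Map each category to an integer code (alphabetical order, starting at 1)."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)), start=1)}
    return [codes[v] for v in values], codes

def one_hot_encode(values):
    """Build one 0/1 indicator column per category."""
    cats = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in cats}

encoded, mapping = label_encode(colors)   # a single numeric column
columns = one_hot_encode(colors)          # one column per category
```

The integer codes impose an order (blue < green < red) that the data does not have, while the one-hot version grows by one column for every category.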

When I tried Catboost in R, I remember it downloading a large binary to work with, which put me off considerably, and predicting with it was pretty fragile, even for R. I trust Yandex about as much as I'd trust Google, but it seemed 'odd'.

in kaggle, I often turn categoricals into numeric codes and call it a day (even if it's not ordinal). I have even found that forcing ordinality (like software versions in the microsoft malware competition) usually makes things worse in hold out.

spending too much time on categoricals is a waste of time, there are other things you can improve in your limited time, and even 'doing the right thing' results in poorer performance in hold out.

catboost is great, it ensembles wonderfully with xgboost. if you find it being fragile, you probably have outliers that need to be dropped - tree algs are really just fancy nearest neighbors so an outlier can ruin predictions considerably.

In general, lgbm trains fast and lets you try many things quickly, but almost always underperforms catboost and xgboost. catboost performs really well out of the box and you can generally get results quicker than xgboost, but a well tuned xgboost is usually the best. since xgboost and catboost build trees differently and both perform really well, they make great friends in ensembles.
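To make the ensembling point concrete, here is a minimal sketch of a weighted-average blend of two models' predictions. All numbers are made up, and they were chosen so that the two models' errors partly cancel; `model_a` and `model_b` are stand-ins, not real xgboost or catboost output.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def blend(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two prediction lists."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(preds_a, preds_b)]

# Made-up targets and predictions, for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 18.0, 33.0, 39.0]   # errors: +2, -2, +3, -1
model_b = [9.0, 23.0, 28.0, 42.0]    # errors: -1, +3, -2, +2

blended = blend(model_a, model_b)
scores = {
    "a": rmse(y_true, model_a),
    "b": rmse(y_true, model_b),
    "blend": rmse(y_true, blended),
}
```

The blend only helps when the models make different mistakes, which is the argument for pairing libraries that build their trees differently.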

I have done pretty well on kaggle though I haven't invested much time, top 100 in zillow home price prediction

I think it is actually preferable to start by converting categorical variables to numeric most of the time, even if they are not ordinal. The RF algo can separate off individual classes with 2 splits (e.g. <=7 then >=7) if a single class is very important. The "pool" of features for RF sampling also doesn't get diluted with one hot encoded classes from the one feature.
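A small sketch of both points, under made-up numbers: the two threshold tests mirror the `<=7 then >=7` example, and `share_of_columns` is a hypothetical helper showing how much of the column pool one feature takes up under each encoding.

```python
def isolate_with_two_splits(code, target):
    """Apply two threshold splits in sequence: `<= target`, then `>= target`."""
    if code <= target:        # first split keeps codes up to `target`
        if code >= target:    # second split keeps only `target` itself
            return True
    return False

codes = range(1, 11)          # ten integer-coded categories (assumed)
selected = [c for c in codes if isolate_with_two_splits(c, 7)]

def share_of_columns(n_other_features, n_categories, one_hot):
    """Fraction of all columns that come from the single categorical feature."""
    cat_columns = n_categories if one_hot else 1
    return cat_columns / (n_other_features + cat_columns)

# Assumed setup: 9 other features plus one categorical with 10 levels.
share_integer = share_of_columns(9, 10, one_hot=False)   # 1 of 10 columns
share_one_hot = share_of_columns(9, 10, one_hot=True)    # 10 of 19 columns
```

With one-hot encoding, more than half of the columns available for random feature sampling come from that one feature in this setup.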

I am pretty sure I've seen this done successfully in kaggle a bunch before, but don't have any sources on hand for evidence that this method is "better". It does however make it much easier to just throw the data into the RF and check the feature importances to see which features are helping the most.

The only case it struggles with is when the grouping is difficult to achieve in a small number of splits, such as 1,3,5 against 2,4,6,7, especially when each split will need to show more predictive capability against any of the other column options.
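One rough way to see why the 1,3,5 against 2,4,6,7 case is hard: count the cut points between adjacent integer codes where group membership changes. This is a sketch, and `thresholds_needed` is a made-up helper; the count is only a lower bound on the threshold splits a tree would need.

```python
def thresholds_needed(group, all_codes):
    """Count cut points between adjacent codes where membership in `group` flips."""
    ordered = sorted(all_codes)
    return sum(
        (a in group) != (b in group)
        for a, b in zip(ordered, ordered[1:])
    )

codes = range(1, 8)                                   # categories coded 1..7
contiguous = thresholds_needed({1, 2, 3}, codes)      # one cut, between 3 and 4
single = thresholds_needed({4}, codes)                # two cuts, around 4
interleaved = thresholds_needed({1, 3, 5}, codes)     # a cut after every code up to 5
```

A contiguous block needs one cut and a single class in the middle needs two, but the interleaved grouping needs five, and each of those splits has to beat every other candidate column on its own.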
You are right, CatBoost is an amazing algorithm. However, you will be shocked when you talk with the many older, senior data scientists who have never heard of XGBoost or LightGBM. CatBoost is far too new for them.
Thank God I came across it in kaggle. Definitely amazing and it also gives easy python export of the model so you can put it as an udf in databases!
Interesting -- could you elaborate on your use case here?
>> as an udf

What exactly is that?

Just a guess on my part, but UserDefinedFunction?
User defined field perhaps?
I looked into the site but I didn't see how CatBoost actually accomplishes its goals compared to other gradient boosting algorithms. Is there a summary somewhere?
Read "CatBoost: unbiased boosting with categorical features". https://arxiv.org/abs/1706.09516
With the current low cost of cloud computing, there's no reason not to just try everything and see what happens (which is why AutoML has become more popular).

It's more pragmatic than trying to rationalize which framework is "best" for a given dataset, as the results are often counterintuitive.

On the contrary, I think this is one of the biggest emerging blockades to progress in ML/AI research, especially in academia. It has always been more cost-effective to run ML algorithms on consumer HW such as GeForce GPUs and gaming CPUs. It's frequently even faster than contemporary cloud offerings when the consumer HW gets ahead of existing enterprise HW. And it's so effective that HW companies started changing their EULAs and crippling previously available aspects of APIs to herd AI back into the datacenter where they seem to think it belongs.

And that IMO is a reinvention of the "Walled Garden" of academic HPC (ask any grad student begging and pleading for supercomputer time), which has always sucked, and its new commercial incarnation is even worse because it's unclear how to get commercial cloud time on government grants.

OTOH it's fine for large shops like OpenAI, DeepMind, AWS AI, FAIR, MS Research etc because they have deep deep pockets. So if you're content with most future groundbreaking research coming from a small tribe of market leaders, well great, but I suspect innovation is already slowing down because of this.

this approach is also very inefficient: AutoML takes hours searching for parameters when you could build the same model manually in a fraction of the time.
There's no reason you can't do both: prototype a simple model to get a baseline performance, then use AutoML to fine-tune it, plus you now have a value to sanity check against.

Those hours of hyperparameter search aren't blocking. You can do other things while it's searching, or do the search when not actively using the resources (e.g. overnight).
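A minimal sketch of that workflow, assuming a toy scoring function in place of real training: score a hand-picked baseline first, then let a random search try to beat it. `validation_score`, the parameter names, and the ranges are all hypothetical, and random search is used here as a simple stand-in for whatever an AutoML tool does internally.

```python
import random

def validation_score(params):
    """Stand-in for 'train a model and score it on hold-out' (hypothetical).

    Toy surface that peaks at learning_rate=0.1, depth=6.
    """
    return 1.0 - abs(params["learning_rate"] - 0.1) - 0.02 * abs(params["depth"] - 6)

def random_search(baseline_params, n_trials, seed=0):
    """Sample random candidates and keep the best; never returns worse than the baseline."""
    rng = random.Random(seed)
    best_params = baseline_params
    best_score = validation_score(baseline_params)
    for _ in range(n_trials):
        candidate = {
            "learning_rate": rng.uniform(0.01, 0.3),
            "depth": rng.randint(2, 10),
        }
        score = validation_score(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

baseline = {"learning_rate": 0.3, "depth": 3}      # quick manual prototype
baseline_score = validation_score(baseline)        # the sanity-check value
tuned_params, tuned_score = random_search(baseline, n_trials=50)
```

Because the search starts from the baseline score, a tuned result that barely improves on it is a signal that the extra compute was not worth much for that dataset.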

It depends on what algorithms you use in AutoML. If you decide to use simple algorithms: logistic regression, decision tree, random forest then you will have a simple model very quickly. Using Neural Networks in AutoML requires much more computational resources.
If you know the optimal hyperparameters from the beginning, then yes, you can build such a model manually. But in most cases you don't.
I don't like such a brute-force approach. Even if computing is cheap, the number of possible combinations of hyperparameters is huge! The Google Cloud AutoML Tables solution costs 20 USD per hour of computing (I guess that's because of an inefficient Neural Architecture Search algorithm). Running a few ML experiments can easily end with a huge bill.
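A back-of-the-envelope sketch of how quickly an exhaustive grid adds up. The grid values and the 5 minutes per fit are assumptions made up for this example; only the 20 USD per hour figure comes from the comment above.

```python
import math

# Hypothetical grid for a gradient-boosting model; the values are illustrative.
grid = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
    "max_depth": [3, 5, 7, 9],
    "n_estimators": [100, 300, 1000],
    "subsample": [0.6, 0.8, 1.0],
}

def grid_size(grid):
    """Number of combinations in an exhaustive grid search."""
    return math.prod(len(values) for values in grid.values())

def search_cost_usd(n_combinations, minutes_per_fit, usd_per_hour):
    """Total cost if every combination is fitted once, one after another."""
    return n_combinations * minutes_per_fit / 60 * usd_per_hour

n = grid_size(grid)                                            # 4 * 4 * 3 * 3
cost = search_cost_usd(n, minutes_per_fit=5, usd_per_hour=20)  # assumed 5 min per fit
```

Four modest parameters already give 144 combinations, which is 12 hours and 240 USD under these assumptions, before any cross-validation folds multiply it further.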
For independent ML hobbyists, sure. But if you're a business which can get a positive ROI from it, it's well worth the time.
It depends. Complexity creeps up easily in ML systems. A kitchen-sink of gazillions of algorithms (and code bases) in an ensemble creates a very brittle system I wouldn't want to deal with.
Not to mention the article is full of grammar and spelling errors. Not a great look if you are trying to promote your product.