Hacker News new | ask | show | jobs
by sh33mp 2974 days ago
A funny thing about decision trees (or random forests) is how conceptually simple they are, but in terms of implementation they're very non-trivial.

There's always a point in the lecture or explanation where they go

So we just find the optimal split/feature based on entropy

which no one talks a ton about, but naively implemented is something on the order of O(kNlogN). For each split. Multiply that by the number of leaves (2^depth), and multiply that by the number of trees in your forest.

I learned this the hard way when I tried implementing random forests on GPU for a class (would not recommend: efficiently forming decision trees seem to involve a lot of data copying and shifting around). I actually learned a lot from reading sklearn's implementation of decision trees in Cython - it uses quite a number of neat tricks to make things really fast.

3 comments

For optimizations, have a look at section 3 of the XGBoost paper: https://arxiv.org/pdf/1603.02754.pdf. LightGBM has similar features (e.g. binning: http://lightgbm.readthedocs.io/en/latest/Parameters.html#io-...).

Scikit is shockingly slow, in comparison. Also bloated, but that's more a matter of 1) not having a "release" impl that ditches data only useful for debugging, and 2) using 64-bit data types all over the place, despite running in parallel arrays! (https://github.com/scikit-learn/scikit-learn/blob/master/skl...)

In random forests you don't really need to find the optimal split. You usually generate a number of random splits and select the best one.
I believe the formulation of random forests requires you to find the optimal split, albeit over a subset of features.

What you're talking about, where you simply generate a set of random splits across features, is Extremely Randomized Trees (https://link.springer.com/article/10.1007%2Fs10994-006-6226-...).

Since we're splitting hairs instead of training sets: Classical RFs select one random feature and choose the optimal split, whereas Extremely RTs choose the best feature (out of a random subset) whereby for each feature only one random split is tested.

Another difference is that RFs use a bootstrapped dataset and ERTs use the full dataset.

Ah, I was under the impression that RFs choose from a subset of features, not just one feature.

In any case, I agree with the thrust of your original comment that the specifications of the RF algorithm can be relaxed, usually for performance reasons, and still retain strong performance. But this goes back to my original comment that the performance considerations of random forests often aren't highlighted to new learners (whereas introducing ERTs to a beginner would probably shock them - how could you take totally random splits and still get any reasonable performance!)

> Ah, I was under the impression that RFs choose from a subset of features, not just one feature.

You are correct. For classification the usual rule of thumb is to select square root of the number of features.

Thanks for sharing. I will take a look at Cython implementation, that should help me learn a few things.