|
|
|
|
|
by sh33mp
2974 days ago
|
|
A funny thing about decision trees (or random forests) is how conceptually simple they are, but in terms of implementation they're very non-trivial. There's always a point in the lecture or explanation where they go So we just find the optimal split/feature based on entropy which no one talks a ton about, but naively implemented is something on the order of O(kNlogN). For each split. Multiply that by the number of leaves (2^depth), and multiply that by the number of trees in your forest. I learned this the hard way when I tried implementing random forests on GPU for a class (would not recommend: efficiently forming decision trees seem to involve a lot of data copying and shifting around). I actually learned a lot from reading sklearn's implementation of decision trees in Cython - it uses quite a number of neat tricks to make things really fast. |
|
Scikit is shockingly slow, in comparison. Also bloated, but that's more a matter of 1) not having a "release" impl that ditches data only useful for debugging, and 2) using 64-bit data types all over the place, despite running in parallel arrays! (https://github.com/scikit-learn/scikit-learn/blob/master/skl...)