by tlarkworthy 4722 days ago
It's random forests ... each tree is trained on a subset of the data. You can split the massive dataset into chunks and train independently. That sidesteps the "big data" hangup.

If you look at the scikit-learn implementation, each tree emits a normalised probability vector for each prediction, and those vectors are simply averaged to get the aggregate prediction, so it's not very difficult to do yourself.
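
A minimal sketch of that do-it-yourself aggregation, assuming each independently trained tree hands back one normalised class-probability vector per prediction. The function names here are hypothetical, not scikit-learn API. scikit-learn documents its forest predict_proba as the mean of the per-tree class probabilities; a multiply-and-renormalise variant is included for comparison.

```python
def average_probabilities(per_tree_probs):
    """Average normalised class-probability vectors, one vector per tree."""
    if not per_tree_probs:
        raise ValueError("need at least one probability vector")
    n_classes = len(per_tree_probs[0])
    totals = [0.0] * n_classes
    for probs in per_tree_probs:
        if len(probs) != n_classes:
            raise ValueError("all vectors must have the same length")
        for k, p in enumerate(probs):
            totals[k] += p
    n_trees = len(per_tree_probs)
    return [t / n_trees for t in totals]


def product_probabilities(per_tree_probs):
    """Multiply the vectors element-wise, then renormalise to sum to 1."""
    if not per_tree_probs:
        raise ValueError("need at least one probability vector")
    n_classes = len(per_tree_probs[0])
    products = [1.0] * n_classes
    for probs in per_tree_probs:
        if len(probs) != n_classes:
            raise ValueError("all vectors must have the same length")
        for k, p in enumerate(probs):
            products[k] *= p
    total = sum(products)
    if total == 0.0:
        # Every class was given probability 0 by at least one tree.
        raise ValueError("product is zero for every class")
    return [p / total for p in products]


# Two trees trained on different chunks, scoring the same sample.
votes = [[0.5, 0.5], [1.0, 0.0]]
averaged = average_probabilities(votes)
multiplied = product_probabilities(votes)
```

Note the difference in behaviour: with the product, a single tree that assigns probability 0 to a class vetoes that class outright, whereas averaging only lowers its score.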

Regardless, you are still applying a batch learning technique. For big data you want an incremental learner.
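
For contrast, a rough sketch of what an incremental learner looks like: a tiny logistic regression updated one example at a time with SGD, so the full dataset never has to sit in memory. The class, its partial_fit method name, and the toy stream are all illustrative assumptions, not any library's implementation.

```python
import math


class OnlineLogisticRegression:
    """Binary logistic regression trained one example at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Return the estimated probability that the label of x is 1."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, x, y):
        """Take one gradient step on the log loss for a single (x, y) pair."""
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


def toy_stream(n_rounds):
    """Hypothetical data stream: label is 1 for (1, 0) and 0 for (0, 1)."""
    for _ in range(n_rounds):
        yield [1.0, 0.0], 1
        yield [0.0, 1.0], 0


model = OnlineLogisticRegression(n_features=2)
for features, label in toy_stream(500):
    # Each example is seen once and then discarded.
    model.partial_fit(features, label)
```

Memory use depends only on the number of features, not on how many examples flow past, which is the property that matters for data too large to load.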

2 comments

The training subset for each tree can still be quite large. Note that most of the implementations failed on their 12 GB dataset.

Although I'm a big believer in streaming/online machine learning, it's not necessarily the best solution. There are many cases when batch is the better option, especially for big data. Anything historical, really.

I was thinking the same thing. Ensemble methods like this scale very well with the number of machines, at least with a moderate amount of coordination effort.