by tlarkworthy 4722 days ago

It's random forests ... each tree is trained on a subset of the data. You can split the massive dataset into chunks and train on each chunk independently. That sidesteps the "big data" hangup.

If you look at the scikit-learn implementation, each tree emits a normalised probability vector for each prediction, and those vectors are simply averaged to get the aggregate prediction, so it's not very difficult to do yourself.
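For what it's worth, scikit-learn's `RandomForestClassifier.predict_proba` reports the mean of the per-tree class probabilities, so combining models trained on separate chunks can be done the same way. A rough sketch in plain Python, assuming every model outputs probabilities over the same classes in the same order; the function names and the numbers are made up for illustration:

```python
def average_probabilities(per_model_probs):
    """Element-wise mean of class-probability vectors, one vector per model.

    Assumes every vector uses the same class ordering.
    """
    if not per_model_probs:
        raise ValueError("need at least one probability vector")
    n_classes = len(per_model_probs[0])
    if any(len(p) != n_classes for p in per_model_probs):
        raise ValueError("all vectors must cover the same classes")
    n_models = len(per_model_probs)
    return [sum(p[k] for p in per_model_probs) / n_models
            for k in range(n_classes)]


def predict_class(per_model_probs):
    """Index of the class with the highest averaged probability."""
    avg = average_probabilities(per_model_probs)
    return max(range(len(avg)), key=avg.__getitem__)


# Made-up outputs for one sample from three models,
# each trained on a different chunk of the data.
chunk_probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.1, 0.3],
]
combined = average_probabilities(chunk_probs)
winner = predict_class(chunk_probs)
```

If the per-chunk forests have different numbers of trees, weighting each vector by its tree count would probably be closer to what a single forest over all the trees reports.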

Regardless, you are still applying a batch learning technique. For big data you want an incremental learner.
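To make "incremental" concrete: the model sees one chunk at a time and never needs the whole dataset in memory. Here is a toy sketch using a plain perceptron, which is a swapped-in stand-in rather than anything from the benchmark being discussed. The `OnlinePerceptron` class and the `chunks()` generator are hypothetical; the `partial_fit` name just mirrors the method scikit-learn exposes on estimators such as `SGDClassifier`.

```python
class OnlinePerceptron:
    """Binary perceptron updated one chunk at a time; labels are +1 / -1."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score >= 0 else -1

    def partial_fit(self, X, y):
        # Classic perceptron rule: adjust the weights only on mistakes.
        for x, label in zip(X, y):
            if self.predict(x) != label:
                self.w = [wi + label * xi for wi, xi in zip(self.w, x)]
                self.b += label
        return self


def chunks():
    # Hypothetical stream; stands in for reading blocks off disk.
    yield [[2.0, 1.0], [-1.0, -2.0]], [1, -1]
    yield [[1.5, 0.5], [-2.0, -1.0]], [1, -1]
    yield [[3.0, 2.0], [-0.5, -1.5]], [1, -1]


model = OnlinePerceptron(n_features=2)
for X_chunk, y_chunk in chunks():
    # Each chunk can be discarded once it has been processed.
    model.partial_fit(X_chunk, y_chunk)
```

Memory use stays bounded by the chunk size, which is the property a batch learner like a random forest lacks.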

2 comments

msellout 4722 days ago

The training subset for each tree can still be quite large. Note that most of the implementations failed on their 12 GB dataset.

Although I'm a big believer in streaming/online machine learning, it's not necessarily the best solution. There are many cases when batch is the better option, especially for big data. Anything historical, really.

bayesianhorse 4722 days ago

I was thinking the same thing. Ensemble methods like this scale very well with the number of machines, and need only a moderate amount of coordination.