Hacker News
by drzoltar 1422 days ago
I think another aspect is that most modern GBT models prefer the entire dataset to be in memory, thereby doing a full scan of the data for each iteration to calculate the optimal split point. That’s hard to compete with if your batch size is small in a NN model.
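To make the "full scan for each split" point concrete, here is a minimal sketch of an exact split search on a single feature, assuming a squared-error objective on residuals. The function name `best_split` and the toy numbers are made up for illustration. Production GBT libraries use more elaborate variants, so read this as the idea and not as any library's implementation.

```python
def best_split(xs, residuals):
    """Return (threshold, gain) for the best split of one feature.

    Every sample is visited once after sorting, which is why the whole
    column has to be available for each split decision.
    """
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total = float(sum(residuals))
    best_gain, best_thr = 0.0, None
    left_sum = 0.0
    for rank in range(n - 1):
        i, j = order[rank], order[rank + 1]
        left_sum += residuals[i]
        if xs[i] == xs[j]:
            continue  # no threshold fits between two equal values
        n_left = rank + 1
        n_right = n - n_left
        right_sum = total - left_sum
        # Reduction in squared error if each side predicts its own mean.
        gain = left_sum**2 / n_left + right_sum**2 / n_right - total**2 / n
        if gain > best_gain:
            best_gain, best_thr = gain, (xs[i] + xs[j]) / 2
    return best_thr, best_gain


# Toy column and residuals, made up for illustration.
threshold, gain = best_split([1, 2, 3, 10, 11, 12], [-1, -1, -1, 1, 1, 1])
```

The scan picks the threshold between 3 and 10 because that is the only cut that separates the negative residuals from the positive ones. A mini-batch that happened to miss one side would not see that cut at all.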
2 comments

that's an interesting idea. but at the end of the paper they do an analysis of the effect of different hyperparameters for the nets with their dataset and find that the batch size doesn't seem to matter much. (although they're trying size ranges like [256, 512, 1024] as opposed to turning batching off entirely)
The issue isn’t batch size as a parameter but rather needing to load the entire dataset into memory
> thereby doing a full scan of the data for each iteration to calculate the optimal split point

> (although they're trying size ranges like [256, 512, 1024] as opposed to turning batching off entirely)

> The issue isn’t batch size as a parameter but rather needing to load the entire dataset into memory

what's stored in memory is an implementation detail. the key idea is that the tree algorithms are choosing an optimal split based on the entire dataset, whereas sgd is working on small randomly chosen batches. turning off batching means computing gradients on the entire dataset instead.
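The distinction can be sketched in a few lines, assuming a plain logistic-regression loss. The data, the batch size of 32 and the helper name `grad` are all made up for illustration.

```python
import math
import random


def grad(w, rows):
    """Mean logistic-loss gradient over `rows` of (features, label)."""
    g = [0.0] * len(w)
    for x, y in rows:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        for k, xk in enumerate(x):
            g[k] += (p - y) * xk
    return [gk / len(rows) for gk in g]


random.seed(0)
# Toy data, made up: the label is 1 when the two features sum to a positive number.
data = []
for _ in range(1000):
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    data.append((x, 1.0 if x[0] + x[1] > 0 else 0.0))

w = [0.0, 0.0]
full = grad(w, data)                     # batching off: every sample contributes
mini = grad(w, random.sample(data, 32))  # one SGD step sees only 32 samples
```

`full` is the exact gradient for the current weights, while `mini` is a noisy estimate of it that changes with every draw.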

although the typical bottleneck in gpu computing is moving data to and from the gpu's workarea (which is probably why you mention memory), there is nothing theoretical that says these computations could not be implemented in a streaming manner.
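A minimal sketch of that streaming idea, again assuming a logistic-regression loss: the full-dataset gradient is a sum over samples, so it can be accumulated one chunk at a time. `in_chunks` is a hypothetical stand-in for whatever would read chunks from disk, and the rows are made up.

```python
import math


def grad_sum(w, rows):
    """Unnormalised logistic-loss gradient summed over `rows`."""
    g = [0.0] * len(w)
    for x, y in rows:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        for k, xk in enumerate(x):
            g[k] += (p - y) * xk
    return g


def streamed_full_gradient(w, chunks):
    """Mean gradient over all chunks, holding one chunk at a time."""
    total = [0.0] * len(w)
    n = 0
    for chunk in chunks:  # `chunks` can be a generator
        part = grad_sum(w, chunk)
        total = [t + p for t, p in zip(total, part)]
        n += len(chunk)
    return [t / n for t in total]


def in_chunks(rows, size):
    """Hypothetical stand-in for a disk reader: yields `size` rows at a time."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Toy rows, made up for illustration.
rows = [([1.0, 2.0], 1.0), ([0.5, -1.0], 0.0), ([-2.0, 0.3], 0.0),
        ([1.5, 1.5], 1.0), ([-0.7, -0.2], 0.0)]
w = [0.1, -0.2]
streamed = streamed_full_gradient(w, in_chunks(rows, 2))
in_memory = [g / len(rows) for g in grad_sum(w, rows)]
```

The two results agree up to floating-point rounding, so the whole-dataset gradient does not require the whole dataset to be resident at once. What it does require is one full pass over the data per update.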

Oh, you touched my favorite topic of whole dataset training.

Take a look at [1] and go straight to page 8, figure 2(b).

[1] http://proceedings.mlr.press/v48/taylor16.pdf

The paper talks about whole-dataset training, and one of the datasets used is HIGGS [2]. Figure 2(b) shows two whole-dataset training approaches (L-BFGS and ADMM) vs SGD. SGD basically tops out at the accuracy at which both whole-dataset approaches start.

[2] https://archive.ics.uci.edu/ml/datasets/HIGGS#

HIGGS is a strange dataset. It is narrow, having only 28 features (29 columns counting the class label). It is also relatively long, about 11M samples (10M to train, 0.5M to validate and the last 0.5M to test). It is also hard to get right with SGD.

But if you perform whole-dataset optimization, even logistic regression can get you good accuracy [3] (some experiments of mine).

[3] https://github.com/thesz/higgs-logistic-regression

they also do subsampling of the data though.