| HN Mirror


	by drzoltar 1422 days ago
	I think another aspect is that most modern GBT models prefer the entire dataset to be in memory, thereby doing a full scan of the data for each iteration to calculate the optimal split point. That’s hard to compete with if your batch size is small in a NN model.
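To make the "full scan for the optimal split point" concrete, here is a minimal sketch of an exact split search on one feature under squared error. The function name `best_split` and the toy data are made up for illustration. Real GBT libraries fit gradient statistics rather than raw targets and often use histogram approximations, so treat this only as the basic idea.

```python
def best_split(xs, ys):
    """Return (threshold, sse) of the best single-feature split.

    Every candidate threshold is scored against the *entire* dataset,
    which is the property the comment above is pointing at.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total_sum = sum(y for _, y in pairs)
    total_sq = sum(y * y for _, y in pairs)

    best_threshold, best_sse = None, float("inf")
    left_sum = left_sq = 0.0
    for i in range(n - 1):
        x, y = pairs[i]
        left_sum += y
        left_sq += y * y
        next_x = pairs[i + 1][0]
        if x == next_x:
            continue  # cannot place a threshold between equal feature values
        n_left, n_right = i + 1, n - i - 1
        right_sum = total_sum - left_sum
        right_sq = total_sq - left_sq
        # Sum of squared errors of a group: sum(y^2) - (sum y)^2 / count
        sse = (left_sq - left_sum ** 2 / n_left) + (right_sq - right_sum ** 2 / n_right)
        if sse < best_sse:
            best_threshold, best_sse = (x + next_x) / 2, sse
    return best_threshold, best_sse


# Toy data (hypothetical): the target jumps between x = 3 and x = 10.
threshold, sse = best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
print(threshold, sse)
```

After sorting, the running sums let each threshold be scored in constant time, so one pass over all rows evaluates every candidate split for that feature.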

2 comments

a-dub 1422 days ago

that's an interesting idea. but at the end of the paper they do an analysis of the effect of different hyperparameters for the nets with their dataset and find that the batch size doesn't seem to matter much. (although they're trying size ranges like [256, 512, 1024] as opposed to turning batching off entirely)


alexcnwy 1422 days ago

The issue isn’t batch size as a parameter but rather needing to load the entire dataset into memory


a-dub 1422 days ago

> thereby doing a full scan of the data for each iteration to calculate the optimal split point

> (although they're trying size ranges like [256, 512, 1024] as opposed to turning batching off entirely)

> The issue isn’t batch size as a parameter but rather needing to load the entire dataset into memory

what's stored in memory is an implementation detail. the key idea is that the tree algorithms are choosing an optimal split based on the entire dataset, whereas sgd is working on small randomly chosen batches. turning off batching means computing gradients on the entire dataset instead.
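A small sketch of that distinction, assuming a one-parameter least-squares model and synthetic data (every name here is illustrative): a minibatch gradient is a noisy estimate of the full gradient, and once the batch covers the whole dataset the two coincide.

```python
import random


def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def minibatch_grad(w, xs, ys, batch_size, rng):
    """Same gradient, but computed on a random sample without replacement."""
    idx = rng.sample(range(len(xs)), batch_size)
    return grad_mse(w, [xs[i] for i in idx], [ys[i] for i in idx])


rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]  # synthetic targets

w = 0.0
full = grad_mse(w, xs, ys)                       # "batching turned off"
small = minibatch_grad(w, xs, ys, 32, rng)       # typical SGD step: noisy estimate
whole = minibatch_grad(w, xs, ys, len(xs), rng)  # batch size == dataset size
print(full, small, whole)
```

With `batch_size == len(xs)` the sample is just a permutation of all rows, so the result matches the full gradient up to floating-point summation order.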

although the typical bottleneck in gpu computing is moving data to and from the gpu's workarea (which is probably why you mention memory), there is nothing theoretical that says these computations could not be implemented in a streaming manner.
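As a sketch of the streaming point, the exact full-dataset gradient can be accumulated chunk by chunk without ever holding all rows at once. The `stream_chunks` generator below is a hypothetical stand-in for reading from disk or a network source, and the linear model is chosen only to keep the example short.

```python
def stream_chunks(n_rows, chunk_size):
    """Hypothetical data source: yields (xs, ys) one chunk at a time."""
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        xs = [((i * 37) % 101) / 101.0 for i in range(start, stop)]
        ys = [2.0 * x + 1.0 for x in xs]  # synthetic targets from y = 2x + 1
        yield xs, ys


def streamed_full_gradient(w, b, chunks):
    """Exact full-dataset MSE gradient for y = w*x + b, accumulated per chunk."""
    gw = gb = 0.0
    n = 0
    for xs, ys in chunks:
        for x, y in zip(xs, ys):
            err = w * x + b - y
            gw += 2.0 * err * x
            gb += 2.0 * err
        n += len(xs)
    return gw / n, gb / n


# Same gradient whether the data arrives in 64-row chunks or as one block.
g_streamed = streamed_full_gradient(0.0, 0.0, stream_chunks(1000, 64))
g_one_shot = streamed_full_gradient(0.0, 0.0, stream_chunks(1000, 1000))
print(g_streamed, g_one_shot)
```

Only the running sums and the current chunk live in memory, which is why memory residency is separable from the "whole dataset" property.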


thesz 1418 days ago

Oh, you touched my favorite topic of whole dataset training.

Take a look at [1] and go straight to page 8, figure 2(b).

[1] http://proceedings.mlr.press/v48/taylor16.pdf

The paper talks about whole dataset training, and one of the datasets used is HIGGS [2]. Figure 2(b) shows two whole dataset training approaches (L-BFGS and ADMM) vs SGD. SGD basically tops out at the accuracy where both whole dataset approaches start.

[2] https://archive.ics.uci.edu/ml/datasets/HIGGS#

HIGGS is a strange dataset. It is narrow, with only 28 features (29 columns counting the class label). It is also relatively long, about 11M samples (10M to train, 0.5M to validate and the last 0.5M to test). It is also hard to get right with SGD.

But if you perform whole dataset optimization, even linear regression can get you good accuracy [3] (some experiments of mine).

[3] https://github.com/thesz/higgs-logistic-regression
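For readers who want to see what "whole dataset optimization" of a linear classifier looks like in the simplest form, here is a minimal sketch. It uses plain full-batch gradient descent on synthetic data, not L-BFGS or ADMM, not HIGGS, and it is not the code from the linked repository. All names are illustrative.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logreg_full_batch(X, y, steps=200, lr=0.5):
    """Logistic regression where every step uses the gradient over all rows."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - label
            for j in range(d):
                gw[j] += err * row[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def accuracy(w, b, X, y):
    """Fraction of rows whose predicted class (score > 0) matches the label."""
    hits = 0
    for row, label in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, row)) + b
        hits += int((score > 0) == (label == 1))
    return hits / len(X)


rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
y = [1 if x0 + 2 * x1 > 0 else 0 for x0, x1 in X]  # synthetic, linearly separable

w, b = fit_logreg_full_batch(X, y)
print(accuracy(w, b, X, y))
```

Each update sees every sample, so there is no minibatch noise. A quasi-Newton method such as L-BFGS would replace the fixed-step update while consuming the same full-dataset gradient.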


anothernewdude 1421 days ago

they also do subsampling of the data though.
