> thereby doing a full scan of the data for each iteration to calculate the optimal split point
> (although they're trying size ranges like [256, 512, 1024] as opposed to turning batching off entirely)
> The issue isn’t batch size as a parameter but rather needing to load the entire dataset into memory
What's stored in memory is an implementation detail. The key idea is that the tree algorithms choose an optimal split based on the entire dataset, whereas SGD works on small, randomly chosen batches. Turning off batching means computing gradients on the entire dataset instead.
Although the typical bottleneck in GPU computing is moving data to and from the GPU's work area (which is probably why you mention memory), nothing in theory says these computations could not be implemented in a streaming manner.
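To make the streaming point concrete, here is a minimal sketch, assuming a mean-squared-error loss on a linear model and synthetic data (the function name `full_gradient_streaming` and the chunk size of 128 are made up for illustration). It accumulates the full-dataset gradient one chunk at a time, so only one chunk needs to be resident at once, and the result matches the gradient computed on the whole array.

```python
import numpy as np

def full_gradient_streaming(chunks, w):
    """Full-dataset gradient of mean squared error, accumulated chunk by chunk."""
    grad = np.zeros_like(w)
    n = 0
    for X, y in chunks:
        residual = X @ w - y      # per-sample errors for this chunk only
        grad += X.T @ residual    # add this chunk's (unnormalized) contribution
        n += len(y)
    return 2.0 * grad / n         # normalize once, by the total sample count

# Synthetic data, purely for illustration (not HIGGS).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.arange(1.0, 6.0) + rng.normal(scale=0.1, size=1000)
w = np.zeros(5)

# A generator of chunks stands in for reading from disk or over a network.
chunks = ((X[i:i + 128], y[i:i + 128]) for i in range(0, len(y), 128))
g_stream = full_gradient_streaming(chunks, w)

# Reference: the same gradient computed with everything in memory.
g_full = 2.0 * X.T @ (X @ w - y) / len(y)
print(np.allclose(g_stream, g_full))
```

The difference from SGD is only where the update happens: here the weights change once per full pass, while SGD would update after every chunk.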
The paper discusses whole-dataset training, and one of the datasets it uses is HIGGS [2]. Figure 2(b) shows two whole-dataset training approaches (L-BFGS and ADMM) against SGD. SGD basically tops out at the accuracy where both whole-dataset approaches start.
HIGGS is a strange dataset. It is narrow, with only 28 features (29 columns counting the label). It is also relatively long, about 11M samples (10M to train, 0.5M to validate and the last 0.5M to test). It is also hard to get right with SGD.
But if you perform whole dataset optimization, even linear regression can get you good accuracy [3] (some experiments of mine).
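As a rough sketch of what whole-dataset optimization can look like for linear regression (this is not the code behind [3]; the data is synthetic and the function name `fit_least_squares_streaming` is hypothetical), the normal equations can be accumulated over chunks in a single pass and then solved exactly:

```python
import numpy as np

def fit_least_squares_streaming(chunks, n_features):
    """Exact least-squares fit from one pass over the data in chunks."""
    A = np.zeros((n_features, n_features))  # running sum of X^T X
    b = np.zeros(n_features)                # running sum of X^T y
    for X, y in chunks:
        A += X.T @ X
        b += X.T @ y
    return np.linalg.solve(A, b)            # solve the normal equations

# Synthetic, well-conditioned data for illustration only (not HIGGS).
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(scale=0.05, size=2000)

chunks = ((X[i:i + 256], y[i:i + 256]) for i in range(0, len(y), 256))
w_stream = fit_least_squares_streaming(chunks, n_features=4)

# Reference solution using the full matrix at once.
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_stream, w_ref))
```

For a narrow dataset the accumulated matrix is tiny regardless of how many rows there are, which is why the number of samples is not the limiting factor here.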