
A GPU database isn't that useful, because the arithmetic intensity (ops/byte) is relatively low. Cross-sectional memory bandwidth is what really matters; you can get similar effects with a cluster of appropriately provisioned CPU machines, with a shard or a replica of the database on each machine. I say this as someone who has written a GPU in-memory database of sorts that is used at Facebook (Faiss). What is interesting is if you can tie the lookup to something that has higher arithmetic intensity before or after it on the GPU.

GPUs are only really being used for machine learning due to the sequential dependence of SGD and the relatively high arithmetic intensity (flops/byte) of convolutions or certain GEMMs. The faster you can take a gradient descent step, the shorter the wall-clock time to converge, and you would lose to limited memory reuse (for conv/GEMM) or to communication overhead or latency if you attempted to split a single computation across multiple nodes. The Volta "tensor cores" (fp16 units) make the GPU less arithmetic-bound for operations such as convolution that require a GEMM-like operation, but the fact that the memory bandwidth did not increase by a similar factor means that Volta is fairly unbalanced.
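To make the ops/byte argument concrete, here is a rough back-of-the-envelope sketch in Python. It compares a database-style scan with a square GEMM under a simple roofline model. The function names are made up for illustration, and the peak and bandwidth figures are hypothetical round numbers, not specs of any real GPU. The GEMM byte count also assumes ideal reuse (each matrix touched once), which real kernels only approximate.

```python
# Rough roofline arithmetic. All hardware numbers are hypothetical
# round figures chosen for illustration, not real device specs.

def scan_intensity(ops_per_value=1, bytes_per_value=4):
    """Ops per byte for a scan: e.g. one compare per 4-byte value read."""
    return ops_per_value / bytes_per_value

def gemm_intensity(n, bytes_per_element=4):
    """Flops per byte for an n x n x n GEMM, assuming ideal reuse:
    2*n^3 flops, and A, B, C each moved through memory exactly once."""
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * bytes_per_element
    return flops / bytes_moved

def attainable_flops(intensity, peak_flops, bandwidth):
    """Simple roofline: limited by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth)

def ridge_point(peak_flops, bandwidth):
    """Intensity (flops/byte) needed before the device is compute-bound."""
    return peak_flops / bandwidth

if __name__ == "__main__":
    peak = 10e12       # hypothetical: 10 Tflop/s
    bw = 500e9         # hypothetical: 500 GB/s
    for name, ai in [("scan", scan_intensity()),
                     ("gemm n=1024", gemm_intensity(1024))]:
        rate = attainable_flops(ai, peak, bw)
        print(f"{name}: {ai:.2f} ops/byte, {rate / peak:.1%} of peak")
    # Raising peak without raising bandwidth pushes the ridge point up,
    # so more kernels end up memory-bound.
    print("ridge:", ridge_point(peak, bw), "->", ridge_point(8 * peak, bw))
```

Under these assumed numbers the scan sits far below the ridge point and is limited by bandwidth alone, while the GEMM is compute-bound. Multiplying peak throughput without a matching bandwidth increase moves the ridge point up by the same factor, which is the sense in which a machine becomes "unbalanced".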

The point about Intel not increasing their headline performance by as much as GPUs is also misleading. Intel CPUs are very good at branchy code and are latency-optimized rather than throughput-optimized (as far as a general-purpose computer can be). Not everything we want to do, even in deep learning, will necessarily run well on a throughput-optimized machine.