"Similar performance" still means 30%-50% slower [1] and half the RAM, not really that comparable.
For much closer performance you should get a 2080ti, which should be roughly comparable in speed and have 11GB [edit: wrongly wrote 14GB before] of memory (against the 16GB for the V100). Price-wise you still save a lot of money, after quickly googling around, roughly $1200 vs. $15k-$20k.
But you still lose something, e.g. if you use half precision on V100 you get virtually double speed, if you do on a 1080 / 2080 you get... nothing because it's not supported.
(and more importantly for companies, you can actually use only V100-style stuff on servers [edit: as you mentioned already, although I'm not 100% sure it's just drivers that are the issue?])
[1] I've not used the 1080 myself, but I've used the 1080ti and the V100 extensively, and the latter is about 30% faster. Hence my estimate for the comparison with the 1080.
For my workload (optical flow) I was honestly surprised to see that the Google Cloud V100 was not faster than my local GTX 1080. So I guess that varies a lot by how you're training, too.
For many of my AI training workloads, the 1080 is already "fast enough" and the CPU or SSDs are the bottleneck. In that case, the GPU doesn't really matter that much.
Yes that might be the case. In my case I mostly trained big (tens to hundreds of millions of parameters) networks mostly made of 3x3 convolutions, and I think the V100 has dedicated hardware for that. Then as I mentioned you can get a further 2x speedup by using half precision.
If you train smaller models, or RNNs, you probably lose most of the gains of dedicated hardware. But I guess that, for this same reason, the experiments in the article are little more than a provocation. I don't know if you could train a big network in finite time on M1 chips...
That said, of course, if the budget was mine, I wouldn't buy a V100 :-)
> But you still lose something, e.g. if you use half precision on V100 you get virtually double speed, if you do on a 1080 / 2080 you get... nothing because it's not supported.
There's one for PyTorch, I tested it about a year ago. You have to compile it from scratch and IIRC it translates/compiles CUDA to ROCm at runtime, which causes noticeable pauses on the first run. There may be other tweaks you have to do too. Once set up it performs decently, though.
> The special thing about the V100 is that its driver EULA allows data center usage.
Wait what? Is it the only thing?
That sounds hard to believe: if true, using the open driver (Nouveau) instead of Nvidia's proprietary one would be a massive money saver for datacenter operators (and even if Nouveau doesn't support the features you'd want already, supporting their development would be much cheaper for a company like Amazon than paying a premium on every GPU they buy).
Other characteristics of V100 that may be interesting to people buying GPUs for data centers:
- higher capacity GPU memory. 1080 has 8 GB, V100 has 16 or 32 GB.
- higher bandwidth GPU memory. V100 has HBM2 with a peak of 900 GB/s, 1080 has GDDR5X (G5X) with a peak of ~300 GB/s.
- ECC support.
- data center certification + warranty
(The GeForce warranty covers normal consumer usage, like gaming, and does not cover datacenter use)
- availability of enterprise support contracts.
(If you are buying a ton of GPUs to put in a datacenter, you probably don't want to end up on the normal consumer support line when something goes wrong)
A GTX 1080 manages ~9 TFLOPS (fp32) (and has terrible fp16 support), whereas the V100 gets ~15 TFLOPS (fp32), ~30 TFLOPS (fp16), and ~120 TFLOPS (tensor cores).
Apart from one being a gaming product and the other being designed for computational tasks, they're a generation apart and have various small differences that may be quite relevant for individual tasks (such as V100 allowing twice the shared memory - 96 KiB - per thread block)
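To put those peak numbers side by side, here is a quick Python sketch. It only divides the figures quoted above, reading the V100's ~15 TFLOPS as its fp32 rate; the dictionary and variable names are made up for illustration, and peak TFLOPS are an upper bound, not a measured training speedup.

```python
# Peak throughput figures as quoted in this thread (TFLOPS, approximate).
GTX_1080_FP32 = 9.0

V100_PEAK = {
    "fp32": 15.0,          # assumption: the ~15 TFLOPS figure is the fp32 rate
    "fp16": 30.0,
    "tensor cores": 120.0,
}

# Ratio of each V100 mode to the GTX 1080's fp32 peak.
ratios = {mode: tflops / GTX_1080_FP32 for mode, tflops in V100_PEAK.items()}

for mode, ratio in ratios.items():
    print(f"V100 {mode}: {ratio:.1f}x the GTX 1080 fp32 peak")
```

On paper that is well under 2x for plain fp32, which fits the reports above that real workloads often see much less than the headline tensor-core number suggests.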
I bought from a (relatively) small German commerce site[1] rather than a bigger site like Amazon, OCUK, or Scan. I'm in EU though, probably doesn't help if you're US. I think I paid a €50 or so premium over the retail price but I didn't mind that too much.
I used this[2] site to keep an eye open for stock, as you can see it's pretty much empty now but I just checked every day and finally found one.
If you properly utilize your hardware, on premise (or colocation in an area with cheap electricity prices) is vastly cheaper and will likely continue to be for a while. I don't see how training models in the cloud makes financial sense for organizations that can utilize their hardware 24/7.
For all others with burst workloads training in the cloud can make sense, but that has been the case for a while already.
We're not talking about organizations, though. I don't agree with your premise, either. People aren't training models 24/7, so the idea that it's "vastly cheaper and will continue to be for a while" isn't true.
... uh, you sure about that? Let me go check on the 3 models I have concurrently training for my organization on 3 separate GPU servers (all 2 year old hardware to boot) that have been running continuously for the past 36 hours. It pretty much works out to 24/7 training for the past several months.
And BTW, this is massively cheaper for us than training in the cloud.
Instead of arguing back and forth, how about a test case instead?
Pretraining BERT takes 44 minutes on 1024 V100 GPUs [1]
This requires dedicated instances, since shared instances won't be able to get to peak performance if only because of the "noisy neighbour"-effect.
At GCP, a V100 costs $2.48/h [2], so Microsoft's experiment would've cost $2,539.52 (1024 GPUs × $2.48, i.e. with the 44 minutes billed as a full hour per GPU).
Smaller providers offer the same GPU at just $1.375/h [3], so a reasonable lower limit would be around $1,408.
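For anyone who wants to plug in their own prices, the rental figures can be reproduced with a few lines of Python. This is a rough sketch: `rental_cost` is a made-up helper, and the default of rounding up to whole hours is an assumption chosen because it matches the numbers above (1024 GPU-hours); actual billing granularity depends on the provider.

```python
import math


def rental_cost(gpus, price_per_gpu_hour, minutes, round_up_to_hour=True):
    """Total rental cost in dollars for `gpus` GPUs running for `minutes`."""
    hours = minutes / 60
    if round_up_to_hour:
        # Assumption: each GPU is billed in whole hours.
        hours = math.ceil(hours)
    return gpus * price_per_gpu_hour * hours


gcp = rental_cost(1024, 2.48, 44)
small_provider = rental_cost(1024, 1.375, 44)
gcp_per_minute = rental_cost(1024, 2.48, 44, round_up_to_hour=False)

print(f"GCP, whole hours:            ${gcp:,.2f}")             # $2,539.52
print(f"Small provider, whole hours: ${small_provider:,.2f}")  # $1,408.00
print(f"GCP, billed per minute:      ${gcp_per_minute:,.2f}")
```

With per-minute billing the totals drop by roughly a quarter, so the whole-hour numbers are on the pessimistic side for the cloud.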
For a single BERT pretraining, provided highly optimised workflows and distributed training scripts are already at hand, renting a GPU for single training tasks seems to be the way to go.
The cost of V100-equivalent end-user hardware (we don't need to run in a datacentre, dedicated workstations will do) is about $6,000 (e.g. a Quadro RTX 6000), provided you don't need double precision. The card will have equal FP32 performance, a lower TGP, and VRAM that sits between the 16 GB and 32 GB versions of the V100.
Workstation hardware to go with such a card will cost about $2,000, so $8,000 is a reasonable cost estimate.
The cost of electricity varies between regions, but in the EU the average non-household price is about 0.13€/kWh [4].
Pretraining BERT therefore costs an estimated 1024 h * 0.13€/kWh * 0.5 kW ≈ 57€ in electricity (power consumption estimated from the TGP plus the typical power consumption of an Intel Xeon workstation, from my own measurements when training models).
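The electricity estimate is just energy times price, so it is easy to redo for a different region or power draw. A small sketch in Python (`electricity_cost_eur` is an illustrative name, not from any library), using the three factors exactly as stated:

```python
def electricity_cost_eur(hours, power_kw, price_eur_per_kwh):
    """Energy used (kWh = hours * kW) times the price per kWh, in euros."""
    return hours * power_kw * price_eur_per_kwh


cost = electricity_cost_eur(hours=1024, power_kw=0.5, price_eur_per_kwh=0.13)
print(f"{cost:.2f} EUR")  # 66.56 EUR
```

Multiplied out exactly as written, the factors come to about 66.6 €, a little above the ≈ 57 € quoted, so the estimate is best read as a ballpark. Either way, it is tiny compared to the rental cost.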
To find the break-even point, solve t * $1,408 = $8,000 + t * $69, which gives t = 8,000 / (1,408 - 69) ≈ 5.97, i.e. t > 5.
In short, if you pretrain BERT 6 times, you save money by BUYING a workstation and running it locally over renting cloud GPUs from a reasonably cheap provider.
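The break-even estimate is easy to replay for other prices. A minimal sketch in Python, taking the thread's figures as given ($8,000 hardware, $1,408 rent per run, $69 electricity per run); `break_even_runs` and `runs_until_cheaper` are illustrative names, and the model ignores resale value, maintenance, and how long a single workstation would actually take per run.

```python
import math


def break_even_runs(hardware_cost, rent_per_run, electricity_per_run):
    """Solve t * rent = hardware + t * electricity for t."""
    saving_per_run = rent_per_run - electricity_per_run
    if saving_per_run <= 0:
        raise ValueError("renting is not more expensive per run; no break-even")
    return hardware_cost / saving_per_run


def runs_until_cheaper(hardware_cost, rent_per_run, electricity_per_run):
    """Smallest whole number of runs at which buying is strictly cheaper."""
    t = break_even_runs(hardware_cost, rent_per_run, electricity_per_run)
    return math.floor(t) + 1


t = break_even_runs(8000, 1408, 69)
print(f"{t:.2f}")                          # 5.97
print(runs_until_cheaper(8000, 1408, 69))  # 6
```

Swapping in the GCP price of $2,539.52 per run instead of $1,408 brings the break-even down to 4 runs.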
This example only concerns BERT, but you can use the same reasoning for any model that you know the required compute time and VRAM requirements of.
This only concerns training, too - inference is a whole different can of worms entirely.
The special thing about the V100 is that its driver EULA allows data center usage. If you don't need that, there are other much cheaper options.