by qayxc 2171 days ago
> Now it is affordable to train a useful network on the cloud

I honestly don't see how anything has changed significantly in the past 2 years. Benchmarks indicate that a V100 has barely 2x the performance of an RTX 2080 Ti [1], and a V100 costs:

• $2.50/h at Google [2]

• $13.46/h (4xV100) at Microsoft Azure [3]

• $12.24/h (4xV100) at AWS [4]

• ~$2.80/h (2xV100, 1 month) at LeaderGPU [5]

• ~$3.38/h (4xV100, 1 month) at Exoscale [6]

Other smaller cloud providers are in a similar price range to [5] and [6] (read: GCE, Azure and AWS are way overpriced...).
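
To make the quotes above comparable, here is a rough sketch that divides each hourly price by the number of V100s it covers. The numbers are the ones listed above; treating the Google price as a single-V100 price is my assumption (the list doesn't say), and the LeaderGPU and Exoscale figures are approximate rates for a 1-month rental, so read the result as a ballpark only.

```python
# Hourly prices as quoted in the list above: (provider, USD per hour, V100 count).
# Assumption: the Google figure covers a single V100; the list does not say.
QUOTES = [
    ("Google", 2.50, 1),
    ("Azure", 13.46, 4),
    ("AWS", 12.24, 4),
    ("LeaderGPU", 2.80, 2),   # approximate, 1-month rental
    ("Exoscale", 3.38, 4),    # approximate, 1-month rental
]


def per_gpu_hour(quotes):
    """Return {provider: USD per V100-hour}, ordered from cheapest to most expensive."""
    rates = {name: price / gpus for name, price, gpus in quotes}
    return dict(sorted(rates.items(), key=lambda item: item[1]))


if __name__ == "__main__":
    for name, rate in per_gpu_hour(QUOTES).items():
        print(f"{name:10s} ${rate:.2f} per V100-hour")
```

Under that assumption the two smaller providers come out well below the three big clouds per V100-hour, which is the point of the "overpriced" remark.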

Using the 2x figure from [1] and adjusting the price for the build to a 2080 Ti and an AMD R9 3950X instead of the TR results in similar figures to the article you provided.

Please point me to any resources that show how the content of the article doesn't apply anymore, 2 years later. I'd be very interested to learn what actually changed (if anything).

NVIDIA's new A100 platform might be a game changer, but it's not yet available in public cloud offerings.

[1] https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v...

[2] https://cloud.google.com/compute/gpus-pricing

[3] https://azure.microsoft.com/en-us/pricing/details/virtual-ma...

[4] https://aws.amazon.com/ec2/pricing/on-demand/

[5] https://www.leadergpu.com/#chose-best

[6] https://www.exoscale.com/gpu/

4 comments

You are missing TPU and spot/preemptible pricing, which need to be considered when we are talking about training cost. The big one to me is the ability to consistently train on V100s with spot pricing, which was not possible a couple of years ago (there wasn't enough spare capacity). Also, the improvement in cloud bandwidth for DL-type instances has helped distributed training a lot.

Nothing really has changed in the last two years in terms of training cost. I think the author is making unreasonable extrapolations based on changes in performance on the Dawn benchmarks. A lot of the results are fast but require a lot more compute / search time to find the best parameters and training regimen that lead to those fast convergence times. (Learning rate schedule, batch size, image size schedules, etc.) The point being that once the juice is squeezed out you aren’t going to continue to see training convergence time improvements on the same hardware.

Also, because you cited our GPU benchmarks, I wanted to throw in a mention of our GPU instances, which have some of the lowest training costs on the Stanford Dawn Benchmarks discussed in the article.

https://lambdalabs.com/service/gpu-cloud

Another data point:

"For example, we recently internally benchmarked an Inferentia instance (inf1.2xlarge) against a GPU instance with an almost identical spot price (g4dn.xlarge) and found that, when serving the same ResNet50 model on Cortex, the Inferentia instance offered a more than 4x speedup."

https://towardsdatascience.com/why-every-company-will-have-m...

That data point talks about inference though, and nobody's disputing that deployment and inference have improved significantly over the past years.

I'm referring to training and fine-tuning, not inference, which - let's be honest - can be done on a phone these days.

I don't really know whether the hardware breakthroughs the article refers to are already reflected in cloud GPU performance, but the software advances are. So even though pricing has fluctuated only marginally since 2018, it is just plain faster to train a neural network today because of software advances, from what I understood.

But that's not what the actual data says.

Here are some figures from an actual benchmark [1] w.r.t. training costs:

1. [Mar 2020] $7.43 (AlibabaCloud, 8xV100, TF v2.1)

2. [Sep 2018] $12.60 (Google, 8 TPU cores, TF v1.11)

3. [Mar 2020] $14.42 (AlibabaCloud, 128xV100, TF v2.1)

--

Training time didn't go down exponentially either [1]:

1. [Mar 2020] 0:02:38 (AlibabaCloud, 128 x V100, TF v2.1)

2. [May 2019] 0:02:43 (Huawei Cloud, 128 x V100, TF v1.13)

3. [Dec 2018] 0:09:22 (Huawei Cloud, 128 x V100, MXNet)

So again, I have to ask where exactly these magical improvements occur (regarding training - inference is another matter entirely, I understand that). I've yet to find a source that supports 4x to 10x cost reductions.

[1] https://dawn.cs.stanford.edu/benchmark/index.html
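
For reference, the ratios behind those figures can be checked with a few lines. This is just a sketch using the numbers quoted above; the helper name `to_seconds` and the variable names are mine, and the cost comparison mixes different providers and hardware, so it is indicative at best.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' string such as '0:09:22' to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Training cost in USD, as quoted above (best Sep 2018 entry vs best Mar 2020 entry).
COST_USD = {"Sep 2018": 12.60, "Mar 2020": 7.43}

# Training time on 128 x V100, as quoted above.
TRAIN_TIME = {"Dec 2018": "0:09:22", "May 2019": "0:02:43", "Mar 2020": "0:02:38"}

# How many times cheaper the Mar 2020 entry is than the Sep 2018 entry.
cost_ratio = COST_USD["Sep 2018"] / COST_USD["Mar 2020"]

# Speedup over the whole period, and over the last ten months only.
speedup_total = to_seconds(TRAIN_TIME["Dec 2018"]) / to_seconds(TRAIN_TIME["Mar 2020"])
speedup_recent = to_seconds(TRAIN_TIME["May 2019"]) / to_seconds(TRAIN_TIME["Mar 2020"])

if __name__ == "__main__":
    print(f"cost reduction Sep 2018 -> Mar 2020: {cost_ratio:.2f}x")
    print(f"speedup Dec 2018 -> Mar 2020:        {speedup_total:.2f}x")
    print(f"speedup May 2019 -> Mar 2020:        {speedup_recent:.2f}x")
```

That works out to roughly 1.7x on cost and about 3.6x on time overall, with almost all of the speedup happening before May 2019 and only about 3% after it.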

I guess I should have been more skeptical of the article's figures. But still, if we give it the benefit of the doubt, is there any scenario in which we might see the reduction mentioned? 1000 to 10 USD?

The scenario is indeed there - if you take early 2017 numbers and restrict yourself to AWS/Google/Azure and outdated hardware and software, you can get to the US$1000 figure.

Likewise, if your other point of comparison is late 2019 AlibabaCloud spot pricing, you can get to US$10 for the same task.

Realistically, though, that's worst case 2017 vs best case 2019/2020. So sure, you can get to that if you choose your numbers carefully.

They basically compared results from H/W that even in 2017 was 2 generations behind with the latest H/W. So yeah - between 2015 and 2019 we indeed saw a cost reduction from ~1000 to ~10 USD (on the major cloud provider vs best offer today scale).

I only take issue with the assumption that the trend continues this way, which it doesn't seem to.
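
To put a number on "the trend doesn't continue", here is a small sketch that turns each cost drop into an average yearly reduction factor. The inputs are the ~1000 to ~10 USD drop mentioned above and the $12.60 (Sep 2018) vs $7.43 (Mar 2020) benchmark entries quoted earlier in the thread; counting 2015 to 2019 as 4 years and Sep 2018 to Mar 2020 as 1.5 years is my assumption, as is the function name.

```python
def annual_factor(start_cost, end_cost, years):
    """Average factor by which cost shrinks per year, assuming a constant rate."""
    return (start_cost / end_cost) ** (1.0 / years)


# Assumption: 2015 -> 2019 counted as 4 years.
long_run = annual_factor(1000.0, 10.0, 4.0)

# Assumption: Sep 2018 -> Mar 2020 counted as 1.5 years.
recent = annual_factor(12.60, 7.43, 1.5)

if __name__ == "__main__":
    print(f"2015-2019:          costs shrink ~{long_run:.2f}x per year")
    print(f"Sep 2018-Mar 2020:  costs shrink ~{recent:.2f}x per year")
```

With those assumptions the long-run figure is a bit over 3x per year, while the recent benchmark entries give only about 1.4x per year.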