by gchamonlive 2171 days ago
I remember this article from 2018: https://medium.com/the-mission/why-building-your-own-deep-le...

Hackernews discussion for the article: https://news.ycombinator.com/item?id=18063893

It really is interesting how this is changing the dynamics of neural network training. Now it is affordable to train a useful network on the cloud, whereas 2 years ago that would have been reserved for companies with either bigger investments or an already consolidated product.


> Now it is affordable to train a useful network on the cloud

I honestly don't see how anything changed significantly in the past 2 years. Benchmarks indicate that a V100 has barely 2x the performance of an RTX 2080 Ti [1] and a V100 is

• $2.50/h at Google [2]

• $13.46/h (4xV100) at Microsoft Azure [3]

• $12.24/h (4xV100) at AWS [4]

• ~$2.80/h (2xV100, 1 month) at LeaderGPU [5]

• ~$3.38/h (4xV100, 1 month) at Exoscale [6]

Other smaller cloud providers are in a similar price range to [5] and [6] (read: GCE, Azure and AWS are way overpriced...).
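
To make the quotes above comparable, here is a minimal sketch that normalizes each one to a price per V100-hour. It assumes every quoted price covers the whole instance (all GPUs in the bundle), which the comment does not state explicitly; the dictionary and function names are made up for illustration.

```python
# Hourly prices quoted in the comment, paired with the number of V100s
# each price is assumed to cover (assumption: price is per instance).
QUOTES = {
    "Google": (2.50, 1),
    "Azure": (13.46, 4),
    "AWS": (12.24, 4),
    "LeaderGPU": (2.80, 2),
    "Exoscale": (3.38, 4),
}


def per_gpu_hourly(quotes):
    """Return {provider: USD per single V100-hour}."""
    return {name: price / gpus for name, (price, gpus) in quotes.items()}


rates = per_gpu_hourly(QUOTES)

# Print providers from cheapest to most expensive per V100-hour.
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name:10s} {rate:.3f} USD per V100-hour")
```

Under that assumption, the two smaller providers come out well below the three large clouds per GPU, which is the point the comment makes.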

Using the 2x figure from [1] and adjusting the price for the build to a 2080 Ti and an AMD R9 3950X instead of the TR results in similar figures to the article you provided.
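
The break-even reasoning can be sketched roughly as follows. The 2x speed ratio and the $2.50/h rate come from the comment; the build cost is a hypothetical placeholder (the thread gives no figure), and electricity, resale value and idle time are ignored.

```python
def break_even_hours(build_cost_usd, v100_rate_usd_per_hour, v100_speedup=2.0):
    """Hours of local training after which a self-built machine has paid for itself.

    v100_speedup: how many times faster a V100 is than the local GPU
    (2.0 follows the benchmark cited as [1]).
    """
    if build_cost_usd < 0 or v100_rate_usd_per_hour <= 0 or v100_speedup <= 0:
        raise ValueError("costs and speedup must be positive")
    # One local GPU-hour does 1/v100_speedup of the work of a V100-hour,
    # so it replaces that fraction of the cloud bill.
    cloud_cost_per_local_hour = v100_rate_usd_per_hour / v100_speedup
    return build_cost_usd / cloud_cost_per_local_hour


# Hypothetical build cost, NOT taken from the thread or the article.
HYPOTHETICAL_BUILD_COST_USD = 3000.0

hours = break_even_hours(HYPOTHETICAL_BUILD_COST_USD, 2.50)
print(f"Break-even after {hours:.0f} local GPU-hours "
      f"({hours / 24:.0f} days of continuous training)")
```

Swapping in a cheaper per-GPU cloud rate pushes the break-even point out proportionally.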

Please point me to any resources that show how the content of the article doesn't apply anymore, 2 years later. I'd be very interested to learn what actually changed (if anything).

NVIDIA's new A100 platform might be a game changer, but it's not yet available in public cloud offerings.

[1] https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v...

[2] https://cloud.google.com/compute/gpus-pricing

[3] https://azure.microsoft.com/en-us/pricing/details/virtual-ma...

[4] https://aws.amazon.com/ec2/pricing/on-demand/

[5] https://www.leadergpu.com/#chose-best

[6] https://www.exoscale.com/gpu/

You are missing TPU and spot/preemptible pricing, which need to be considered when we are talking about training cost. The big one to me is the ability to consistently train on V100s with spot pricing, which was not possible a couple of years ago (there wasn't enough spare capacity). Also, the improvement in cloud bandwidth for DL-type instances has helped distributed training a lot.

Nothing really has changed in the last two years in terms of training cost. I think the author is making unreasonable extrapolations based on changes in performance on the Dawn benchmarks. A lot of the results are fast but require a lot more compute / search time to find the best parameters and training regimen that lead to those fast convergence times. (Learning rate schedule, batch size, image size schedules, etc.) The point being that once the juice is squeezed out you aren’t going to continue to see training convergence time improvements on the same hardware.

Also, because you cited our GPU benchmarks, I wanted to throw in a mention of our GPU instances, which have some of the lowest training costs on the Stanford Dawn benchmarks discussed in the article.

https://lambdalabs.com/service/gpu-cloud

Another data point:

"For example, we recently internally benchmarked an Inferentia instance (inf1.2xlarge) against a GPU instance with an almost identical spot price (g4dn.xlarge) and found that, when serving the same ResNet50 model on Cortex, the Inferentia instance offered a more than 4x speedup."

https://towardsdatascience.com/why-every-company-will-have-m...

That data point is about inference though, and nobody's disputing that deployment and inference have improved significantly over the past years.

I'm referring to training and fine-tuning, not inference, which - let's be honest - can be done on a phone these days.

I don't really know if the hardware breakthroughs that the article refers to are already reflected in cloud GPU performance, but the software advances are reflected regardless. So even though pricing has fluctuated only marginally since 2018, it is just plain faster to train a neural network today because of software advances, from what I understood.

But that's not what the actual data says.

Here are some figures from an actual benchmark [1] w.r.t. training costs:

1. [Mar 2020] $7.43 (AlibabaCloud, 8xV100, TF v2.1)

2. [Sep 2018] $12.60 (Google, 8 TPU cores, TF v1.11)

3. [Mar 2020] $14.42 (AlibabaCloud, 128xV100, TF v2.1)

--

Training time didn't go down exponentially either [1]:

1. [Mar 2020] 0:02:38 (AlibabaCloud, 128 x V100, TF v2.1)

2. [May 2019] 0:02:43 (Huawei Cloud, 128 x V100, TF v1.13)

3. [Dec 2018] 0:09:22 (Huawei Cloud, 128 x V100, MXNet)

So again, I have to ask where exactly these magical improvements occur (regarding training - inference is another matter entirely, I understand that). I've yet to find a source that supports 4x to 10x cost reductions.

[1] https://dawn.cs.stanford.edu/benchmark/index.html
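
For a rough sense of scale, the improvement factors implied by the entries listed above can be computed directly. This is only a sketch on the numbers quoted in this comment; the cost entries compare different hardware (TPU cores vs. V100s), so the ratio is indicative at best. The helper name is made up for illustration.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Training cost: Sep 2018 entry vs. cheapest Mar 2020 entry (USD).
cost_sep_2018 = 12.60
cost_mar_2020 = 7.43
cost_ratio = cost_sep_2018 / cost_mar_2020

# Training time: Dec 2018 entry vs. fastest Mar 2020 entry.
time_dec_2018 = to_seconds("0:09:22")
time_mar_2020 = to_seconds("0:02:38")
time_ratio = time_dec_2018 / time_mar_2020

print(f"Cost reduction: {cost_ratio:.2f}x")
print(f"Time reduction: {time_ratio:.2f}x")
```

On these figures the cost improvement is well under 2x, and the time improvement between the May 2019 and Mar 2020 entries is only a few seconds.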

I guess I should have been more skeptical of the article's figures. But still, if we give it the benefit of the doubt, is there any scenario in which we might see the reduction mentioned? 1000 to 10 USD?

The scenario is indeed there - if you take early 2017 numbers and restrict yourself to AWS/Google/Azure and outdated hardware and software, you can get to the US$1000 figure.

Likewise, if your other point of comparison is late 2019 AlibabaCloud spot pricing, you can get to US$10 for the same task.

Realistically, though, that's worst case 2017 vs best case 2019/2020. So sure, you can get to that if you choose your numbers carefully.

They basically compared results from H/W that even in 2017 was 2 generations behind with the latest H/W. So yeah - between 2015 and 2019 we indeed saw a cost reduction from ~1000 to ~10 USD (on the major cloud provider vs best offer today scale).

I only take issue with the assumption that the trend continues this way, which it doesn't seem to.
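
To put the "trend" in numbers, one can annualize the quoted drop. This is a back-of-the-envelope sketch that assumes a constant yearly rate of improvement and takes the 2015 to 2019 span (4 years) from the comment above; the function name is made up for illustration.

```python
def annual_reduction_factor(cost_start, cost_end, years):
    """Constant per-year factor f such that cost_start / f**years == cost_end."""
    if cost_start <= 0 or cost_end <= 0 or years <= 0:
        raise ValueError("costs and years must be positive")
    return (cost_start / cost_end) ** (1.0 / years)


# ~1000 USD (2015) down to ~10 USD (2019), as stated in the comment.
factor = annual_reduction_factor(1000.0, 10.0, 4)
print(f"Implied cost reduction: {factor:.2f}x per year")
```

A sustained trend would need roughly that factor every year, which is far more than the benchmark entries from 2018 to 2020 quoted earlier in the thread show.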

I trained a useful neural network and prototyped a viable [failed] startup technology something like 4 years ago on a 1080 Ti with a mid-range CPU. It was enough to get me meetings with a couple of the largest companies in the world.

Yeah, it took 12-24 hours to do what I could log in to AWS and accomplish in minutes with parallel GPUs... but practical solutions were already within reach. The primary changes now are the buzz and a possibly unprecedented rate of research progress.