| HN Mirror



	by elmarhaussmann 2980 days ago
	Hi, author here. The motivation for this article came out of the HN discussion on a previous post (https://news.ycombinator.com/item?id=16447096). There was a lot of valuable feedback - thanks for that. Happy to answer questions!

8 comments

puzzle 2980 days ago

Don't TPUs get sustained use discounts? I know they're not preemptible. That would be comparable to AWS reserved instances.

EDIT: you don't get sustained use discounts, either, at the moment. You can get either for GCP GPUs, though. Perhaps that will change once TPUs are out of beta?


sdenton4 2980 days ago

"As shown above, the top-1 accuracy after 90 epochs for the TPU implementation is 0.7% better. This may seem minor, but making improvements at this already very high level is extremely difficult and, depending on the application, such small improvements may make a big difference in the end."

Any idea of how much variation in accuracy you get on different training runs of the same model on the same hardware? My understanding is that model quality can and does vary from one run to the next on these kinds of large datasets - from a single observation, it's hard to know if the difference is real or noise.


antognini 2980 days ago

I've been running a lot of these resnet-50 experiments lately and the run-to-run variation is very small, on the order of 0.1%. It's actually pretty amazing how consistent training is given that the initialization is always different and the data is sampled differently on each run. (As an aside, it took us about three weeks to track down a bug that was causing the model to consistently reach an accuracy 1% lower than it was supposed to.)
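To make the noise question concrete, here is a minimal sketch of comparing an accuracy gap with run-to-run variation. The accuracy lists are hypothetical placeholders, not measurements from the article or from this thread, and `gap_in_noise_units` is a name made up for illustration.

```python
import statistics

def gap_in_noise_units(runs_a, runs_b):
    """Compare the gap between two sets of runs with their run-to-run noise.

    Returns (gap, standard_error, ratio), where ratio = gap / standard_error.
    """
    gap = statistics.mean(runs_a) - statistics.mean(runs_b)
    # Sample standard deviation of each set of repeated runs.
    sd_a = statistics.stdev(runs_a)
    sd_b = statistics.stdev(runs_b)
    # Standard error of the difference between the two means.
    standard_error = (sd_a ** 2 / len(runs_a) + sd_b ** 2 / len(runs_b)) ** 0.5
    return gap, standard_error, gap / standard_error

# Hypothetical top-1 accuracies (%) from three runs per setup.
setup_a = [76.4, 76.5, 76.6]
setup_b = [75.7, 75.8, 75.9]
gap, se, ratio = gap_in_noise_units(setup_a, setup_b)
print(f"gap={gap:.2f}  standard error={se:.3f}  ratio={ratio:.1f}")
```

With a spread of about 0.1% per run, a 0.7% gap would be several standard errors wide. A single run per setup cannot show that on its own, which is the point of the question above.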


elmarhaussmann 2980 days ago

Indeed, that's also my experience. ImageNet is pretty huge (although 'it's the new MNIST'), so that seems to help training converge to very similar solutions and accuracies.

Tracking down bugs in convergence is really costly in these settings. We had a problem in pre-processing that took us quite a while to figure out...


TheLoneAdmin 2980 days ago

AMD - Where does their hardware stand in the race for ML? What changes would AMD need to make to be competitive?


my123 2980 days ago

Their hardware is fine. Their software is starting to get good too now. They're finishing MIOpen, a set of CUDA compatible libraries with which you can use Tensorflow (TF uses the builtin CUDA libs too, not only CUDA itself, as does CNTK). ROCm provides a CUDA implementation for AMD systems.


shaklee3 2980 days ago

Their hardware doesn't have the equivalent of a tensor core as far as I know, so they would be way behind on these benchmarks.


shaklee3 2980 days ago

Nice work. I've only seen anecdotal stories about how TPU is faster, but never something as detailed as this.


kbob 2979 days ago

I am not an ML guy, so I'm asking from a position of ignorance. (-:

But what's going on when some of the implementations of a standard algorithm don't converge, and different hardware has different accuracy rates on the same algorithm? Are DNNs really that flaky? And does it really make sense to be doing performance comparisons when the accuracy performance doesn't match?

Is the root problem that ResNet-50 works best with a smaller batch size?

And how do you do meaningful research into new DNNs if there's always a "Maybe if I ran it again over there I'd get better results" factor?

Thank you.


MrBuddyCasino 2980 days ago

I found it interesting that they are so close together in performance - I mean what are the odds that they end up within 2% of each other?


jacksmith21006 2980 days ago

The TPUs are doing almost 2x the images for the same cost.

That is not all that close, is it?
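The two comments are talking about different metrics: raw throughput versus throughput per dollar. A minimal sketch of the cost normalization follows. The throughput and price figures are hypothetical placeholders, not numbers from the article, and `images_per_dollar` is an illustrative name.

```python
def images_per_dollar(images_per_second, price_per_hour):
    """Images processed per dollar at a given throughput and hourly price."""
    return images_per_second * 3600.0 / price_per_hour

# Hypothetical example: two setups with nearly equal raw throughput
# but very different hourly prices.
setup_a = images_per_dollar(images_per_second=1000, price_per_hour=4.0)
setup_b = images_per_dollar(images_per_second=1020, price_per_hour=8.0)
print(f"setup A: {setup_a:,.0f} images/$   setup B: {setup_b:,.0f} images/$")
print(f"cost-efficiency ratio A/B: {setup_a / setup_b:.2f}")
```

Under these made-up numbers the raw speeds differ by 2%, yet one setup processes almost twice as many images per dollar, so both statements can hold at once.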


elmarhaussmann 2980 days ago

Yeah, pretty big coincidence. However, this may change with the next TensorFlow versions, which supposedly bring further speed improvements for the TPUv2.

Note also that the ~2% performance difference was measured on only one model (ResNet-50) and cannot be generalized to all workloads or all of deep learning (at least not without further proof).


Jabbles 2980 days ago

Do you have more information about this bit?

"the TPU implementation applies very compute-intensive image pre-processing steps and actually sacrifices raw throughput"

Thanks


elmarhaussmann 2980 days ago

In general, you try to keep the TPU/GPU 100% busy, so enough data needs to be readily accessible at any point in time. In this example, images need to be read from disk, decoded, and transformed (cropped, resized, normalized, etc.) before they can be fed to the TPU. The transformations can be computationally intensive, so they actually become a bottleneck.

In terms of how much compute power the TPU pre-processing needs, I only have very rough numbers: I ran the same pre-processing while training ResNet-50 on a node with 4 GPUs, and it was consistently utilizing more than 22 CPU cores (that figure includes all of the other CPU tasks running during training).
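As a rough illustration of why pre-processing can become the bottleneck, here is a back-of-the-envelope sketch. The per-image CPU cost and the accelerator throughput are hypothetical placeholders rather than measured values, and both function names are made up for this example.

```python
import math

def cores_needed(images_per_second, cpu_ms_per_image):
    """CPU cores needed to pre-process images as fast as the accelerator consumes them."""
    return math.ceil(images_per_second * cpu_ms_per_image / 1000.0)

def pipeline_throughput(accelerator_ips, cores, cpu_ms_per_image):
    """End-to-end images/second: the slower of pre-processing and the accelerator."""
    preprocess_ips = cores * 1000.0 / cpu_ms_per_image
    return min(accelerator_ips, preprocess_ips)

# Hypothetical: the accelerator consumes 2000 images/s and each image
# costs 10 ms of CPU time to read, decode, and transform.
print(cores_needed(2000, 10))             # cores required to keep up
print(pipeline_throughput(2000, 8, 10))   # CPU-bound with only 8 cores
print(pipeline_throughput(2000, 32, 10))  # accelerator-bound with 32 cores
```

This assumes pre-processing scales linearly across cores and ignores disk and network limits, so treat it as a lower bound on the CPU budget rather than a prediction.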


pakl 2980 days ago

What about your LSTM-based model that didn’t converge in your earlier TPU benchmarks in February?
