by zak 2354 days ago
(I'm one of the Cloud TPU product leads)

We've seen multiple BERT-related PyTorch models training successfully on Cloud TPUs, including training at scale on large, distributed Cloud TPU Pod slices.

Would you consider filing a GitHub issue at https://github.com/pytorch/xla or emailing pytorch-tpu@googlegroups.com to provide a bit more context about the specific issue you encountered?

Here's the current PyTorch/TPU troubleshooting guide, which provides information on how to collect and interpret metrics that are very helpful for debugging: https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.m...

Thanks!

> BERT-related PyTorch models training successfully on Cloud TPUs

How do you see it? Do you look at your clients' code?

Google wrote BERT, and they provide technical support for the FB PyTorch TPU port, so it's not entirely surprising. RoBERTa (FB's variant) would be a good candidate to test it with.

We only see code when customers open-source it or otherwise explicitly share it with us. We are directly in touch with several customers who are using the PyTorch / TPU integration, so we hear feedback from them, and we also run a variety of open-source PyTorch models on Cloud TPUs ourselves as we continue to improve the integration.