Hacker News
by rtkaratekid 2447 days ago
I work at a small company as an engineer and recently was asked to do a project that would require some neural net magic. I had some experience with keras/tensorflow so that was my first choice.

Despite the absolute nightmare of getting it installed and running on a gpu, I managed it and had a fantastic model. It was doing so well that the company wanted to expand the project and build out a multi-gpu rig as part of it. So I get building that environment and install the latest CUDA, cuDNN, nvidia driver and use tensorflow 2.0 aaaaaand it wouldn't work. I actually spent a long time hacking on it till on a forum I read that it was just a bug that hadn't been fixed yet.

At this point I decided to see what Pytorch was like. In literally one day I installed everything and migrated my project completely over to pytorch. Same speed, same accuracy, works perfectly on a multi-gpu rig when I set it to. It was like a breath of fresh air.
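For readers wondering what "setting it to" multi-GPU can look like: the post doesn't say which mechanism was used, so this is only a minimal sketch of one common option, `torch.nn.DataParallel`. `TinyNet` and the tensor sizes are hypothetical stand-ins, not the original project's model. The script falls back to a single device when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn


# Hypothetical stand-in model; the original project's network is not shown.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(16, 32),
            nn.ReLU(),
            nn.Linear(32, 4),
        )

    def forward(self, x):
        return self.layers(x)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyNet().to(device)

# Wrap only when more than one GPU is visible. DataParallel splits each
# batch along dimension 0 across the GPUs and gathers the outputs.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

batch = torch.randn(8, 16, device=device)
output = model(batch)
print(output.shape)
```

The rest of a training loop stays the same after the wrap, which may be part of why the switch felt painless.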

The next day I wrote some C++ to import a saved PyTorch model so it could run in a deployment environment. The C++ API is also great. The docs are lacking a little bit, but a Facebook researcher mentioned to me on the forums that they're hoping to have it all done by next month.

It's unlikely that I'll be going back to tensorflow.

6 comments

When I used TensorFlow (briefly), it seemed there were tons of hidden assumptions my stuff had to follow or it wouldn’t work. PyTorch has a few I’ve run into, but mostly seems to “just work”. That’s why I think it’s much better for building anything novel (at least for the first time).
I prefer PyTorch too. For Tensorflow deployments, I have found Nix/NixOS work great. Arch is also good.
Could you provide a link or resources on how to run PyTorch models from C++?
https://pytorch.org/cppdocs/ This is a brief description of the API. It worked fine for us even for complex models.
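As a rough illustration of the Python half of that workflow (a sketch under assumptions, not necessarily what anyone in this thread did): a model can be exported to TorchScript with `torch.jit.trace` and saved to a file, which the C++ side can then open with `torch::jit::load`. `TinyNet`, the input shape, and the file name are made up for the example.

```python
import os
import tempfile

import torch
import torch.nn as nn


# Hypothetical stand-in model; substitute your own trained network.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(16, 32),
            nn.ReLU(),
            nn.Linear(32, 4),
        )

    def forward(self, x):
        return self.layers(x)


model = TinyNet().eval()
example_input = torch.randn(1, 16)

# Tracing records the operations run on the example input. Control flow
# that depends on the input data is not captured by a trace.
traced = torch.jit.trace(model, example_input)

# Save the TorchScript module; the path here is just a temp location.
model_path = os.path.join(tempfile.mkdtemp(), "tiny_net.pt")
traced.save(model_path)

# Sanity check: reload in Python and compare against the original model.
reloaded = torch.jit.load(model_path)
with torch.no_grad():
    same = torch.allclose(model(example_input), reloaded(example_input))
print(same)
```

If the model has data-dependent branches, `torch.jit.script` is the usual alternative to tracing.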
What was the bug?
Late reply, but it was a bug using XLA GPUs to add concurrency to the training process. Maybe someone figured it out or fixed it, but I've moved on already.
You shouldn’t need XLA for multi-GPU training. Have you tried training without it?
Regarding TensorFlow, you could use the Docker images and run everything in containers instead. They tend to work pretty well out of the box.
I looked into this but hadn't gotten to the point of using them when I made the switch. I may go back and try it out to see how it goes. Thanks for the encouragement.
The complexity of installing TensorFlow, even including custom compilation and hacking Bazel (to make it work with a CUDA version that it doesn't officially support), is low compared to releasing a model that works in production.

Because of that, it doesn't make much sense to judge a "differentiable programming" framework like TensorFlow or PyTorch by its ease of installation. It'd be like saying "I prefer C# over C++ because it is easier to install."

I did say I had a fantastic model with Tensorflow. I gave up after a while because I didn't have time to hack on that stuff. I wouldn't mind learning to and trying it out, but the nature of the small company meant I needed to find a solution sooner. Now I have comparable results with Pytorch and it's easier to work with. That's a win/win in my book.
Your original comment explicitly stated that you had difficulty with installation: "So I get building that environment and install the latest CUDA, cuDNN, nvidia driver and use tensorflow 2.0 aaaaaand it wouldn't work. I actually spent a long time hacking on it till on a forum I read that it was just a bug that hadn't been fixed yet."

I don't want to say anything encouraging or discouraging about TensorFlow, just that it doesn't make much sense to make a judgement based on the installation experience. Installing TensorFlow or PyTorch is a very small percentage of the man-hours compared to releasing a DNN to production.