by lostdog 2078 days ago
This is such a great post. It really shows how much room for improvement there is in all released deep learning code. Almost none of the open source work is really production ready for fast inference, and tuning the systems requires a good working knowledge of the GPU.

The article does skip the most important step for getting great inference speeds: Drop Python and move fully into C++.

6 comments

I'd alter your conclusion that open source work isn't production ready. As long as it works as described, it is production ready for at least some subset of use cases. There's just a lot of low hanging fruit re: performance improvement.

It's entirely valid to trade performance for a more straightforward design or shorter development time and just throw hardware at the problem as needed. Companies do it all the time.

Author here. I really appreciate your feedback.

Completely agree that almost none of the SoTA GitHub repos are really ready for production and making this stuff work can be pretty hard.

Getting this done in C++ and moving up to the next level of performance is the focus of my next article :)

C++ or .NET or Rust or Go, whatever. Almost anything can get the performance you want except Python.

Too bad such great ecosystems evolved around a language that can’t fully utilize the amazing hardware we have today.

> Drop Python and move fully into C++.

Do you have any experience with that?

Yes (though the details are private).

All the deep learning libraries are Python wrappers around C/C++ (which then call into CUDA). If you call the C++ layers directly, you have control over the memory operations applied to your data. The biggest wins come from reducing the number of copies, reducing the number of transfers between CPU and GPU memory, and speeding up operations by moving them from the CPU to the GPU (or vice versa).

This is basically what the article does, but if you want to squeeze out all the performance, the Python layer is still an abstraction that gets in the way of directly choosing what happens to the memory.

There are lots of cases where people use e.g. ROS on robots and Python to do inference, which basically converts ROS binary image message data into a Python list of bytes (ugh), then converts that into numpy (ugh), and then feeds that into TensorFlow to do inference. This pipeline is extremely sub-optimal, but it's what most people probably do.
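To make the copy problem concrete, here is a minimal sketch using only the Python standard library. The `payload` buffer is a hypothetical stand-in for the raw pixel data of an image message (it is not a real ROS type), and the frame size is an assumption chosen for illustration. It contrasts the "list of bytes" path, which copies every byte into a Python object slot, with a `memoryview`, which shares the original buffer.

```python
import sys

# Hypothetical stand-in for the raw pixel payload of an image message:
# a 640x480 RGB frame, 3 bytes per pixel (sizes are illustrative).
WIDTH, HEIGHT, CHANNELS = 640, 480, 3
payload = bytearray(WIDTH * HEIGHT * CHANNELS)

# Copying path: every byte becomes a separate entry in a Python list.
as_list = list(payload)

# Zero-copy path: a memoryview exposes the same underlying buffer.
view = memoryview(payload)

# Write through the original buffer to see which path made a copy.
payload[0] = 255
shares_memory = view[0] == 255   # the view sees the write
list_is_copy = as_list[0] == 0   # the list still holds the old value

# The list object alone (its array of references) is larger than the
# raw payload it was built from.
list_bytes = sys.getsizeof(as_list)
print(f"payload: {len(payload)} bytes, list container: {list_bytes} bytes")
```

If numpy is in the pipeline anyway, `numpy.frombuffer` can wrap a buffer like this without copying, which skips the list step entirely. Whether a given message type exposes its data as such a buffer depends on the binding, so treat this as a sketch of the idea rather than a drop-in fix.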

All because nobody has really provided off the shelf usable deployment libraries. That Bazel stuff if you want to use the C++ API? Big nope. Way too cumbersome. You're trying to move from Python to C++ and they want you to install ... Java? WTF?

Also, some of the best neural net research out there has you run "./run_inference.sh" or some other abomination of a Jupyter notebook instead of an installable, deployable library. To their credit, good neural net engineers aren't expected to be good software engineers, but I'm just pointing out that there's a big gap between good neural nets and deployable neural nets.

I could see this working for the evaluation which basically just glues OpenCV video reading with TensorFlow to extract a handful of parameters per frame. The rest could stay in Python.

Do you have experience with how single frame processing compares between Python and C++? I see that batched processing in Python gives me a huge speed boost which hints at inefficiencies at some point but I don't know if those are related to Python, TensorFlow or CUDA itself. (Or just bad resource management that requires re-initialization of some costly things in between evaluations.)

The fact that batching is faster does not inherently imply some sort of inefficiency, but rather is indicative of the fact that sequential memory access is faster than random.

I am curious what the basis behind the idea that Python is the performance bottleneck for inference is.

It's not that Python is by definition much slower than C++. Rather, doing inference in C++ makes it much easier to control exactly when memory is initialised, copied and moved between CPU and GPU. Especially on frame-by-frame models like object detection, this can make a big difference. Also, the GIL can be a real problem if you are trying to scale inference across multiple incoming video streams, for example.

Control is probably the main point. The Python interface makes things easy but doesn't offer enough control for my case. I tested it with a cut-down example (no video decoding, no funny stuff) and it all comes down to the batch size that is passed to model.predict. Large batches level out at around 10000 fps depending on the GPU, and batch size 1 goes down to 200 fps independent of the GPU. This tells me that some kind of overhead (hidden to me) is slowing things down. I guess that I have to go much deeper into the internals of TF to find out more. So far I have not, because it's a large time sink that only offers better performance in a part that is not super critical right now.
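One way to read those two numbers is a simple fixed-overhead model: each call costs a constant overhead plus a per-frame cost. That model is an assumption, not something measured here, and it treats the large-batch figure as the asymptotic per-frame rate. Under it, 200 fps at batch size 1 means 5 ms per call, 10000 fps means about 0.1 ms per frame, so roughly 4.9 ms per call would be overhead. A minimal sketch (function names are hypothetical):

```python
def fit_overhead_model(fps_batch1, fps_large_batch):
    """Estimate (overhead_s, per_item_s) from two throughput numbers.

    Assumes time_per_call = overhead + batch_size * per_item, and that
    the large-batch throughput is close to the 1 / per_item limit.
    """
    per_item = 1.0 / fps_large_batch
    overhead = 1.0 / fps_batch1 - per_item
    return overhead, per_item


def predicted_fps(batch_size, overhead, per_item):
    """Throughput the model predicts for a given batch size."""
    return batch_size / (overhead + batch_size * per_item)


# Numbers quoted in the comment above.
overhead, per_item = fit_overhead_model(200.0, 10000.0)
print(f"overhead per call: {overhead * 1e3:.1f} ms")
print(f"cost per frame:    {per_item * 1e3:.1f} ms")
for batch in (1, 8, 64, 512):
    print(batch, round(predicted_fps(batch, overhead, per_item)))
```

A per-call cost that stays the same across GPUs would be consistent with host-side overhead, but the model cannot say whether that time is spent in Python, in TF, or in kernel launches.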

The GIL and slowness of Python become a problem when processing multiple streams or doing further time consuming calculations in Python.
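A quick way to check whether the GIL is the limit for Python-side work, without any extra tooling, is to run the same CPU-bound function in one thread and then in several. This is a minimal sketch: `busy_work` is a hypothetical stand-in for per-frame processing done in pure Python, and the loop size is arbitrary. On a standard CPython build with the GIL, the multi-threaded run would be expected to take roughly as long as doing the work serially. Timings vary by machine, so no figures are claimed here.

```python
import threading
import time


def busy_work(n):
    """Pure-Python loop standing in for per-frame CPU work."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_in_threads(num_threads, n):
    """Run busy_work(n) once per thread and return wall-clock seconds."""
    threads = [
        threading.Thread(target=busy_work, args=(n,))
        for _ in range(num_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


N = 200_000  # arbitrary amount of work per "stream"
t1 = run_in_threads(1, N)
t4 = run_in_threads(4, N)
print(f"1 thread:  {t1:.3f} s")
print(f"4 threads: {t4:.3f} s (4x the work)")
```

If four threads take about four times as long as one, the threads are serialising on the interpreter. Work that releases the GIL, such as most calls into native inference libraries, will not show this pattern.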

It depends, e.g. if you are moving data from memory into a Python data structure and then sending it to the GPU you will have a huge performance bottleneck in loading the data into Python.

Done that since 2015. You can look at https://github.com/jolibrain/deepdetect. C++ doesn't sound ideal to many, but when your target is production, it's pretty powerful, and since C++11 probably much more comfortable than most non-practitioners think. For deep learning, it is excellent for bare metal and fitting with industrial applications. Never looked back. For R&D (GANs, flows, RL, ...) Python remains easier to play with.

Funny how blaming the GIL for being a bottleneck is the least researched part of the article, not backed by any before/after performance measurement. Everyone loves to hate the GIL. Maybe there should be T-shirts made for this for the C++-loving folks out there.

I looked into the GIL saturation as measured with gil_load (https://github.com/chrisjbillington/gil_load), but perhaps I should have included more numbers here.

To me, seeing the GIL held for 40% of time and significant time spent waiting on GIL by other threads was a fairly strong indicator. Keen to hear your thoughts/experience on it.

All I see as the main insight of this article is that you shouldn't use PyTorch Hub as a baseline for inference speed.

I know a number of Python frameworks (e.g. Detectron) that are fast.

I'd like to see the evidence that the performance bottleneck is Python, esp. when asynchronous dispatch exists.