by jcjohns 1064 days ago
I'm one of the authors of this CVPR paper -- cool to see our work mentioned on HN!

The Uber paper from 2018 is one that has been floating around in the back of my head for a while. Decoding DCT to RGB is essentially an 8x8 stride 8 convolution -- it seems wasteful to perform this operation on the CPU for data loading, then immediately pass the resulting decoded RGB into convolution layers that probably learn filters similar to those used during DCT decoding anyway.
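
To make the "8x8 stride 8 convolution" remark concrete, here is a minimal pure-Python sketch of the block transform only. It assumes an orthonormal DCT-II and a single-channel image whose sides are multiples of 8; real JPEG decoding also involves entropy decoding, dequantization, a level shift, chroma upsampling and YCbCr -> RGB conversion, none of which is shown. The function names are made up for illustration.

```python
import math

N = 8  # JPEG block size

def dct_basis(u, v):
    """8x8 filter for vertical frequency u and horizontal frequency v."""
    cu = math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)
    cv = math.sqrt(1 / N) if v == 0 else math.sqrt(2 / N)
    return [[cu * cv
             * math.cos((2 * y + 1) * u * math.pi / (2 * N))
             * math.cos((2 * x + 1) * v * math.pi / (2 * N))
             for x in range(N)] for y in range(N)]

# 64 filters, one per frequency pair: the kernel bank of the "convolution".
FILTERS = [dct_basis(u, v) for u in range(N) for v in range(N)]

def block_dct(image):
    """Forward DCT: an 8x8 kernel applied with stride 8, 64 output channels."""
    h, w = len(image), len(image[0])
    out = []
    for by in range(0, h, N):          # stride 8 vertically
        row = []
        for bx in range(0, w, N):      # stride 8 horizontally
            row.append([sum(f[y][x] * image[by + y][bx + x]
                            for y in range(N) for x in range(N))
                        for f in FILTERS])
        out.append(row)
    return out

def block_idct(coeffs):
    """Inverse DCT: the matching transposed convolution (same filters)."""
    hb, wb = len(coeffs), len(coeffs[0])
    image = [[0.0] * (wb * N) for _ in range(hb * N)]
    for i in range(hb):
        for j in range(wb):
            for c, f in zip(coeffs[i][j], FILTERS):
                for y in range(N):
                    for x in range(N):
                        image[i * N + y][j * N + x] += c * f[y][x]
    return image
```

Because the filters are orthonormal, `block_idct(block_dct(img))` reproduces `img` up to floating-point error, and a 16x16 input becomes a 2x2 grid of 64-channel vectors.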

Compared to the earlier Uber paper, our CVPR paper makes two big advances:

(1) Cleaner architecture: The Uber paper uses a CNN, while we use a ViT. It's kind of awkward to modify an existing CNN architecture to accept DCT instead of RGB since the grayscale data is 8x lower resolution than RGB, and the color information is 16x lower than RGB. With a CNN, you need to add extra layers to deal with the downsampled input, and use some kind of fusion mechanism to fuse the luma/chroma data of different resolution. With a ViT it's very straightforward to accept DCT input; you only need to change the patch embedding layer, and the body of the network is unchanged.

(2) Data augmentation: The original Uber paper only showed speedup during inference. During training they need to perform data augmentation, so they convert DCT to RGB, augment in RGB, then convert back to DCT to feed the augmented data to the model. This means that their approach will be slower during training vs an RGB model. In our paper we show how to perform all standard image augmentations directly in DCT, so we can get speedups during both training and inference.
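
As a small illustration of what "augmenting directly in DCT" can look like, here is a sketch of a horizontal flip done on coefficients. It relies on a standard DCT-II identity (mirroring a block negates its odd horizontal frequencies) and is not taken from the paper; the paper's implementation may differ, and other augmentations such as resizing are more involved. It assumes an orthonormal DCT and blocks indexed as `coeffs[u][v]` with `v` the horizontal frequency; the function names are hypothetical.

```python
import math

N = 8

# Orthonormal 8-point DCT-II matrix, D[k][n].
D = [[(math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
      * math.cos((2 * n + 1) * k * math.pi / (2 * N))
      for n in range(N)] for k in range(N)]

def dct2(block):
    """2D DCT of one 8x8 pixel block: coeffs = D @ block @ D^T."""
    tmp = [[sum(D[u][y] * block[y][x] for y in range(N)) for x in range(N)]
           for u in range(N)]
    return [[sum(tmp[u][x] * D[v][x] for x in range(N)) for v in range(N)]
            for u in range(N)]

def hflip_dct_block(coeffs):
    """Mirror one block left-to-right by negating odd horizontal frequencies."""
    return [[-c if v % 2 else c for v, c in enumerate(row)] for row in coeffs]

def hflip_dct_image(blocks):
    """Flip a grid of blocks: reverse block order in each row, flip each block."""
    return [[hflip_dct_block(b) for b in reversed(row)] for row in blocks]
```

The result of `hflip_dct_block(dct2(block))` matches `dct2` of the mirrored pixel block, so no round trip through pixels is needed for this particular augmentation.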

Happy to answer any questions about the project!

1 comments

> Decoding DCT to RGB is essentially an 8x8 stride 8 convolution -- it seems wasteful to perform this operation on CPU for data loading

Then why not do it on the GPU? Feels like exactly the sort of thing it was designed to do.

Or alternatively, use nvjpeg?

This makes sense in theory, but is hard to get working in practice.

We tried using nvjpeg to do JPEG decoding on GPU as an additional baseline, but using it as a drop-in replacement in a standard training pipeline gives huge slowdowns for a few reasons:

(1) Batching: nvjpeg isn't batched; you need to decode one at a time in a loop. This is slow but could in principle be improved with a better GPU decoder.

(2) Concurrent data loading / model execution: In a standard training pipeline, the CPU loads and augments data for the next batch in parallel with the GPU running the model forward / backward on the current batch. Using the GPU for decoding blocks it from running the model concurrently. If you were careful I think you could probably find a way to interleave JPEG decoding and model execution on the GPU, but it's not straightforward. Just naively swapping out to use nvjpeg in a standard PyTorch training pipeline gives very bad performance.

(3) Data augmentation: If you do DCT -> RGB decoding on the GPU, then you have to think about how and where to do data augmentation. You can augment in DCT either on CPU or on GPU; however DCT augmentation tends to be more expensive than RGB augmentation (especially for resize operations), so if you are already going through the trouble of decoding to RGB then it's probably much cheaper to augment in RGB. If you augment in RGB on GPU, then you are blocking parallel model execution for both JPEG decoding and augmentation, and problem (2) gets even worse. If you do RGB augmentation on CPU, you end up with an extra GPU -> CPU -> GPU round trip on every model iteration, which again reduces performance.
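
The overlap described in (2) can be sketched with a background thread that prepares the next batches while the consumer works on the current one. This is only a toy under stated assumptions: `load_batches` and `train_step` are hypothetical stand-ins for CPU decode/augment and for the model's forward/backward pass, and in PyTorch this role is normally played by `DataLoader` worker processes rather than a hand-written thread.

```python
import queue
import threading

def prefetch(batches, depth=2):
    """Yield items from `batches`, producing them in a background thread."""
    q = queue.Queue(maxsize=depth)   # at most `depth` batches wait in memory
    done = object()                  # sentinel marking the end of the stream

    def worker():
        try:
            for b in batches:        # "CPU side": load + augment next batch
                q.put(b)
        finally:
            q.put(done)              # always unblock the consumer

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item

def load_batches(n):
    # Hypothetical stand-in for JPEG decoding and augmentation on the CPU.
    for i in range(n):
        yield [i * 4 + k for k in range(4)]

def train_step(batch):
    # Hypothetical stand-in for a forward / backward pass on the GPU.
    return sum(batch)

# While train_step handles batch i, the worker is already preparing batch i+1.
losses = [train_step(b) for b in prefetch(load_batches(3))]
```

If decoding is moved onto the GPU, the producer and the consumer in this picture compete for the same device, which is the contention the comment describes.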

I'm just a low tier ML engineer, but I'd say you generally want to avoid splitting GPU resources over many libraries, to the extent it's even practically possible.
Could you parallelise your parallel processors? I.e., offload this work to a separate (perhaps not even as beefy) GPU.

Akin to streamers having one GPU that they use for gaming and a second GPU used for encoding their stream.