I'm one of the authors of this CVPR paper -- cool to see our work mentioned on HN!

The Uber paper from 2018 is one that has been floating around in the back of my head for a while. Decoding DCT to RGB is essentially an 8x8 stride 8 convolution -- it seems wasteful to perform this operation on CPU for data loading, then immediately pass the decoded RGB into convolution layers that probably learn filters similar to those used during DCT decoding anyway.

Compared to the earlier Uber paper, our CVPR paper makes two big advances:

(1) Cleaner architecture: The Uber paper uses a CNN, while we use a ViT. It's kind of awkward to modify an existing CNN architecture to accept DCT instead of RGB, since the grayscale data is 8x lower resolution than RGB and the color information is 16x lower than RGB. With a CNN, you need to add extra layers to deal with the downsampled input, and use some kind of fusion mechanism to combine the luma/chroma data of different resolutions. With a ViT it's very straightforward to accept DCT input; you only need to change the patch embedding layer, and the body of the network is unchanged.

(2) Data augmentation: The original Uber paper only showed a speedup during inference. During training they need to perform data augmentation, so they convert DCT to RGB, augment in RGB, then convert back to DCT to feed the augmented data to the model. This means their approach will be slower during training than an RGB model. In our paper we show how to perform all standard image augmentations directly in DCT, so we get speedups during both training and inference.

Happy to answer any questions about the project!
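To make the "8x8 stride 8 convolution" remark concrete, here is a minimal NumPy sketch of just the inverse-DCT step. It is an illustration, not code from the paper: the function names (`dct_matrix`, `block_dct`, `block_idct_as_filters`) are made up for this example, and it assumes an orthonormal DCT-II on a single channel whose height and width are multiples of 8. Dequantization, the level shift, chroma upsampling, and the YCbCr-to-RGB conversion that a real JPEG decoder also performs are left out.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II matrix: row k is the k-th cosine basis vector.
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

D = dct_matrix(8)

def block_dct(img):
    # img: (H, W) -> coefficients of shape (H/8, W/8, 8, 8), one 8x8 grid per block.
    h, w = img.shape
    blocks = img.reshape(h // 8, 8, w // 8, 8).transpose(0, 2, 1, 3)
    return D @ blocks @ D.T

def block_idct_as_filters(coeffs):
    # 64 fixed 8x8 filters, one per coefficient (u, v): basis[u, v] = outer(D[u], D[v]).
    basis = np.einsum("ui,vj->uvij", D, D)
    # The same 64 filters are applied to every block, and each block writes a
    # non-overlapping 8x8 output tile, i.e. a stride of 8.
    blocks = np.einsum("bcuv,uvij->bcij", coeffs, basis)
    bh, bw = coeffs.shape[:2]
    return blocks.transpose(0, 2, 1, 3).reshape(bh * 8, bw * 8)

if __name__ == "__main__":
    img = np.random.default_rng(0).standard_normal((16, 24))
    rec = block_idct_as_filters(block_dct(img))
    print(np.allclose(rec, img))
```

Because the weights are fixed and shared across blocks, the decode step is a linear layer with 64 input channels per block, which is why it seems natural to fold it into the first layer of the network rather than run it separately on the CPU.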
Then why not do it on the GPU? Feels like exactly the sort of thing it was designed to do.
Or alternatively, use nvjpeg?