I did a review of PixelCNN as part of my summer internship. It covers a bit about how careful masking can be used to create a chain of conditional probabilities [0], which AFAIK is exactly how this "causal convolution" works (an output can't depend on anything in the 'future'). The PixelCNN and PixelRNN papers also cover this in a fair bit of detail. Ishaan Gulrajani's code is also a great implementation reference for PixelCNN / masking [1].
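
To make the masking idea concrete, here is a minimal NumPy sketch. It is my own illustration, not code from [0] or [1]. The function names `pixelcnn_mask` and `causal_conv1d` are made up for this example, and the mask ignores the per-channel (R, G, B) ordering that the PixelCNN paper also handles. Type "A" hides the centre pixel (first layer), type "B" allows it (later layers); both zero out everything to the right of and below the centre, so each output only sees pixels that come earlier in raster order.

```python
import numpy as np


def pixelcnn_mask(kernel_size, mask_type="A"):
    """Return a (k, k) 0/1 mask over a square kernel (hypothetical helper).

    Positions above the centre row, and left of the centre in the centre
    row, are kept. The centre itself is kept only for mask type "B".
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    if mask_type not in ("A", "B"):
        raise ValueError("mask_type must be 'A' or 'B'")
    c = kernel_size // 2
    mask = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    mask[:c, :] = 1.0   # every row above the centre
    mask[c, :c] = 1.0   # centre row, strictly left of the centre
    if mask_type == "B":
        mask[c, c] = 1.0  # later layers may see the current position
    return mask


def causal_conv1d(x, w):
    """1-D analogue: y[t] = sum_i w[i] * x[t - i], zeros before the start."""
    x = np.asarray(x, dtype=np.float64)
    w = np.asarray(w, dtype=np.float64)
    k = len(w)
    # Left-pad only, so the window ending at t never reaches past t.
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(w[::-1], padded[t:t + k]) for t in range(len(x))])


if __name__ == "__main__":
    print(pixelcnn_mask(3, "A"))
    print(pixelcnn_mask(3, "B"))
    print(causal_conv1d([1.0, 2.0, 3.0, 4.0], [0.5, 0.25]))
```

In a real layer you would multiply the convolution weights by this mask before every forward pass, so the masked weights stay effectively zero during training. The 1-D version is the same trick seen from the padding side: left-only padding means changing a later input never changes an earlier output.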