
My intuition is that the doubling up to 512 does increase the receptive field, but what you're essentially building is a non-linear convolutional filter with a kernel size of 1024. The network benefits from stacking several of these groups because each group can again convolve over the previous group's outputs at every temporal distance, which allows it to learn deeper, higher-level features. It is similar to the stacked 2D convolutions used for images, where each subsequent convolutional layer learns more abstract, higher-level features of the data. This is just intuition though; there is no evidence yet that this holds for WaveNet's architecture.
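
To make the 1024 figure concrete, here is a rough sketch of the receptive-field arithmetic. It assumes causal convolutions with kernel size 2 and dilations 1, 2, 4, ..., 512 per group; the function name `receptive_field` is just something I made up for illustration, not anything from a WaveNet codebase.

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field, in samples, of a stack of dilated convolutions.

    Each layer with dilation d adds (kernel_size - 1) * d samples
    on top of the single sample the output starts from.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# One group: dilations double from 1 up to 512 (assumed kernel size 2).
block = [2 ** i for i in range(10)]

one_block = receptive_field(block)         # a single group
three_blocks = receptive_field(block * 3)  # three groups stacked

print(one_block, three_blocks)  # prints: 1024 3070
```

Note that stacking groups only grows the receptive field linearly (about 1023 extra samples per group), so by this arithmetic the extra groups add depth much faster than they add reach, which fits the intuition above.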