Overview: This paper proposes the Gated PixelCNN, which improves on the speed of PixelCNN and achieves state-of-the-art (SOTA) results in image generation/modeling.

Image modeling and raster scan: As in language modeling, image modeling is the problem of parametrising the probability distribution of images given a training set. One major challenge is the 2D structure of an image: pixels do not come with a natural sequential order. To overcome this, a raster scan order is imposed, in which pixels are traversed row by row. With $x_t$ denoting the $t$-th pixel in this order, the joint probability of an $n \times n$ image is given by:
\begin{equation}
p(x \mid \theta) = \prod_{t=1}^{n^2} p(x_t \mid x_{<t}, \theta)
\end{equation}
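To make the factorisation concrete, here is a minimal sketch in Python. The conditional `toy_conditional` is a hypothetical stand-in (Laplace's rule of succession on binary pixels), not the network from the paper; it only serves to show that multiplying conditionals in raster-scan order yields a properly normalised distribution over images.

```python
import itertools
import math


def raster_scan(image):
    """Flatten a 2D image (list of rows) row by row, left to right."""
    return [pixel for row in image for pixel in row]


def toy_conditional(pixel, prefix):
    """Hypothetical p(x_t | x_<t) for binary pixels (not the paper's model).

    The probability of a 1 grows with the number of 1s already seen.
    """
    p_one = (1 + sum(prefix)) / (2 + len(prefix))
    return p_one if pixel == 1 else 1.0 - p_one


def log_likelihood(image, conditional=toy_conditional):
    """log p(x) = sum over t of log p(x_t | x_<t), in raster-scan order."""
    pixels = raster_scan(image)
    return sum(
        math.log(conditional(pixels[t], pixels[:t]))
        for t in range(len(pixels))
    )


def total_probability(n):
    """Sum p(x) over every binary n x n image; should be 1."""
    total = 0.0
    for bits in itertools.product([0, 1], repeat=n * n):
        image = [list(bits[r * n:(r + 1) * n]) for r in range(n)]
        total += math.exp(log_likelihood(image))
    return total
```

Because each conditional is itself a valid distribution over one pixel, `total_probability(2)` sums to 1 (up to floating-point error) without any explicit normalisation constant, which is the main appeal of autoregressive models.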

Gated PixelCNN: PixelCNN was proposed in the original paper, Pixel Recurrent Neural Networks. Although it is computationally cheaper than PixelRNN, it does not match PixelRNN in log-likelihood. This paper improves PixelCNN in the following aspects:

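Gated activation: The non-linearity is replaced by a gated activation unit, the element-wise product of a tanh output and a sigmoid gate: $y = \tanh(W_{k,f} * x) \odot \sigma(W_{k,g} * x)$, where $*$ denotes convolution, $\odot$ the element-wise product, $\sigma$ the sigmoid, $k$ the layer index, and $f$, $g$ the filter and gate feature maps.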
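The gated activation can be sketched in a few lines of NumPy. This sketch assumes the convolutions $W_{k,f} * x$ and $W_{k,g} * x$ have already been computed by a single layer that outputs $2p$ feature maps, which are then split into a filter half and a gate half; the function name and array layout are illustrative choices, not taken from the paper's code.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def gated_activation(conv_out):
    """Apply y = tanh(f) * sigmoid(g) to a pre-computed convolution output.

    conv_out has shape (2 * p, H, W). The first p maps are treated as the
    filter pre-activation f, the last p maps as the gate pre-activation g
    (an assumed layout).
    """
    f, g = np.split(conv_out, 2, axis=0)
    return np.tanh(f) * sigmoid(g)
```

When the gate pre-activation is strongly positive the unit behaves like a plain tanh, and when it is strongly negative the output is suppressed towards zero, which is the multiplicative behaviour that LSTM-style gates provide.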

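Vertical and horizontal stacks to remove the blind spot: The masked convolutions used in PixelCNN leave a blind spot in the receptive field: a region above and to the right of the current pixel that never influences its prediction, no matter how many layers are stacked. The fix is to combine two convolutional stacks with differently shaped masked kernels: a vertical stack that takes in the rectangular region of rows above the current pixel, and a horizontal stack that takes in the K pixels to its left in the current row.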
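The blind spot can be made concrete with a small reachability computation. The sketch below is a simplified model, not the paper's implementation: it assumes 3x3 kernels, tracks only which input positions can influence a given output pixel, and models the horizontal stack as reading the vertical stack one row up. All names (`MASKED_3X3`, `pixelcnn_context`, `gated_context`) are illustrative.

```python
# Offsets (d_row, d_col) a single layer may read from, relative to the output.
MASKED_3X3 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 0)]
VERTICAL_2X3 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 0), (0, 1)]
HORIZONTAL_1X2 = [(0, -1), (0, 0)]


def in_grid(r, c, size):
    return 0 <= r < size and 0 <= c < size


def step(positions, offsets, size):
    """Positions readable one layer further down from `positions`."""
    return {
        (r + dr, c + dc)
        for (r, c) in positions
        for (dr, dc) in offsets
        if in_grid(r + dr, c + dc, size)
    }


def pixelcnn_context(start, layers, size):
    """Input pixels visible from `start` through stacked masked 3x3 convs."""
    frontier = {start}
    for _ in range(layers):
        frontier = step(frontier, MASKED_3X3, size)
    return frontier - {start}  # the first-layer mask hides the pixel itself


def gated_context(start, layers, size):
    """Input pixels visible from `start` with vertical + horizontal stacks."""
    horizontal, vertical = {start}, set()
    for _ in range(layers):
        new_vertical = step(vertical, VERTICAL_2X3, size)
        # The horizontal stack reads the vertical stack from the row above.
        new_vertical |= step(horizontal, [(-1, 0)], size)
        horizontal = step(horizontal, HORIZONTAL_1X2, size)
        vertical = new_vertical
    return (horizontal | vertical) - {start}


def preceding(start, size):
    """All pixels that come before `start` in raster-scan order."""
    return {
        (r, c) for r in range(size) for c in range(size) if (r, c) < start
    }


def blind_spot(start, layers, size):
    """Preceding pixels that stacked masked convolutions can never see."""
    return preceding(start, size) - pixelcnn_context(start, layers, size)
```

Under these assumptions, on a 5x5 grid with the current pixel at `(2, 2)`, `blind_spot((2, 2), 10, 5)` contains the single pixel `(1, 4)`, one row up and two columns to the right, while `gated_context((2, 2), 10, 5)` covers every preceding pixel.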

Conditioning: A latent vector $h$ (e.g. an encoded representation or a class label) is passed to the gated non-linearity of every layer. The conditioning can be chosen to be either location-independent or location-dependent.
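For the location-independent case, the conditional gated unit takes the form $y = \tanh(W_{k,f} * x + V_{k,f}^T h) \odot \sigma(W_{k,g} * x + V_{k,g}^T h)$, so $h$ acts as a per-feature-map bias shared across all pixel positions. A minimal NumPy sketch follows; as assumptions of this sketch, the convolution outputs are pre-computed and stacked into one array, and the names `V_f`, `V_g` and their shapes are illustrative.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def conditional_gated_activation(conv_out, h, V_f, V_g):
    """Gated activation with a location-independent conditioning vector.

    conv_out: (2 * p, H, W) pre-computed convolution output (filter maps
              first, gate maps second; an assumed layout).
    h:        (d,) conditioning vector, e.g. a one-hot class label.
    V_f, V_g: (d, p) projection matrices for the filter and gate paths.
    """
    f, g = np.split(conv_out, 2, axis=0)
    # Project h to one bias per feature map, broadcast over all positions.
    bias_f = (V_f.T @ h)[:, None, None]
    bias_g = (V_g.T @ h)[:, None, None]
    return np.tanh(f + bias_f) * sigmoid(g + bias_g)
```

With a one-hot `h`, the projection simply selects one row of `V_f` and `V_g`, so each class contributes its own learned bias to every layer. The location-dependent variant would replace these broadcast biases with a spatial map of the same height and width as the features.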

Result: The Gated PixelCNN model achieves SOTA on unconditional image modeling (log-likelihood on CIFAR-10 and ImageNet). Conditioning on classes does not further improve the log-likelihood, but the generated samples look visually better. Experiments on portraits show that interpolating between embeddings translates smoothly into interpolation between the generated images. The end-to-end autoencoder results also show better performance than PixelCNN.