StyleSwin: Transformer-based GAN for High-resolution Image Generation
CVPR-2022, University of Science and Technology of China & Microsoft Research Asia.
This post covers the recent paper StyleSwin by Bowen Zhang et al., which achieves state-of-the-art results in high-resolution image synthesis and takes its origins from Swin Transformer and StyleGAN. Along the way, I will also discuss in detail the ideas inspired by these reference papers.
Table of Contents
1. Overview
1.1 Generator Overview
1.2 Discriminator Overview
2. Related Works
2.1 Swin Transformer
2.1.1 Architecture Details
2.1.2 Window Based Local Attention
3. StyleSwin Generator
3.1 Style Injection
3.2 Double Attention
4. Blocking Artifacts
4.1 Wavelet Discriminator
5. Conclusion
1. Overview
The main motivation of the paper is to create a pure transformer-based generative model that can compete against convolution-based generative models on the high-resolution image synthesis task.
1.1 Generator Overview
The authors use Swin Transformer blocks in the generator along with a style injection mechanism. Swin Transformer uses window-based Multi-Head Self Attention (MSA), which has linear cost with respect to image size, as compared to the global attention used in ViT with quadratic cost. This complexity reduction comes at a price: since attention is applied in a window-by-window manner, spatial coherency is broken, and this results in blocking artifacts at inference time. As an attempt to enlarge the receptive field and consequently capture long-range dependencies, the authors propose double attention.
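To make the linear-vs-quadratic claim concrete, here is a back-of-the-envelope cost comparison (illustrative only, not from the paper): with a fixed window, window attention's cost grows linearly in the number of tokens, so the savings over global attention grow with resolution.

```python
# Illustrative FLOP estimate: global self-attention scales quadratically with
# the number of tokens, while window attention scales linearly, because each
# token only attends within a fixed-size window.
def global_attention_cost(n_tokens: int, dim: int) -> int:
    # QK^T plus the attention-weighted sum over V: both O(n^2 * d)
    return 2 * n_tokens * n_tokens * dim

def window_attention_cost(n_tokens: int, dim: int, window: int = 8) -> int:
    # n / w^2 windows, each costing O(w^4 * d), so O(n * w^2 * d) overall
    n_windows = n_tokens // (window * window)
    return n_windows * 2 * (window * window) ** 2 * dim

for side in (32, 64, 128):  # feature-map side length, in tokens
    n = side * side
    ratio = global_attention_cost(n, 64) / window_attention_cost(n, 64)
    print(f"{side}x{side} tokens -> global is {ratio:.0f}x more expensive")
```

The gap (here n / w² for an 8-token window) widens as resolution grows, which is exactly why window attention becomes attractive at 1024 × 1024.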
1.2 Discriminator Overview
The discriminator is not transformer based but convolution based. This transition from a transformer-based to a convolution-based discriminator trades some modelling capacity for training stability. To compensate for the reduced capacity, the authors perform several ablation studies covering style injection, double attention, and local-global positional encoding. In addition, a wavelet discriminator is used to tackle the blocking artifacts.
2. Related Works
To grasp the idea better, let's have a look at the related works that helped construct StyleSwin.
2.1 Swin Transformer
Swin Transformer is, like StyleSwin, a work of Microsoft Research Asia, and it was awarded Best Paper at ICCV 2021. The motivation behind Swin Transformer is to serve as a general-purpose backbone for computer vision tasks. By the time Swin Transformer was introduced, it had surpassed the state-of-the-art results in several computer vision domains by large margins.
2.1.1 Architecture Details
Swin takes its name from Shifted Windows, the key modification over previous vision transformers. Swin Transformer progressively produces feature maps of smaller resolution (smaller spatial size) while increasing the channel size. The design is similar to a convolutional neural network: at each successive stage, the spatial dimensions are halved while the channel dimension is doubled. Let's visualize this process to better understand it.
From now on, each patch can be seen as a token that corresponds to a [4 x 4] part of the original image. Similar to ViT, each patch is fed to a linear layer to obtain an embedding of the desired size C.
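The patch-embedding step can be sketched as follows (a minimal NumPy sketch with hypothetical sizes; a 224 × 224 input and C = 96 match common Swin configurations, but any sizes divisible by 4 would do):

```python
import numpy as np

# Sketch of patch embedding: a 224x224 RGB image is split into non-overlapping
# 4x4 patches; each patch is flattened to a 48-dim vector (4*4*3) and linearly
# projected to an embedding of size C.
H = W = 224
P = 4          # patch size
C = 96         # embedding dimension
image = np.random.rand(H, W, 3)

# Partition into (H/P) * (W/P) patches, each flattened to P*P*3 values.
patches = image.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)       # (3136, 48)

projection = np.random.rand(P * P * 3, C)      # stands in for the learned linear layer
tokens = patches @ projection                  # (3136, 96): one token per 4x4 patch
print(tokens.shape)
```

Each of the 3136 rows is one token covering a 4 × 4 area of the original image.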
Then, these tokens are given to a Transformer Block, which contains Window Multi-Head Self Attention (W-MSA) and Shifted Window Multi-Head Self Attention (SW-MSA) modules. After the Transformer Blocks, a Patch Merging operation merges [2 x 2] neighboring patches. See the merging operation below.
Below, you see the above idea applied to tokens. Before the Swin Transformer Block, there are 16 tokens in total. The Transformer Block then outputs 4 tokens, each of which corresponds to an [8 x 8] part of the original image. Although the number of tokens is reduced by a factor of 2 in each spatial dimension (height and width), the number of channels is doubled.
An important note here: after the Patch Merging operation, each merged patch is again split into 4 x 4 parts. See Figure 2.a and the visualization below.
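The merging itself can be sketched in a few lines (hypothetical sizes; the 4C → 2C linear reduction follows the Swin Transformer design described above):

```python
import numpy as np

# Sketch of Patch Merging: each 2x2 group of neighboring tokens is concatenated
# along the channel axis (C -> 4C), then a linear layer reduces 4C -> 2C,
# halving the spatial resolution while doubling the channels.
h = w = 56
C = 96
tokens = np.random.rand(h, w, C)

merged = np.concatenate(
    [tokens[0::2, 0::2], tokens[1::2, 0::2],
     tokens[0::2, 1::2], tokens[1::2, 1::2]],
    axis=-1,
)
print(merged.shape)                      # (28, 28, 384) = (h/2, w/2, 4C)

reduction = np.random.rand(4 * C, 2 * C) # stands in for the learned linear reduction
out = merged @ reduction
print(out.shape)                         # (28, 28, 192) = (h/2, w/2, 2C)
```

Note how the concatenation alone would quadruple the channels; the linear reduction is what keeps the doubling pattern of the hierarchy.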
If we apply one more Swin Transformer Block, the resulting feature map will span the entire image, and there will be a single token of size 4C at the end.
Below you see the above idea applied to the model architecture. As a reference for the merging representation above, you may also take a look at Figure 2.a.
So far, we have seen how the dimensions of the linear embeddings (tokens, in other words) change across successive Transformer Blocks. In summary, just like in ConvNets, the Swin Transformer output is built hierarchically, halving the spatial size and doubling the channel size at each stage.
Now, it is time to understand how the attention mechanism works in Swin Transformer.
2.1.2 Window Based Local Attention
As denoted in the visualizations above, the Transformer Block contains W-MSA and SW-MSA blocks. They are exactly the same in architecture; the only difference is the type of attention applied to the input tokens.
What distinguishes local attention from global attention is that in local attention, the tokens that attend to each other are restricted to a fixed area.
A modified version of local attention is applied in Swin Transformer, consisting of two processes:
- Window Multi-Head Self Attention (W-MSA)
- Shifted Window Multi-Head Self Attention (SW-MSA)
Figure 13 represents the main idea of W-MSA.
But there is an important problem with the W-MSA approach: patches located in different local windows have no effect on each other's attention values; in other words, information is restricted to the area each local window spans. To overcome this issue, the shifted window approach is proposed. The local windows again span non-overlapping patches, but in this case the spanned patches are not the same as those spanned in W-MSA.
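The two partitionings can be sketched as follows (attention itself is omitted; this only shows how SW-MSA groups tokens differently from W-MSA via a cyclic shift of half the window size, which is how Swin implements the shifted partition efficiently):

```python
import numpy as np

def partition_windows(x, window):
    """Split an (h, w, c) token map into non-overlapping attention windows."""
    h, w, c = x.shape
    x = x.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)

h = w = 8
window = 4
x = np.arange(h * w, dtype=float).reshape(h, w, 1)  # toy token map

# W-MSA: windows on the regular grid.
regular = partition_windows(x, window)

# SW-MSA: cyclically shift by half a window before partitioning, so tokens
# that fell into different W-MSA windows can now attend to each other.
shifted = partition_windows(
    np.roll(x, shift=(-window // 2, -window // 2), axis=(0, 1)), window
)
print(regular.shape, shifted.shape)  # (4, 16, 1) each: 4 windows of 16 tokens
```

Both partitions contain the same tokens overall, but grouped differently, which is exactly what lets information cross window borders over two successive blocks.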
This concludes our discussion of Swin Transformer. Now we will take a look at the StyleSwin generator architecture and the design choices that lead to state-of-the-art results.
3. StyleSwin Generator
This section investigates the generator architecture and its operations in detail.
3.1 Style Injection
A latent code sampled from a normal distribution is given to the style injection layer, which consists of 8 fully connected layers. The obtained style vector is then fed twice (once for each Adaptive Instance Normalization layer) into each transformer block after a linear mapping; hence the block is named “A”, which stands for affine.
Why not give the style vector directly instead of applying a linear mapping?
To find the answer, we first need to look at the operation performed by the Adaptive Instance Normalization (AdaIN) layer:
To better understand the process performed by AdaIN Layer, suppose that we have the following tensor, x, as:
Then, for each separate feature map, the mean and variance are calculated. For one feature map, the operations performed by the AdaIN layer are as follows:
Now, we come back to the question: why not give the style vector directly instead of applying a linear mapping? Since each channel (feature map) is aligned differently, we need to learn a different mean and variance for each separate channel using the style vector. The style vector lives in a W-dimensional space (the notation used in the paper, see Figure 16), and the affine mapping maps it to a space sized by the channel dimension C, so that each channel receives its own statistics.
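The whole path, style vector → affine mapping → per-channel AdaIN, can be sketched as follows (all sizes hypothetical; the affine map here produces a scale and a bias per channel, i.e. 2C outputs):

```python
import numpy as np

# Sketch of style injection via AdaIN: the style vector w is affinely mapped
# ("A" block) to one scale and one bias per channel; each feature map is then
# normalized by its own mean/std and re-styled with those learned statistics.
C, H, W_ = 4, 8, 8
w_size = 16                          # dimensionality of the W space
x = np.random.rand(C, H, W_)         # feature maps entering AdaIN
w = np.random.rand(w_size)           # style vector from the 8-layer MLP

A = np.random.rand(w_size, 2 * C)    # stands in for the learned affine mapping
style = w @ A                        # (2C,): scale + bias for every channel
y_scale, y_bias = style[:C], style[C:]

mu = x.mean(axis=(1, 2), keepdims=True)       # per-channel mean
sigma = x.std(axis=(1, 2), keepdims=True)     # per-channel std
normalized = (x - mu) / (sigma + 1e-8)
out = y_scale[:, None, None] * normalized + y_bias[:, None, None]
print(out.shape)                     # (4, 8, 8): same shape, new statistics
```

After AdaIN, each output channel's mean equals its style bias and its standard deviation its style scale, which is precisely how the style controls per-channel statistics.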
3.2 Double Attention
In order to achieve larger receptive field and obtain improved generation quality, authors propose Double Attention mechanism.
In the Swin Transformer section, we saw that the basic repeated unit consists of 2 transformer blocks whose only difference is the type of attention applied, i.e., W-MSA & SW-MSA. That design requires 2 distinct transformer blocks, whereas in StyleSwin the process is modified so that a single transformer block uses both W-MSA and SW-MSA. Let's see how it is done.
In Multi-Head Self Attention, the N attention heads are divided into two halves: the first half computes W-MSA and the second half computes SW-MSA. The results are concatenated, and a projection matrix is applied to the concatenated head outputs to mix them.
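A simplified sketch of double attention (one head per branch, no Q/K/V projections, illustrative only): half the channels go through regular-window attention, the other half through shifted-window attention, and a projection mixes the concatenation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def window_attn(x, window, shift):
    """Toy self-attention inside non-overlapping windows (no Q/K/V weights)."""
    h, w, c = x.shape
    if shift:  # SW-MSA branch: cyclic shift by half a window
        x = np.roll(x, (-window // 2, -window // 2), axis=(0, 1))
    win = x.reshape(h // window, window, w // window, window, c)
    win = win.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)
    out = softmax(win @ win.transpose(0, 2, 1) / np.sqrt(c)) @ win
    out = out.reshape(h // window, w // window, window, window, c)
    out = out.transpose(0, 2, 1, 3, 4).reshape(h, w, c)
    return np.roll(out, (window // 2, window // 2), axis=(0, 1)) if shift else out

h = w = 8
C = 16
x = np.random.rand(h, w, C)
first_half = window_attn(x[..., :C // 2], 4, shift=False)   # "W-MSA heads"
second_half = window_attn(x[..., C // 2:], 4, shift=True)   # "SW-MSA heads"
proj = np.random.rand(C, C)                                 # output projection
out = np.concatenate([first_half, second_half], axis=-1) @ proj
print(out.shape)                                            # (8, 8, 16)
```

The key point is that one forward pass of one block now sees both window partitions, rather than needing two successive blocks as in Swin Transformer.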
4. Blocking Artifacts
The generator architecture in Figure 16 yields state-of-the-art results on 256 x 256 images. However, when it comes to synthesizing 1024 x 1024 images, blocking artifacts occur.
The authors attribute the artifacts to the window-based attention mechanism.
“Hence, we are certain it is the window-wise processing that breaks the spatial coherency and causes the blocking artifacts.”
To support this reasoning, they applied window-based attention to 1D continuous data with strided windows, using a single attention head and a random projection matrix, and found the following:
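A rough re-creation of that 1D probe (my own sketch, not the paper's code): applying attention window by window to a smooth signal tends to produce jumps at window boundaries, the 1D analogue of the 2D blocking artifacts.

```python
import numpy as np

# Window-wise attention over a smooth 1D signal: one attention head,
# random projection, strided non-overlapping windows.
np.random.seed(0)
n, window, dim = 64, 8, 1
signal = np.sin(np.linspace(0, 2 * np.pi, n))[:, None]   # smooth input

proj = np.random.randn(dim, dim)                         # random projection
out = np.empty_like(signal)
for start in range(0, n, window):
    chunk = signal[start:start + window] @ proj
    scores = chunk @ chunk.T / np.sqrt(dim)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    out[start:start + window] = attn @ chunk             # mixing stays in-window

# Compare step sizes at window borders vs. inside windows.
jumps = np.abs(np.diff(out[:, 0]))
border = jumps[window - 1::window].mean()
inner = np.delete(jumps, np.arange(window - 1, n - 1, window)).mean()
print("mean jump at borders:", border, "| inside windows:", inner)
```

Because each output value is a mixture of values from its own window only, nothing constrains adjacent windows to agree at their shared border, which is where the discontinuities appear.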
4.1 Wavelet Discriminator
In the Discriminator Overview section, we pointed out that the discriminator has a tremendous effect on training stability, which is why the architecture is convolution based rather than transformer based.
The blocking artifacts can be clearly seen in the Fourier spectrum, where they follow a periodic pattern.
The wavelet discriminator deals with the blocking artifacts remarkably well. Above you see the proposed discriminator architecture. The discriminator downsamples the input image with convolutions and, at each stage, checks the frequency discrepancy relative to real images after a discrete wavelet transform (DWT).
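To make the DWT step concrete, here is one level of a 2D Haar wavelet transform (a minimal sketch, assuming the simplest Haar filters; the actual discriminator's wavelet choice follows its reference implementation): it splits an image into a low-frequency band (LL) and three high-frequency bands (LH, HL, HH), where periodic high-frequency patterns such as blocking artifacts show up clearly.

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2D Haar DWT: returns LL, LH, HL, HH sub-bands."""
    a, b = x[0::2, :], x[1::2, :]          # pair adjacent rows
    lo, hi = (a + b) / 2, (a - b) / 2      # vertical low-pass / high-pass
    def split_cols(y):
        c, d = y[:, 0::2], y[:, 1::2]      # pair adjacent columns
        return (c + d) / 2, (c - d) / 2    # horizontal low-pass / high-pass
    LL, LH = split_cols(lo)
    HL, HH = split_cols(hi)
    return LL, LH, HL, HH

img = np.random.rand(8, 8)
LL, LH, HL, HH = haar_dwt2(img)
print(LL.shape, LH.shape, HL.shape, HH.shape)  # four (4, 4) sub-bands
```

A perfectly smooth image puts all its energy in LL; the periodic window-boundary pattern of blocking artifacts instead leaks into LH/HL/HH, which is the discrepancy the wavelet discriminator can penalize at each downsampling stage.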
5. Conclusion
Our discussion of StyleSwin ends here. In this blog post, we reviewed the StyleSwin paper, which obtains state-of-the-art results on the high-resolution image generation task. Please follow the Arxiv link to read the paper, and for the official implementation you can follow the Github link to the source code.