StyleSwin: Transformer-based GAN for High-resolution Image Generation

Hıdır Yeşiltepe
9 min read · Jun 19, 2022


CVPR 2022, University of Science and Technology of China & Microsoft Research Asia.

Figure 1: StyleSwin samples on FFHQ 1024 x 1024 and LSUN Church 256 x 256

This post covers StyleSwin, a recent paper by Bowen Zhang et al. that achieves state-of-the-art results in high-resolution image synthesis and builds on the Swin Transformer and StyleGAN. Throughout the review, I will also explain the ideas borrowed from these reference papers in detail.

Table of Contents

1. Overview
1.1 Generator Overview
1.2 Discriminator Overview
2. Related Works
2.1 Swin Transformer
2.1.1 Architecture Details
2.1.2 Window Based Local Attention
3. StyleSwin Generator
3.1 Style Injection
3.2 Double Attention
4. Blocking Artifacts
4.1 Wavelet Discriminator
5. Conclusion

1. Overview

The main motivation of the paper is to build a pure transformer-based image generative model that can compete with convolution-based generative models on the high-resolution image synthesis task.

1.1 Generator Overview

The authors use Swin Transformer blocks in the generator, together with a style injection mechanism. The Swin Transformer uses window-based Multi-Head Self-Attention (MSA), whose cost is linear in image size, compared to the quadratic cost of the global attention used in ViT. This complexity reduction comes at a price: since attention is applied window by window, spatial coherency is broken, which results in blocking artifacts at inference time. To enlarge the receptive field and thus capture longer-range dependencies, the authors propose double attention.
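For reference, the Swin Transformer paper quantifies this complexity gap for a feature map of h × w tokens with channel dimension C and window size M:

Ω(MSA) = 4hwC² + 2(hw)²C
Ω(W-MSA) = 4hwC² + 2M²hwC

The first term (the QKV and output projections) is shared; the quadratic (hw)² term of global attention becomes linear in hw once attention is restricted to M × M windows.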

1.2 Discriminator Overview

The discriminator is convolution-based rather than transformer-based, since the discriminator architecture has a strong influence on modelling capacity and training stability. To strengthen the purely transformer-based generator, the authors study several components through ablations, including style injection, double attention, and local-global positional encoding. To tackle the blocking-artifact effect, a wavelet discriminator is also used.

2. Related Works

To grasp the idea better, let's have a look at the related work that StyleSwin builds on.

2.1 Swin Transformer

Swin Transformer is also a work of Microsoft Research Asia, as is StyleSwin, and it received the Best Paper award at ICCV 2021. The motivation behind the Swin Transformer is to serve as a general-purpose backbone for computer vision tasks. When it was introduced, it surpassed the state of the art in several computer vision domains by large margins.

Figure 2: (a) Proposed Swin Transformer builds hierarchical feature maps by merging image patches. (b) ViT uses fixed-size image patches.

2.1.1 Architecture Details

Figure 3: Swin Transformer Architecture

Swin takes its name from Shifted Windows, the key modification over previous vision transformers. The Swin Transformer progressively produces feature maps of smaller resolution (smaller spatial size) while increasing the channel dimension. The design resembles a convolutional neural network: at each successive stage the spatial dimensions are halved while the channel dimension is doubled. Let's visualize this process to understand it better.

Figure 4: The given input image is split into fixed-size patches. The Swin Transformer uses a patch size of 4; each patch covers a [4 x 4] region, so the resulting feature map has spatial size [16 x 16].

From now on, each patch can be seen as a token that corresponds to a [4 x 4] region of the original image. Similar to ViT, each patch is passed through a linear layer to obtain an embedding of the desired size C.

Figure 5: Each flattened patch of size [4 x 4 x 3] (3 for RGB) is passed through a linear layer to obtain an embedding of the desired size C. If the input image has spatial size [H x W], there will be (H/4 x W/4) patches in total, and as many tokens (linear embeddings) of dimension C, where 4 is the patch size.
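As a concrete illustration, here is a minimal PyTorch-style sketch of this patch-embedding step (the module name and hyper-parameters are my own choices, not taken from the official code). A strided convolution with kernel size equal to the patch size is equivalent to flattening each patch and applying a shared linear layer:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping 4x4 patches and project each to C channels."""
    def __init__(self, patch_size=4, in_channels=3, embed_dim=96):
        super().__init__()
        # A conv with kernel_size == stride == patch_size is equivalent to
        # flattening each [4 x 4 x 3] patch and applying a shared linear layer.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: [B, 3, H, W]
        x = self.proj(x)                       # [B, C, H/4, W/4]
        return x.flatten(2).transpose(1, 2)    # [B, (H/4)*(W/4), C] tokens

tokens = PatchEmbed()(torch.randn(1, 3, 16, 16))
print(tokens.shape)  # torch.Size([1, 16, 96]) -> 16 tokens of dimension C=96
```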

Then, these tokens are given to the Transformer Block, which contains the Window Multi-Head Self-Attention (W-MSA) and Shifted Window Multi-Head Self-Attention (SW-MSA) modules. After the Transformer Blocks, a Patch Merging operation merges [2 x 2] neighboring patches. See the merging operation below.

Figure 6: At the end of the Swin Transformer Block, the Patch Merging operation takes 2 x 2 neighboring patches, each of spatial size [4 x 4], and merges them into a patch of spatial size [8 x 8]. Note that each merged patch is again split into 4 x 4 sub-patches.

Below, you see the above idea applied to the tokens. Before the Swin Transformer Block there are 16 tokens in total; the block outputs 4 tokens, each corresponding to an [8 x 8] region of the original image. While the number of tokens is reduced by a factor of 2 in each spatial dimension (height and width), the number of channels is doubled.

Figure 7: (Left, before the Swin Transformer block operates) There are (H/4 x W/4) tokens in total (if the initial image is 16 x 16, there are 16 tokens, as in the illustration above), each of dimension C. (Right, after the block operates) There are (H/8 x W/8) tokens in total (if the initial image is 16 x 16, there are 4 tokens), each of dimension 2C.

An important note here: after the Patch Merging operation, each merged patch is again split into 4 x 4 parts. See Figure 2.a and the visualization below.

Figure 8: If the input image is of size 16 x 16, (Left) the split pink patch consists of 4 x 4 sub-patches; there are 16 of them, so each small patch is 1 x 1. (Right) The merged orange patch consists of 4 x 4 sub-patches; there are 4 of them, so each small patch is 2 x 2.

If we apply one more Swin Transformer Block, the resulting feature map spans the entire image, and there is a single token of dimension 4C at the end.

Figure 9: At the end of the Swin Transformer Block, the Patch Merging operation takes 2 x 2 neighboring patches, each of spatial size [8 x 8], and merges them into a patch of spatial size [16 x 16]. Note that each merged patch is again split into 4 x 4 sub-patches.

Below you see the above idea reflected in the model architecture. As a reference for the merging illustration above, you may also take a look at Figure 2.a.

Figure 10: (Before the Swin Transformer block operates) There are (H/8 x W/8) tokens in total (if the initial image is 16 x 16, there are 4 tokens, as in the illustration above), each of dimension 2C. (After the block operates) There are (H/16 x W/16) tokens in total (if the initial image is 16 x 16, there is 1 token), each of dimension 4C.

So far, we have seen how the dimensions of the linear embeddings (in other words, the tokens) change across successive Transformer Blocks. In summary, just as in ConvNets, the Swin Transformer builds its output hierarchically, down-sampling the spatial size and up-sampling the channel dimension by a factor of 2 at each stage.
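To make the merging step itself concrete, here is a minimal sketch in the spirit of the Swin Transformer's patch merging (the class and variable names are mine): each 2 x 2 neighbourhood of tokens is concatenated into a 4C-dimensional vector and then linearly projected down to 2C.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Halve the token grid in each spatial dimension and double the channel dimension."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                 # x: [B, H*W, C], token grid of size H x W
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        # Gather the four tokens of every 2x2 neighbourhood.
        x0 = x[:, 0::2, 0::2, :]                 # top-left
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # [B, H/2, W/2, 4C]
        x = x.view(B, -1, 4 * C)                 # [B, (H/2)*(W/2), 4C]
        return self.reduction(self.norm(x))      # [B, (H/2)*(W/2), 2C]

x = torch.randn(1, 16, 96)                       # 4 x 4 token grid with C = 96
print(PatchMerging(96)(x, 4, 4).shape)           # torch.Size([1, 4, 192])
```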

Now it is time to understand how the attention mechanism works in the Swin Transformer.

2.1.2 Window Based Local Attention

Figure 11: Two Successive Swin Transformer Blocks

As shown in the visualizations above, the Transformer Block contains W-MSA and SW-MSA blocks. They are identical in architecture; the only difference is the type of attention applied to the input tokens.

Figure 12: Illustration of the global attention used in ViT. The global attention module takes each token and calculates attention scores against every token, including the selected token itself.

What distinguishes local attention from global attention is that in local attention, the tokens that attend to each other are restricted to a fixed area.

Figure 13: Illustration of local attention. Instead of calculating attention scores over every token, the area within which tokens can attend to each other is restricted.

Swin Transformer applies a modified version of local attention that consists of two processes:

  • Window Multi-Head Self Attention (W-MSA)
  • Shifted Window Multi-Head Self Attention (SW-MSA)

Figure 13 already captures the main idea of W-MSA: attention restricted to a local window.

Figure 14: A cascade of W-MSA and SW-MSA layers. Suppose the given image is [64 x 64]; then each patch is [8 x 8]. (Left) Attention is applied only between patches in the same local window; patches belonging to different local windows have no effect on the computed attention values. (Right) The window is shifted across the image, and again only patches located in the same window are used to compute attention values.

But there is an important problem with the W-MSA approach: patches located in different local windows never influence each other's attention values; in other words, information flow is restricted to the area each local window spans. To overcome this issue, a shifted-window approach is proposed. The local windows again span non-overlapping patches, but the spanned patches are not the same as those spanned in W-MSA.

Figure 15: The local window is shifted across the image while still spanning non-overlapping patches.
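To make the two attention variants more tangible, here is a simplified sketch (not the official implementation; names and sizes are illustrative) of the tensor manipulations behind them: the token grid is partitioned into fixed-size windows for W-MSA, and for SW-MSA it is first cyclically shifted with torch.roll so that the new windows straddle the previous window boundaries:

```python
import torch

def window_partition(x, window_size):
    """x: [B, H, W, C] -> [num_windows*B, window_size*window_size, C]"""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

B, H, W, C, window_size = 1, 8, 8, 32, 4
x = torch.randn(B, H, W, C)

# W-MSA: attention runs independently inside each of the 4 non-overlapping windows.
regular_windows = window_partition(x, window_size)            # [4, 16, 32]

# SW-MSA: cyclically shift the grid by half a window, then partition again,
# so tokens near the old window borders now share a window.
shifted = torch.roll(x, shifts=(-window_size // 2, -window_size // 2), dims=(1, 2))
shifted_windows = window_partition(shifted, window_size)       # [4, 16, 32]

print(regular_windows.shape, shifted_windows.shape)
```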

This concludes our discussion of the Swin Transformer. Now we will take a look at the StyleSwin generator architecture and the studies carried out to obtain state-of-the-art results.

3. StyleSwin Generator

This section investigates the generator architecture and the operations it performs in detail.

Figure 16: StyleSwin Generator Architecture

3.1 Style Injection

The latent code is sampled from a normal distribution and passed through a mapping network consisting of 8 fully connected layers. The resulting style vector is then injected twice into each transformer block (once per Adaptive Instance Normalization layer) after a learned linear mapping, hence the block named “A”, which stands for affine.

Why not feed the style vector directly instead of applying a linear mapping?

To find the answer, we first need to look at the operation performed by the Adaptive Instance Normalization (AdaIN) layer:

Equation 1: The operation performed by Adaptive Instance Normalization. The given feature map x is normalized with respect to the parameters of the style y. In other words, the mean and variance of x are channel-wise aligned to match those of y.
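Written out explicitly, the operation described in the caption is the standard AdaIN formula:

AdaIN(x, y) = σ(y) · ((x − μ(x)) / σ(x)) + μ(y)

where μ(·) and σ(·) denote the channel-wise mean and standard deviation.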

To better understand the process performed by the AdaIN layer, suppose that we have the following tensor x:

Figure 17: (Left) For the sake of understanding, consider the given tensor as x, which consists of 3 channels and has a [3 x 3] spatial dimension. (Right) This tensor consists of 3 separate feature maps, one per channel; one feature map is shown on the right.

Then, for each feature map, the mean and variance are calculated. For one feature map, the operations performed by the AdaIN layer are as follows:

Equation 2: Operations performed by Adaptive Instance Normalization. The mean and variance of each channel are calculated and then aligned with the mean and variance provided by the style vector. An important note: each channel is aligned differently, i.e., it uses a different scalar mean(y) and var(y).

Now, back to the question: why not feed the style vector directly instead of applying a linear mapping? Since each channel (feature map) is aligned differently, we need to learn a separate mean and variance for each channel from the style vector. If the style vector lives in a W-dimensional space (the notation used in the paper, see Figure 16), the affine mapping projects it to per-channel style parameters, i.e., a scale and a shift for each of the C feature-map channels.
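Here is a minimal sketch of what the affine block “A” followed by AdaIN amounts to (module names and sizes are my own, and the real StyleSwin block adapts this idea to transformer features rather than plain convolutional feature maps): the style vector is linearly mapped to one scale and one bias per channel, which then replace the per-channel statistics of the normalized input:

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization driven by a style vector w."""
    def __init__(self, style_dim, num_channels):
        super().__init__()
        # The affine block "A": maps the W-dimensional style vector to
        # one scale and one bias per feature-map channel (2 * C outputs).
        self.affine = nn.Linear(style_dim, 2 * num_channels)

    def forward(self, x, w):                          # x: [B, C, H, W], w: [B, style_dim]
        scale, bias = self.affine(w).chunk(2, dim=1)  # each [B, C]
        scale = scale[:, :, None, None]
        bias = bias[:, :, None, None]
        # Instance normalization: per-sample, per-channel mean and std of x.
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True) + 1e-6
        return scale * (x - mean) / std + bias

x = torch.randn(2, 512, 8, 8)      # feature maps
w = torch.randn(2, 512)            # style vectors (standing in for the 8-FC mapping network output)
print(AdaIN(style_dim=512, num_channels=512)(x, w).shape)  # torch.Size([2, 512, 8, 8])
```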

3.2 Double Attention

To achieve a larger receptive field and improved generation quality, the authors propose the double attention mechanism.

Figure 18: Double Attention Mechanism

In the Swin Transformer section, we saw that the basic repeated unit consists of two Swin Transformer blocks whose only difference is the type of attention applied, i.e., W-MSA vs. SW-MSA. That design requires two distinct transformer blocks, whereas in StyleSwin the process is modified so that a single transformer block uses both W-MSA and SW-MSA. Let's see how it is done.

Equation 3: The attention heads are divided into two halves: the first half computes W-MSA and the second half computes SW-MSA.

In multi-head self-attention, the N attention heads are divided into two halves: the first half computes W-MSA and the second half computes SW-MSA. The results are then concatenated, and a projection matrix is applied to the concatenated head outputs to mix them.

Equation 4: Concatenation of head outputs is multiplied by projection matrix to mix head outputs.
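Below is a simplified sketch of this idea (my own illustration; it splits channels rather than heads and omits relative position bias, attention masks, and other details of the real block): one half attends within regular windows, the other half within shifted windows, and a final projection mixes the concatenated outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def window_attention(x, window_size, shift=0):
    """Very simplified single-head attention inside (optionally shifted) local windows.
    x: [B, H, W, C] -> [B, H, W, C]"""
    B, H, W, C = x.shape
    if shift:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    # Partition into non-overlapping windows of window_size x window_size tokens.
    w = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    w = w.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)
    attn = F.softmax(w @ w.transpose(1, 2) / C ** 0.5, dim=-1)   # [nW*B, N, N]
    out = attn @ w                                               # [nW*B, N, C]
    # Reverse the window partition.
    out = out.view(B, H // window_size, W // window_size, window_size, window_size, C)
    out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
    if shift:
        out = torch.roll(out, shifts=(shift, shift), dims=(1, 2))
    return out

class DoubleAttention(nn.Module):
    def __init__(self, dim, window_size=4):
        super().__init__()
        self.window_size = window_size
        self.proj = nn.Linear(dim, dim)   # mixes the two halves after concatenation

    def forward(self, x):                              # x: [B, H, W, C]
        x1, x2 = x.chunk(2, dim=-1)                    # split the channels in half
        y1 = window_attention(x1, self.window_size)                              # W-MSA half
        y2 = window_attention(x2, self.window_size, shift=self.window_size // 2) # SW-MSA half
        return self.proj(torch.cat([y1, y2], dim=-1))

x = torch.randn(1, 8, 8, 64)
print(DoubleAttention(64)(x).shape)  # torch.Size([1, 8, 8, 64])
```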

4. Blocking Artifacts

The generator architecture in Figure 16 yields state-of-the-art results on 256 x 256 images, but when it comes to synthesizing 1024 x 1024 images, blocking artifacts appear.

Figure 19: Blocking artifacts become obvious on 1024 × 1024 resolution. These artifacts correlate with the window size of local attentions.

The authors attribute these artifacts to the window-based attention mechanism.

“Hence, we are certain it is the window-wise processing that breaks the spatial coherency and causes the blocking artifacts.”

To strengthen this reasoning, they applied window-based attention to 1D continuous data with strided windows, using a single attention head and a random projection matrix, and observed the following:

Figure 20: A 1D example illustrates that the window-wise local attention causes blocking artifacts. (a) Input continuous signal along with partitioning windows. (b) Output discontinuous signal after window-wise attention.
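This toy experiment is easy to re-create in spirit (the snippet below is my own minimal reconstruction, not the authors' script): applying softmax attention with a random projection independently inside each window of a smooth 1D signal typically produces jumps exactly at the window boundaries.

```python
import math
import torch

torch.manual_seed(0)
n, window = 256, 32
t = torch.linspace(0, 4 * math.pi, n)
signal = torch.sin(t).unsqueeze(-1)                 # smooth 1D input, shape [n, 1]

proj = torch.randn(1, 1)                            # random projection (single head, dim 1)
out = torch.empty_like(signal)
for start in range(0, n, window):                   # window-wise attention, no overlap
    x = signal[start:start + window]                # [window, 1]
    attn = torch.softmax(x @ x.T, dim=-1)           # attention restricted to this window
    out[start:start + window] = attn @ x @ proj

# 'out' is smooth inside each window but tends to jump at multiples of `window`,
# the 1D analogue of the blocking artifacts seen in generated images.
print(out[window - 2:window + 2].flatten())
```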

4.1 Wavelet Discriminator

In the Discriminator Overview section, we pointed out that the discriminator has a tremendous effect on training stability, which is why its architecture is convolution-based rather than transformer-based.

Figure 21: (a) Images with blocking artifacts. (b) The artifacts with periodic patterns can be clearly distinguished in the spectrum. (c) The spectrum of artifact-free images derived from the sliding window inference.

The blocking artifacts can be clearly seen in the Fourier spectrum, which shows that they follow a periodic pattern.

Figure 22: Wavelet Discriminator Architecture

The wavelet discriminator deals with blocking artifacts remarkably well. Above you see the proposed discriminator architecture: the discriminator downsamples the input image with convolutions and, at each stage, checks the frequency discrepancy relative to real images after a discrete wavelet transform (DWT).
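As an illustration of the building block involved, here is a minimal single-level Haar DWT sketch (my own simplified version; the filter convention and normalization are illustrative, and the actual discriminator applies such decompositions at every downsampling stage): it splits an image into a low-frequency band and three high-frequency detail bands, which is where periodic blocking artifacts stand out.

```python
import torch
import torch.nn.functional as F

def haar_dwt(x):
    """Single-level 2D Haar wavelet transform.
    x: [B, C, H, W] -> (LL, LH, HL, HH), each [B, C, H/2, W/2]."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[-0.5, -0.5], [0.5, 0.5]])
    hl = torch.tensor([[-0.5, 0.5], [-0.5, 0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    filters = torch.stack([ll, lh, hl, hh]).unsqueeze(1)        # [4, 1, 2, 2]
    B, C, H, W = x.shape
    x = x.reshape(B * C, 1, H, W)
    bands = F.conv2d(x, filters, stride=2)                      # [B*C, 4, H/2, W/2]
    bands = bands.reshape(B, C, 4, H // 2, W // 2)
    return bands[:, :, 0], bands[:, :, 1], bands[:, :, 2], bands[:, :, 3]

image = torch.randn(1, 3, 256, 256)
ll, lh, hl, hh = haar_dwt(image)
print(ll.shape, hh.shape)   # torch.Size([1, 3, 128, 128]) each
```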

5. Conclusion

Our discussion of StyleSwin ends here. In this blog post, we reviewed the StyleSwin paper, which obtains state-of-the-art results on the high-resolution image generation task. Please follow the Arxiv link to read the paper; for the official implementation, you can follow the Github link to the source code.

Hope you enjoyed reading. As always, I welcome your feedback, and if you have further questions, I can be reached on Twitter & LinkedIn. In the following posts, I will continue to review CVPR 2022 papers.

