The FPN of Transformers: Swin Transformer

YoungTimes

Published 2023-09-01 08:57:43


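Moving the Transformer from NLP to vision requires solving two main problems: 1) Scale. Objects of the same kind can appear at different sizes within a single image. 2) Resolution. Images contain so many pixels that applying a Transformer to them directly is too computationally expensive.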

“Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.”

Swin Transformer vs. ViT

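In ViT, the Transformer produces feature maps at a single, fixed resolution. Because Self-Attention is computed over the entire image, its computational complexity grows quadratically with the input image size.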

Swin Transformer computes Self-Attention only within each local window, and every local window has a fixed size, so its computational complexity is linear in the input image size.
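To make the quadratic-versus-linear contrast concrete, here is a rough sketch based on the complexity estimates given in the Swin Transformer paper for a token map of h × w patches with C channels: global attention costs about 4hwC² + 2(hw)²C, and window attention with M × M windows costs about 4hwC² + 2M²hwC. The function names are made up for illustration, and M = 7 and C = 96 are simply the paper's default settings used as example values.

```python
def msa_flops(h, w, C):
    """Global self-attention over an h x w token map (ViT-style estimate)."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C


def wmsa_flops(h, w, C, M=7):
    """Self-attention restricted to non-overlapping M x M windows (Swin-style estimate)."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C


C = 96  # example channel width, not prescribed by this article
for side in (56, 112, 224):
    global_cost = msa_flops(side, side, C)
    window_cost = wmsa_flops(side, side, C)
    print(f"{side}x{side} tokens: global={global_cost:.3e}  "
          f"windowed={window_cost:.3e}  ratio={global_cost / window_cost:.1f}")
```

Doubling the side length quadruples the number of tokens. The windowed cost then grows by exactly 4×, while the global cost grows by more than 4× because of the (hw)² term.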

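The pooling-like Patch Merging operation between layers, together with the local window shift operation within a layer, gives Swin Transformer an FPN-like ability to model objects at multiple scales, both locally and globally.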
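The effect of the window shift can be illustrated with a small sketch that only tracks which window each token belongs to. It assumes a cyclic shift of the token map (as in the paper's efficient implementation) by half a window. The helper name `window_ids` and the 8 × 8 map with M = 4 are illustrative choices, not taken from this article.

```python
def window_ids(h, w, M, shift=0):
    """Return an h x w grid giving the window index of every token after the
    token map has been cyclically shifted by `shift` positions up and left."""
    windows_per_row = w // M
    return [
        [(((i - shift) % h) // M) * windows_per_row + ((j - shift) % w) // M
         for j in range(w)]
        for i in range(h)
    ]


regular = window_ids(8, 8, M=4)            # plain window partition
shifted = window_ids(8, 8, M=4, shift=2)   # shifted by M // 2

# Tokens (3, 3) and (4, 4) are neighbours but sit in different regular windows.
print(regular[3][3], regular[4][4])
# After the shift they share a window, so attention can connect them.
print(shifted[3][3], shifted[4][4])
```

Alternating the two partitions across consecutive layers lets information cross window boundaries without ever attending over the whole image. Note that the cyclic wrap-around also groups tokens from opposite image borders; the paper masks attention between such tokens, which this sketch does not model.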

Overall Architecture

The network architecture of Swin Transformer is shown in the figure below.

First, the image (H×W×3) is split into small patches. In the paper each patch is 4×4, so the partitioned patches have shape (H/4, W/4, 48), where 48 = 4×4×3.

Next, the partitioned image passes through a Linear Embedding Layer, which changes the shape to (H/4, W/4, C).
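The two steps above can be sketched with plain nested Python lists so that the shapes are easy to follow. This is only an illustration: the function names, the tiny image size, and the randomly initialised weights are assumptions, and a real model would use a tensor library.

```python
import random


def partition_patches(image, patch=4):
    """Split an H x W x 3 nested list into an (H/patch) x (W/patch) grid of
    flattened patches, each of length patch * patch * 3."""
    H, W = len(image), len(image[0])
    assert H % patch == 0 and W % patch == 0, "H and W must be divisible by patch"
    grid = []
    for i in range(0, H, patch):
        row = []
        for j in range(0, W, patch):
            token = [
                value
                for di in range(patch)
                for dj in range(patch)
                for value in image[i + di][j + dj]
            ]
            row.append(token)
        grid.append(row)
    return grid


def linear_embed(grid, weight, bias):
    """Apply the same linear map (48 -> C) to every patch token.
    `weight` holds C rows, each with one coefficient per input feature."""
    def project(token):
        return [
            sum(t * w for t, w in zip(token, weight_row)) + b
            for weight_row, b in zip(weight, bias)
        ]
    return [[project(token) for token in row] for row in grid]


random.seed(0)
H, W, C = 8, 12, 6  # tiny sizes chosen only for illustration
image = [[[random.random() for _ in range(3)] for _ in range(W)] for _ in range(H)]
weight = [[random.gauss(0, 0.02) for _ in range(48)] for _ in range(C)]
bias = [0.0] * C

patches = partition_patches(image)
tokens = linear_embed(patches, weight, bias)
print(len(patches), len(patches[0]), len(patches[0][0]))  # 2 3 48
print(len(tokens), len(tokens[0]), len(tokens[0][0]))     # 2 3 6
```

In practice, the partition and the projection are often fused into a single convolution whose kernel size and stride both equal the patch size, which yields the same result.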

Then, after being processed by Swin Transformer Blocks, the features go through a Patch Merging operation before entering the next block.
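As a quick sanity check on how the token map changes through the network, the sketch below traces the shape stage by stage. It assumes the four-stage layout of the paper, where each Patch Merging step halves the height and width and doubles the channels. The helper name `stage_shapes` and the 224×224 input with C = 96 are example values, not fixed by this article.

```python
def stage_shapes(H, W, C, num_stages=4, patch=4):
    """Return the (height, width, channels) of the token map at each stage."""
    h, w, c = H // patch, W // patch, C   # after patch partition + linear embedding
    shapes = [(h, w, c)]
    for _ in range(num_stages - 1):
        h, w, c = h // 2, w // 2, c * 2   # one Patch Merging step
        shapes.append((h, w, c))
    return shapes


for index, shape in enumerate(stage_shapes(224, 224, 96), start=1):
    print(f"stage {index}: {shape}")
```

The resulting pyramid of resolutions (1/4, 1/8, 1/16, and 1/32 of the input) is what makes the backbone usable in place of a CNN feature hierarchy.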

Patch Merging

Patch Merging works as follows:

“patch merging layer concatenates the features of each group of 2 × 2 neighboring patches, and applies a linear layer on the 4C-dimensional concatenated features. This reduces the number of tokens by a multiple of 2×2 = 4 (2× downsampling of resolution), and the output dimension is set to 2C.”
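The quoted description translates fairly directly into code. The following is a minimal sketch using nested Python lists. The function name `merge_patches`, the order in which the four neighbors are concatenated, the bias-free linear layer, and the averaging weights are all assumptions made for illustration, and any normalization that an implementation applies around the linear layer is left out.

```python
def merge_patches(x, weight):
    """x: H x W x C nested list; weight: 2C rows, each of length 4C.
    Returns an (H/2) x (W/2) x 2C nested list."""
    H, W = len(x), len(x[0])
    assert H % 2 == 0 and W % 2 == 0, "H and W must be even"
    out = []
    for i in range(0, H, 2):
        row = []
        for j in range(0, W, 2):
            # Concatenate the 2 x 2 neighbours into one 4C-dimensional vector.
            cat = x[i][j] + x[i + 1][j] + x[i][j + 1] + x[i + 1][j + 1]
            # Linear layer 4C -> 2C (no bias in this sketch).
            row.append([sum(a * b for a, b in zip(cat, weight_row))
                        for weight_row in weight])
        out.append(row)
    return out


C = 3
x = [[[float(i * 4 + j)] * C for j in range(4)] for i in range(4)]  # 4 x 4 x 3 map
weight = [[1.0 / (4 * C)] * (4 * C) for _ in range(2 * C)]          # illustrative weights
y = merge_patches(x, weight)

print(len(x) * len(x[0]), "->", len(y) * len(y[0]))  # 16 -> 4 tokens
print(len(y[0][0]))                                  # 6, i.e. 2C channels
```

The token count drops by a factor of 4 while the channel width only doubles, so each merge halves the total number of activations, much as a stride-2 downsampling stage does in a CNN.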