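This article walks through the Vision Transformer (ViT) as described in "An Image is Worth 16x16 Words", including open-source code and conceptual explanations of each component. All code is written in Python using PyTorch.
So what is a Vision Transformer? As introduced in "Attention is All You Need", transformers are a type of machine learning model that uses attention as the primary learning mechanism. They quickly became the state of the art for sequence-to-sequence tasks such as language translation.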
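"An Image is Worth 16x16 Words" successfully modified the transformer proposed in [1] to solve image classification tasks, creating the Vision Transformer (ViT). Like the transformer in [1], the ViT is based on attention. However, whereas transformers for NLP tasks consist of two attention branches, an encoder and a decoder, the ViT uses only an encoder. The output of the encoder is then passed to a neural network "head" that makes a prediction.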
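"Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet" attempts to remove the need for large-scale pretraining, which the original ViT relies on to perform well, by introducing a novel preprocessing method that transforms an input image into a series of tokens. See that paper for more on the method. This article focuses on the ViT as implemented in "An Image is Worth 16x16 Words".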
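This article follows the model structure outlined in "An Image is Worth 16x16 Words". The official code released with that paper is written in JAX rather than PyTorch, so it is not used here. Code from the more recent "Tokens-to-Token ViT" is available on GitHub. The Tokens-to-Token ViT (T2T-ViT) model prepends a Tokens-to-Token (T2T) module to a vanilla ViT backbone. The code in this article is based on the ViT components in the "Tokens-to-Token ViT" GitHub code. Modifications made for this article include allowing non-square input images and removing the dropout layers.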
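The first step of the ViT is to create tokens from the input image. Transformers operate on a sequence of tokens; in NLP, this is commonly a sentence of words. For computer vision, it is less clear how to segment the input into tokens. The ViT converts an image to tokens such that each token represents a local area, or patch, of the image. The authors describe reshaping an image of height H, width W, and C channels into N tokens with patch size P, where N = HW / P².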
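Each token has length P²·C. Let's use the pixel art "Mountain at Dusk" by Luis Zuno as an example of patch tokenization. The original artwork has been cropped and converted to a single-channel image, which means that each pixel is a single value between zero and one. Single-channel images are typically displayed in grayscale, but we'll display it in a purple color scheme because it is easier to see. Note that patch tokenization is not included in the code associated with [3].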
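import os
import typing
import numpy as np
import matplotlib.pyplot as plt
import timm
import torch
import torch.nn as nn
# Type alias used in the module signatures below
NoneFloat = typing.Union[None, float]
# figure_path is assumed to be the directory containing mountains.npy
mountains = np.load(os.path.join(figure_path, 'mountains.npy'))
H = mountains.shape[0]
W = mountains.shape[1]
print('Mountain at Dusk is H =', H, 'and W =', W, 'pixels.')
fig = plt.figure(figsize=(10,6))
plt.imshow(mountains, cmap='Purples_r')
plt.xticks(np.arange(-0.5, W+1, 10), labels=np.arange(0, W+1, 10))
plt.yticks(np.arange(-0.5, H+1, 10), labels=np.arange(0, H+1, 10))
plt.clim([0,1])
# Draw the colorbar in its own axes to the right of the image
cbar_ax = fig.add_axes([0.95, .11, 0.05, 0.77])
plt.colorbar(cax=cbar_ax);
#plt.savefig(os.path.join(figure_path, 'mountains.png'))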
Mountain at Dusk is H = 60 and W = 100 pixels.
P = 20
N = int((H*W)/(P**2))
print('There will be', N, 'patches, each', P, 'by', str(P)+'.')
fig = plt.figure(figsize=(10,6))
plt.imshow(mountains, cmap='Purples_r')
# Draw the patch boundaries
plt.hlines(np.arange(P, H, P)-0.5, -0.5, W-0.5, color='w')
plt.vlines(np.arange(P, W, P)-0.5, -0.5, H-0.5, color='w')
plt.xticks(np.arange(-0.5, W+1, 10), labels=np.arange(0, W+1, 10))
plt.yticks(np.arange(-0.5, H+1, 10), labels=np.arange(0, H+1, 10))
# Label each patch with its 1-indexed number, in row-major order
x_text = np.tile(np.arange(9.5, W, P), 3)
y_text = np.repeat(np.arange(9.5, H, P), 5)
for i in range(1, N+1):
    plt.text(x_text[i-1], y_text[i-1], str(i), color='w', fontsize='xx-large', ha='center')
# Redraw the label of patch 3 in black so it is visible on the light background
plt.text(x_text[2], y_text[2], str(3), color='k', fontsize='xx-large', ha='center');
#plt.savefig(os.path.join(figure_path, 'mountain_patches.png'), bbox_inches='tight')
There will be 15 patches, each 20 by 20.
print('Each patch will make a token of length', str(P**2)+'.')
patch12 = mountains[40:60, 20:40]
token12 = patch12.reshape(1, P**2)
fig = plt.figure(figsize=(10,1))
plt.imshow(token12, aspect=10, cmap='Purples_r')
plt.xticks(np.arange(-0.5, 401, 50), labels=np.arange(0, 401, 50))
#plt.savefig(os.path.join(figure_path, 'mountain_token12.png'), bbox_inches='tight')
Each patch will make a token of length 400.
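The reshape from an H × W × C image to N tokens of length P²·C can also be written directly in NumPy. The following is a minimal sketch, not the article's implementation: the `patchify` helper is a hypothetical name, and a random array stands in for the Mountain at Dusk image, which is not loaded here. It flattens each patch in (row, column, channel) order, following the H × W × C layout described above.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image sides must be divisible by P"
    # (H, W, C) -> (H/P, P, W/P, P, C): separate patch index from in-patch offset
    grid = image.reshape(H // P, P, W // P, P, C)
    # -> (H/P, W/P, P, P, C): put the two patch-grid indices first (row-major)
    grid = grid.transpose(0, 2, 1, 3, 4)
    # -> (N, P*P*C): one row per patch
    return grid.reshape(-1, P * P * C)

# Random stand-in for the single-channel 60 x 100 image (an assumption)
rng = np.random.default_rng(0)
image = rng.random((60, 100, 1))
tokens = patchify(image, patch_size=20)
print(tokens.shape)  # (15, 400)

# Patch 12 (1-indexed) is the third row, second column of the patch grid
patch12 = image[40:60, 20:40, 0]
print(np.array_equal(tokens[11], patch12.reshape(-1)))  # True

# A 224 x 224 RGB image with P = 16 gives 196 tokens of length 768
print(patchify(np.zeros((224, 224, 3)), 16).shape)  # (196, 768)
```

For images with more than one channel, the order of values inside each token differs between libraries (PyTorch's `nn.Unfold` is channel-first), but the number of tokens and the token length are the same.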
class Patch_Tokenization(nn.Module):
    def __init__(self,
                img_size: tuple[int, int, int]=(1, 60, 100),
                patch_size: int=20,
                token_len: int=768):
        """ Patch Tokenization Module
            Args:
                img_size (tuple[int, int, int]): size of input (channels, height, width)
                patch_size (int): the side length of a square patch
                token_len (int): desired length of an output token
        """
        super().__init__()
        ## Defining Parameters
        self.img_size = img_size
        C, H, W = self.img_size
        self.patch_size = patch_size
        self.token_len = token_len
        assert H % self.patch_size == 0, 'Height of image must be evenly divisible by patch size.'
        assert W % self.patch_size == 0, 'Width of image must be evenly divisible by patch size.'
        self.num_tokens = (H // self.patch_size) * (W // self.patch_size)
        ## Defining Layers
        self.split = nn.Unfold(kernel_size=self.patch_size, stride=self.patch_size, padding=0)
        self.project = nn.Linear((self.patch_size**2)*C, token_len)
    def forward(self, x):
        ## (batch, C, H, W) -> (batch, C*P*P, N) -> (batch, N, C*P*P)
        x = self.split(x).transpose(2,1)
        ## (batch, N, C*P*P) -> (batch, N, token_len)
        x = self.project(x)
        return x
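To see why the split step is followed by a transpose, it can help to look at the shape `nn.Unfold` returns. The sketch below assumes a random tensor in place of the real image and checks that each unfolded column is one flattened patch.

```python
import torch
import torch.nn as nn

P = 20
x = torch.rand(1, 1, 60, 100)           # (batch, channels, height, width)

split = nn.Unfold(kernel_size=P, stride=P, padding=0)
cols = split(x)                          # (batch, C*P*P, N): one column per patch
tokens = cols.transpose(2, 1)            # (batch, N, C*P*P): one row per patch
print(cols.shape, tokens.shape)

# Token 12 (1-indexed) is the third patch row, second patch column
manual = x[0, 0, 40:60, 20:40].reshape(-1)
print(torch.equal(tokens[0, 11], manual))

# The linear projection maps each 400-value token to the desired token length
project = nn.Linear(P * P * 1, 768)
print(project(tokens).shape)
```

Because `nn.Unfold` places the patch index last, the transpose is what turns the result into a batch of token sequences that `nn.Linear` can project token by token.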
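We'll run an example of this code using our cropped, single-channel version of Mountain at Dusk⁴. We should see the same number of tokens and the same initial token size as before. We'll use token_len=768 as the projected length, which is the size used by the base variant of ViT².
The first line in the code block below changes the datatype of Mountain at Dusk⁴ from a NumPy array to a Torch tensor. We also have to unsqueeze⁶ the tensor to create a channel dimension and a batch size dimension. As above, we have one channel. Since there is only one image, the batch size is 1.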
x = torch.from_numpy(mountains).unsqueeze(0).unsqueeze(0).to(torch.float32)
token_len = 768
print('Input dimensions are\n\tbatchsize:', x.shape[0], '\n\tnumber of input channels:', x.shape[1], '\n\timage size:', (x.shape[2], x.shape[3]))
# Define the Module
patch_tokens = Patch_Tokenization(img_size=(x.shape[1], x.shape[2], x.shape[3]),
                                    patch_size = P,
                                    token_len = token_len)
Input dimensions are
    batchsize: 1
    number of input channels: 1
    image size: (60, 100)
# Split the image into patches and flatten each patch into a token
x = patch_tokens.split(x).transpose(2,1)
print('After patch tokenization, dimensions are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:', x.shape[1], '\n\ttoken length:', x.shape[2])
After patch tokenization, dimensions are
    batchsize: 1
    number of tokens: 15
    token length: 400
# Project each token to the desired token length
x = patch_tokens.project(x)
print('After projection, dimensions are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:', x.shape[1], '\n\ttoken length:', x.shape[2])
After projection, dimensions are
    batchsize: 1
    number of tokens: 15
    token length: 768
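The first step is to prepend a blank token, called the Prediction Token, to the image tokens. This token will be used at the output of the encoding blocks to make a prediction. It starts off blank (equivalently, zero) so that it can gain information from the other image tokens.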
# Define an Input
num_tokens = 175
token_len = 768
batch = 13
x = torch.rand(batch, num_tokens, token_len)
print('Input dimensions are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:',
      x.shape[1], '\n\ttoken length:', x.shape[2])
# Prepend a Prediction Token
pred_token = torch.zeros(1, 1, token_len).expand(batch, -1, -1)
print('Prediction Token dimensions are\n\tbatchsize:', pred_token.shape[0], '\n\tnumber of tokens:', pred_token.shape[1], '\n\ttoken length:', pred_token.shape[2])
x = torch.cat((pred_token, x), dim=1)
print('Dimensions with Prediction Token are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:', x.shape[1], '\n\ttoken length:', x.shape[2])
Input dimensions are
    batchsize: 13
    number of tokens: 175
    token length: 768
Prediction Token dimensions are
    batchsize: 13
    number of tokens: 1
    token length: 768
Dimensions with Prediction Token are
    batchsize: 13
    number of tokens: 176
    token length: 768
def get_sinusoid_encoding(num_tokens, token_len):
    """ Make Sinusoid Encoding Table
        Args:
            num_tokens (int): number of tokens
            token_len (int): length of a token
        Returns:
            (torch.FloatTensor) sinusoidal position encoding table
    """
    def get_position_angle_vec(i):
        ## Angle of position i for every index j of the token
        return [i / np.power(10000, 2 * (j // 2) / token_len) for j in range(token_len)]
    sinusoid_table = np.array([get_position_angle_vec(i) for i in range(num_tokens)])
    ## Even indices use sine, odd indices use cosine
    sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])
    sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])
    return torch.FloatTensor(sinusoid_table).unsqueeze(0)
PE = get_sinusoid_encoding(num_tokens+1, token_len)
print('Position embedding dimensions are\n\tnumber of tokens:', PE.shape[1], '\n\ttoken length:', PE.shape[2])
x = x + PE
print('Dimensions with Position Embedding are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:', x.shape[1], '\n\ttoken length:', x.shape[2])
Position embedding dimensions are
    number of tokens: 176
    token length: 768
Dimensions with Position Embedding are
    batchsize: 13
    number of tokens: 176
    token length: 768
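The same table can be built without Python loops. The following is a sketch of a vectorized variant under the same formula, where entry (i, j) has angle i / 10000^(2⌊j/2⌋ / token_len), with sine on even indices and cosine on odd indices. The function name `sinusoid_encoding` is hypothetical and not part of the article's code.

```python
import torch

def sinusoid_encoding(num_tokens: int, token_len: int) -> torch.Tensor:
    """Vectorized sinusoidal table of shape (1, num_tokens, token_len)."""
    pos = torch.arange(num_tokens, dtype=torch.float64).unsqueeze(1)   # (num_tokens, 1)
    j = torch.arange(token_len)                                        # (token_len,)
    # Indices 2k and 2k+1 share the exponent 2k / token_len
    exponent = (2 * torch.div(j, 2, rounding_mode='floor')).to(torch.float64) / token_len
    angle = pos / torch.pow(10000.0, exponent)                         # (num_tokens, token_len)
    table = torch.empty(num_tokens, token_len, dtype=torch.float64)
    table[:, 0::2] = torch.sin(angle[:, 0::2])
    table[:, 1::2] = torch.cos(angle[:, 1::2])
    return table.float().unsqueeze(0)

PE = sinusoid_encoding(176, 768)
print(PE.shape)         # torch.Size([1, 176, 768])
print(PE[0, 0, :4])     # tensor([0., 1., 0., 1.])
```

Position 0 alternates between sin(0) = 0 and cos(0) = 1, and every entry lies in [-1, 1]. The leading dimension of 1 lets the table broadcast across the batch when it is added to the tokens.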
class Encoding(nn.Module):
    def __init__(self,
                dim: int,
                num_heads: int=1,
                hidden_chan_mul: float=4.,
                qkv_bias: bool=False,
                qk_scale: NoneFloat=None,
                act_layer=nn.GELU,
                norm_layer=nn.LayerNorm):
        """ Encoding Block
            Args:
                dim (int): size of a single token
                num_heads(int): number of attention heads in MSA
                hidden_chan_mul (float): multiplier to determine the number of hidden channels (features) in the NeuralNet component
                qkv_bias (bool): determines if the qkv layer learns an additive bias
                qk_scale (NoneFloat): value to scale the queries and keys by;
                                    if None, queries and keys are scaled by ``head_dim ** -0.5``
                act_layer(nn.modules.activation): torch neural network layer class to use as activation
                norm_layer(nn.modules.normalization): torch neural network layer class to use as normalization
        """
        super().__init__()
        ## Define Layers
        self.norm1 = norm_layer(dim)
        ## Attention (multi-head self-attention) is not defined in this article;
        ## the keyword arguments below are assumed from the parameter descriptions
        self.attn = Attention(dim=dim,
                            num_heads=num_heads,
                            qkv_bias=qkv_bias,
                            qk_scale=qk_scale)
        self.norm2 = norm_layer(dim)
        self.neuralnet = NeuralNet(in_chan=dim,
                                hidden_chan=int(dim*hidden_chan_mul),
                                out_chan=dim,
                                act_layer=act_layer)
    def forward(self, x):
        ## Norm -> attention -> residual connection
        x = x + self.attn(self.norm1(x))
        ## Norm -> neural network -> residual connection
        x = x + self.neuralnet(self.norm2(x))
        return x
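The num_heads, qkv_bias, and qk_scale parameters define the Attention module components. A deep dive into attention for vision transformers is left for another time.
The hidden_chan_mul and act_layer parameters define the Neural Network module components. The activation layer can be any torch.nn.modules.activation⁷ layer. We'll look more at the Neural Network module later.
The norm_layer can be chosen from any torch.nn.modules.normalization⁸ layer.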
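The Attention module itself is not shown in this article. As a rough stand-in, the sketch below implements standard multi-head self-attention that accepts the num_heads, qkv_bias, and qk_scale parameters described above. It is an assumption about the interface, not the author's module, and it omits dropout.

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Minimal multi-head self-attention (illustrative stand-in)."""
    def __init__(self, dim, num_heads=1, qkv_bias=False, qk_scale=None):
        super().__init__()
        assert dim % num_heads == 0, 'dim must be divisible by num_heads'
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # If no scale is given, fall back to head_dim ** -0.5
        self.scale = qk_scale or head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, D = x.shape
        # (B, N, 3*D) -> (3, B, heads, N, head_dim)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, D // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale      # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)   # merge the heads
        return self.proj(out)

attn = Attention(dim=768, num_heads=4)
x = torch.rand(13, 176, 768)
print(attn(x).shape)   # torch.Size([13, 176, 768])
```

The head dimension only exists inside `forward`; the input and output are both (batch, tokens, token length), which is why the attention heads do not appear in the shapes printed for the encoding block.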
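We'll now step through each blue block in the encoding block diagram and its accompanying code. We'll use 176 tokens of length 768. We'll use a batch size of 13 because it is prime and won't be confused with any of the other parameters. We'll use 4 attention heads because that evenly divides the token length; however, you won't see the attention head dimension in the encoding block.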
# Define an Input
num_tokens = 176
token_len = 768
batch = 13
heads = 4
x = torch.rand(batch, num_tokens, token_len)
print('Input dimensions are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:', x.shape[1], '\n\ttoken length:', x.shape[2])
# Define the Module
E = Encoding(dim=token_len, num_heads=heads, hidden_chan_mul=1.5, qkv_bias=False, qk_scale=None, act_layer=nn.GELU, norm_layer=nn.LayerNorm)
E.eval();
Input dimensions are
    batchsize: 13
    number of tokens: 176
    token length: 768
y = E.norm1(x)
print('After norm, dimensions are\n\tbatchsize:', y.shape[0], '\n\tnumber of tokens:', y.shape[1], '\n\ttoken size:', y.shape[2])
y = E.attn(y)
print('After attention, dimensions are\n\tbatchsize:', y.shape[0], '\n\tnumber of tokens:', y.shape[1], '\n\ttoken size:', y.shape[2])
y = y + x
print('After split connection, dimensions are\n\tbatchsize:', y.shape[0], '\n\tnumber of tokens:', y.shape[1], '\n\ttoken size:', y.shape[2])
After norm, dimensions are
    batchsize: 13
    number of tokens: 176
    token size: 768
After attention, dimensions are
    batchsize: 13
    number of tokens: 176
    token size: 768
After split connection, dimensions are
    batchsize: 13
    number of tokens: 176
    token size: 768
z = E.norm2(y)
print('After norm, dimensions are\n\tbatchsize:', z.shape[0], '\n\tnumber of tokens:', z.shape[1], '\n\ttoken size:', z.shape[2])
z = E.neuralnet(z)
print('After neural net, dimensions are\n\tbatchsize:', z.shape[0], '\n\tnumber of tokens:', z.shape[1], '\n\ttoken size:', z.shape[2])
z = z + y
print('After split connection, dimensions are\n\tbatchsize:', z.shape[0], '\n\tnumber of tokens:', z.shape[1], '\n\ttoken size:', z.shape[2])
After norm, dimensions are
    batchsize: 13
    number of tokens: 176
    token size: 768
After neural net, dimensions are
    batchsize: 13
    number of tokens: 176
    token size: 768
After split connection, dimensions are
    batchsize: 13
    number of tokens: 176
    token size: 768
class NeuralNet(nn.Module):
    def __init__(self,
                in_chan: int,
                hidden_chan: NoneFloat=None,
                out_chan: NoneFloat=None,
                act_layer = nn.GELU):
        """ Neural Network Module
            Args:
                in_chan (int): number of channels (features) at input
                hidden_chan (NoneFloat): number of channels (features) in the hidden layer;
                                        if None, number of channels in hidden layer is the same as the number of input channels
                out_chan (NoneFloat): number of channels (features) at output;
                                        if None, number of output channels is same as the number of input channels
                act_layer(nn.modules.activation): torch neural network layer class to use as activation
        """
        super().__init__()
        ## Define Number of Channels
        hidden_chan = hidden_chan or in_chan
        out_chan = out_chan or in_chan
        ## Define Layers
        self.fc1 = nn.Linear(in_chan, hidden_chan)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_chan, out_chan)
    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.fc2(x)
        return x
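The module above is two linear layers with an activation in between, so its effect on shapes and its parameter count are easy to check. The sketch below uses an equivalent `nn.Sequential`, with the token length 768 and hidden_chan_mul of 1.5 from the encoding block example as assumed values.

```python
import torch
import torch.nn as nn

dim = 768
hidden = int(dim * 1.5)          # hidden_chan_mul = 1.5 gives 1152 hidden features

mlp = nn.Sequential(
    nn.Linear(dim, hidden),      # plays the role of fc1
    nn.GELU(),                   # plays the role of act
    nn.Linear(hidden, dim),      # plays the role of fc2
)

x = torch.rand(13, 176, dim)
print(mlp(x).shape)              # torch.Size([13, 176, 768])

# Weights and biases of both layers: 768*1152 + 1152 + 1152*768 + 768
n_params = sum(p.numel() for p in mlp.parameters())
print(n_params)                  # 1771392
```

`nn.Linear` acts on the last dimension only, so every token in every batch element passes through the same weights and the number of tokens is unchanged.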
# Define an Input
num_tokens = 176
token_len = 768
batch = 1
x = torch.rand(batch, num_tokens, token_len)
print('Input dimensions are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:', x.shape[1], '\n\ttoken length:', x.shape[2])
Input dimensions are
    batchsize: 1
    number of tokens: 176
    token length: 768
# Normalize the tokens
norm = nn.LayerNorm(token_len)
x = norm(x)
print('After norm, dimensions are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:', x.shape[1], '\n\ttoken size:', x.shape[2])
After norm, dimensions are
    batchsize: 1
    number of tokens: 176
    token size: 768
# Extract the Prediction Token, which is the first token in the sequence
pred_token = x[:, 0]
print('Length of prediction token:', pred_token.shape[-1])
Length of prediction token: 768
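Finally, the prediction token is passed to the head to make a prediction. The head is usually some variety of neural network and varies with the model. In An Image is Worth 16x16 Words², they use an MLP (multilayer perceptron) with one hidden layer during pretraining and a single linear layer during fine-tuning. In Tokens-to-Token ViT³, they use a single linear layer as the head. This example uses an output shape of 1 to represent a single estimated regression value.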
head = nn.Linear(token_len, 1)
pred = head(pred_token)
print('Length of prediction:', (pred.shape[0], pred.shape[1]))
print('Prediction:', float(pred))
Length of prediction: (1, 1)
Prediction: -0.5474240779876709
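For comparison, the two kinds of head mentioned above can be sketched side by side. The hidden width, the Tanh activation, and the 10 output classes below are illustrative assumptions; the text only specifies that the MLP head has one hidden layer.

```python
import torch
import torch.nn as nn

token_len, num_classes = 768, 10       # num_classes is an arbitrary example value

# Single linear layer as the head
linear_head = nn.Linear(token_len, num_classes)

# MLP head with one hidden layer
mlp_head = nn.Sequential(
    nn.Linear(token_len, token_len),   # hidden width = token_len is an assumption
    nn.Tanh(),                         # activation choice is an assumption
    nn.Linear(token_len, num_classes),
)

pred_token = torch.rand(1, token_len)
print(linear_head(pred_token).shape)   # torch.Size([1, 10])
print(mlp_head(pred_token).shape)      # torch.Size([1, 10])
```

Either head consumes only the prediction token, so the number of image tokens does not affect the head's size.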
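To create the complete ViT module, we use the Patch Tokenization module defined above and the ViT Backbone module. The ViT Backbone is defined below and contains the Token Processing, Encoding Blocks, and Prediction Processing components.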
class ViT_Backbone(nn.Module):
    def __init__(self,
                num_tokens: int,
                preds: int=1,
                token_len: int=768,
                num_heads: int=1,
                Encoding_hidden_chan_mul: float=4.,
                depth: int=12,
                qkv_bias=False,
                qk_scale=None,
                act_layer=nn.GELU,
                norm_layer=nn.LayerNorm):
        """ VisTransformer Backbone
            Args:
                num_tokens (int): number of image tokens, not counting the class token
                preds (int): number of predictions to output
                token_len (int): length of a token
                num_heads(int): number of attention heads in MSA
                Encoding_hidden_chan_mul (float): multiplier to determine the number of hidden channels (features) in the NeuralNet component of the Encoding Module
                depth (int): number of encoding blocks in the model
                qkv_bias (bool): determines if the qkv layer learns an additive bias
                qk_scale (NoneFloat): value to scale the queries and keys by;
                                    if None, queries and keys are scaled by ``head_dim ** -0.5``
                act_layer(nn.modules.activation): torch neural network layer class to use as activation
                norm_layer(nn.modules.normalization): torch neural network layer class to use as normalization
        """
        super().__init__()
        ## Defining Parameters
        self.num_tokens = num_tokens
        self.token_len = token_len
        self.num_heads = num_heads
        self.Encoding_hidden_chan_mul = Encoding_hidden_chan_mul
        self.depth = depth
        ## Defining Token Processing Components
        self.cls_token = nn.Parameter(torch.zeros(1, 1, self.token_len))
        self.pos_embed = nn.Parameter(data=get_sinusoid_encoding(num_tokens=self.num_tokens+1, token_len=self.token_len), requires_grad=False)
        ## Defining Encoding blocks
        self.blocks = nn.ModuleList([Encoding(dim = self.token_len,
                                               num_heads = self.num_heads,
                                               hidden_chan_mul = self.Encoding_hidden_chan_mul,
                                               qkv_bias = qkv_bias,
                                               qk_scale = qk_scale,
                                               act_layer = act_layer,
                                               norm_layer = norm_layer)
                                    for i in range(self.depth)])
        ## Defining Prediction Processing
        self.norm = norm_layer(self.token_len)
        self.head = nn.Linear(self.token_len, preds)
        ## Make the class token sampled from a truncated normal distribution
        timm.layers.trunc_normal_(self.cls_token, std=.02)
    def forward(self, x):
        ## Assumes x is already tokenized
        ## Get Batch Size
        B = x.shape[0]
        ## Concatenate Class Token
        x = torch.cat((self.cls_token.expand(B, -1, -1), x), dim=1)
        ## Add Positional Embedding
        x = x + self.pos_embed
        ## Run Through Encoding Blocks
        for blk in self.blocks:
            x = blk(x)
        ## Take Norm
        x = self.norm(x)
        ## Make Prediction on Class Token
        x = self.head(x[:, 0])
        return x
From the ViT Backbone module, we can define the full ViT model.
class ViT_Model(nn.Module):
    def __init__(self,
                img_size: tuple[int, int, int]=(1, 400, 100),
                patch_size: int=50,
                token_len: int=768,
                preds: int=1,
                num_heads: int=1,
                Encoding_hidden_chan_mul: float=4.,
                depth: int=12,
                qkv_bias=False,
                qk_scale=None,
                act_layer=nn.GELU,
                norm_layer=nn.LayerNorm):
        """ VisTransformer Model
            Args:
                img_size (tuple[int, int, int]): size of input (channels, height, width)
                patch_size (int): the side length of a square patch
                token_len (int): desired length of an output token
                preds (int): number of predictions to output
                num_heads(int): number of attention heads in MSA
                Encoding_hidden_chan_mul (float): multiplier to determine the number of hidden channels (features) in the NeuralNet component of the Encoding Module
                depth (int): number of encoding blocks in the model
                qkv_bias (bool): determines if the qkv layer learns an additive bias
                qk_scale (NoneFloat): value to scale the queries and keys by;
                                    if None, queries and keys are scaled by ``head_dim ** -0.5``
                act_layer(nn.modules.activation): torch neural network layer class to use as activation
                norm_layer(nn.modules.normalization): torch neural network layer class to use as normalization
        """
        super().__init__()
        ## Defining Parameters
        self.img_size = img_size
        C, H, W = self.img_size
        self.patch_size = patch_size
        self.token_len = token_len
        self.num_heads = num_heads
        self.Encoding_hidden_chan_mul = Encoding_hidden_chan_mul
        self.depth = depth
        ## Defining Patch Embedding Module
        self.patch_tokens = Patch_Tokenization(img_size,
                                                patch_size,
                                                token_len)
        ## Defining ViT Backbone
        self.backbone = ViT_Backbone(num_tokens=int(self.patch_tokens.num_tokens),
                                    preds=preds,
                                    token_len=self.token_len,
                                    num_heads=self.num_heads,
                                    Encoding_hidden_chan_mul=self.Encoding_hidden_chan_mul,
                                    depth=self.depth,
                                    qkv_bias=qkv_bias,
                                    qk_scale=qk_scale,
                                    act_layer=act_layer,
                                    norm_layer=norm_layer)
        ## Initialize the Weights
        self.apply(self._init_weights)
    def _init_weights(self, m):
        """ Initialize the weights of the linear layers & the layernorms
        """
        ## For Linear Layers
        if isinstance(m, nn.Linear):
            ## Weights are initialized from a truncated normal distribution
            timm.layers.trunc_normal_(m.weight, std=.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                ## If bias is present, bias is initialized at zero
                nn.init.constant_(m.bias, 0)
        ## For Layernorm Layers
        elif isinstance(m, nn.LayerNorm):
            ## Weights are initialized at one
            nn.init.constant_(m.weight, 1.0)
            ## Bias is initialized at zero
            nn.init.constant_(m.bias, 0)
    @torch.jit.ignore ##Tell pytorch to not compile as TorchScript
    def no_weight_decay(self):
        """ Used in Optimizer to ignore weight decay in the class token
        """
        return {'cls_token'}
    def forward(self, x):
        x = self.patch_tokens(x)
        x = self.backbone(x)
        return x
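In the ViT Model, the img_size, patch_size, and token_len parameters define the Patch Tokenization module: the size of the input image, the side length of the patches it is split into, and the length of each resulting token. This module is how the ViT turns an image into a sequence of tokens that the model can process. The num_heads parameter sets the number of heads in multi-head attention; Encoding_hidden_chan_mul scales the number of hidden channels in the neural network of each encoding block; qkv_bias determines whether the query, key, and value layer learns an additive bias, and qk_scale sets the scaling applied to the queries and keys; and act_layer is the activation layer, which can be any torch.nn.modules.activation layer. Finally, the depth parameter determines how many encoding blocks the model contains.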
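To see the whole data flow in one runnable piece, the sketch below assembles a small ViT-style model from PyTorch's built-in `nn.TransformerEncoderLayer` in place of the article's Encoding blocks. This is a substitution for illustration, not the article's implementation: the class name `TinyViT` is hypothetical, the token length and depth are reduced to keep it fast, and it uses a learned position embedding (as in the original ViT paper) rather than the fixed sinusoidal table.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """End-to-end ViT-style sketch built on torch's built-in encoder layers."""
    def __init__(self, img_size=(1, 400, 100), patch_size=50, token_len=64,
                 preds=1, num_heads=4, hidden_chan_mul=4., depth=2):
        super().__init__()
        C, H, W = img_size
        assert H % patch_size == 0 and W % patch_size == 0
        self.num_tokens = (H // patch_size) * (W // patch_size)
        # Patch tokenization: split into patches, then project each token
        self.split = nn.Unfold(kernel_size=patch_size, stride=patch_size)
        self.project = nn.Linear(patch_size ** 2 * C, token_len)
        # Token processing: prediction (class) token and position embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, token_len))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_tokens + 1, token_len))
        # Encoding blocks: pre-norm layers with GELU and no dropout
        layer = nn.TransformerEncoderLayer(
            d_model=token_len, nhead=num_heads,
            dim_feedforward=int(token_len * hidden_chan_mul),
            dropout=0.0, activation='gelu', batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Prediction processing: final norm and linear head
        self.norm = nn.LayerNorm(token_len)
        self.head = nn.Linear(token_len, preds)

    def forward(self, x):
        x = self.project(self.split(x).transpose(2, 1))
        x = torch.cat((self.cls_token.expand(x.shape[0], -1, -1), x), dim=1)
        x = x + self.pos_embed
        x = self.blocks(x)
        return self.head(self.norm(x)[:, 0])

model = TinyViT().eval()
with torch.no_grad():
    out = model(torch.rand(3, 1, 400, 100))
print(model.num_tokens, out.shape)   # 16 torch.Size([3, 1])
```

With the default 400 × 100 image and a patch size of 50, the image yields 8 × 2 = 16 tokens, and the model returns one prediction per image in the batch.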