🔴 Conventional detectors simply stack spatial and frequency features over the spatio-temporal volume, and therefore cannot effectively detect pixel-level temporal artifacts.
🟢 To target temporal artifacts, the authors propose a pixel-wise temporal frequency spectrum, computed by applying a 1D Fourier transform along the time axis. To better localize the regions where temporal unnaturalness frequently appears, they propose a weakly supervised Attention Proposal Module (APM) to identify regions of interest.
The authors observe that in forged videos the eye region exhibits fluctuation, and argue that focusing on a limited set of regions likely to contain inconsistencies makes temporal artifacts easier to detect.
In preliminary experiments, the authors trained a ResNet-18 classifier on the temporal spectrum and compared it against spatial-temporal models, finding that different forgery methods produce fluctuations in different regions.
⚠️ The official implementation has not been open-sourced, so all code in this article is my own implementation!
⛔ Note: all code here is at the testing stage and has not been validated in real applications!
The proposed architecture consists of a frequency feature extractor and a joint transformer module. The frequency feature extractor computes the temporal spectrum and localizes regions of interest; the joint transformer module fuses the temporal spectra of the multiple regions of interest.
Step 1
The authors first apply a median filter to the video clip to remove the dominant component: $I = \mathrm{gray}(I - \mathrm{median}(I))$. They then apply a 1D Fourier transform to each pixel along the temporal dimension; since every value is real, the symmetric half of the magnitude spectrum is discarded, yielding a temporal spectrum of shape $(\frac{T}{2} \times H \times W)$.
import torch
from kornia.filters import median_blur

def preprocess_video(video: torch.Tensor):
    """
    video: shape (T, H, W, C), C=3, RGB
    return: frequency representation of shape (T//2, H, W)
    """
    T, H, W, C = video.shape
    assert C == 3, "Input must be RGB"
    video = video.to(torch.float32)
    # Step 1: remove the dominant component with a 3x3 median filter
    frames = video.permute(0, 3, 1, 2)                   # (T, C, H, W): frames act as a batch
    filtered = median_blur(frames, (3, 3))               # batched 3x3 median filter
    video_sub = (frames - filtered).permute(0, 2, 3, 1)  # back to (T, H, W, C)
    # Step 2: grayscale (ITU-R BT.601 weights)
    gray_video = (
        0.299 * video_sub[..., 0] +
        0.587 * video_sub[..., 1] +
        0.114 * video_sub[..., 2]
    )  # shape: (T, H, W)
    # Step 3: FFT along time (dim=0)
    fft_result = torch.fft.fft(gray_video, dim=0)        # (T, H, W), complex
    freq_rep = torch.abs(fft_result[:T // 2])            # real input => symmetric spectrum, keep first half
    return freq_rep

# ==== Test ====
video = torch.randn((32, 224, 224, 3), dtype=torch.float32).cuda()
freq_features = preprocess_video(video)
print(freq_features.shape)  # torch.Size([16, 224, 224])
Step 2
After obtaining the temporal spectrum, the authors feed it into a 2D convolutional network (ResNet-50 in the paper), keeping the output feature map of every block as well as the final block's output.
import torch.nn as nn
from torchvision.models import resnet50

class ResNetBackbone(nn.Module):
    def __init__(self, input_channels=16):
        super().__init__()
        resnet = resnet50(weights=None)
        # replace the stem so it accepts the T/2-channel temporal spectrum
        resnet.conv1 = nn.Conv2d(input_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.layer0 = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1 = resnet.layer1  # 56x56
        self.layer2 = resnet.layer2  # 28x28
        self.layer3 = resnet.layer3  # 14x14
        self.layer4 = resnet.layer4  # 7x7

    def forward(self, x):
        z0 = []
        x = self.layer0(x); z0.append(x)
        x = self.layer1(x); z0.append(x)
        x = self.layer2(x); z0.append(x)
        x = self.layer3(x); z0.append(x)
        x = self.layer4(x); z0.append(x)
        return x, z0  # final feature map + list of per-block feature maps

ResNet2d = ResNetBackbone(16).cuda()
Z0, z = ResNet2d(freq_features.unsqueeze(0))  # add the batch dimension
print(Z0.shape)  # torch.Size([1, 2048, 7, 7])
Step 3
The authors take the feature map from each ResNet block (the list z in the code above) and design the APM, which proposes five regions of interest from the temporal spectrum $F_0$ and each block's output feature map: $[A, B] = \mathrm{APM}(F_0, z_0^{(i)})$.
APM generates coordinate masks for the five regions and extracts the corresponding temporal spectra $F_p = F_0 \odot M_p$, where $M_p$ encodes the center position of each region.
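Since the official APM is not released, here is a minimal personal sketch under explicit assumptions: a $1 \times 1$ convolution on the block feature map predicts one attention map per part, the soft-argmax of each map gives the part center, and a fixed $88 \times 88$ patch is hard-cropped from $F_0$ around that center (i.e., the mask $M_p$ is realized as a crop window). The class name APM here and all of these design choices are my own guesses, not the paper's confirmed method.

import torch
import torch.nn as nn

class APM(nn.Module):
    """Hypothetical sketch of the Attention Proposal Module (official code unreleased)."""
    def __init__(self, in_channels, num_parts=5, patch_size=88):
        super().__init__()
        self.num_parts = num_parts
        self.patch_size = patch_size
        self.att = nn.Conv2d(in_channels, num_parts, kernel_size=1)  # one attention map per part

    def forward(self, F0, z0_i):
        # F0: (C0, H0, W0) temporal spectrum; z0_i: (B, C, h, w) block feature map
        B, _, h, w = z0_i.shape
        A = self.att(z0_i).flatten(2).softmax(dim=-1).view(B, self.num_parts, h, w)
        # soft-argmax: attention-weighted average of normalized coordinates
        ys = torch.linspace(0, 1, h, device=A.device).view(1, 1, h, 1)
        xs = torch.linspace(0, 1, w, device=A.device).view(1, 1, 1, w)
        cy = (A * ys).sum(dim=(2, 3))  # (B, num_parts)
        cx = (A * xs).sum(dim=(2, 3))
        # hard-crop a patch around each center, clamped to the image bounds
        # (assumes B == 1 for simplicity)
        C0, H0, W0 = F0.shape
        half = self.patch_size // 2
        patches = []
        for p in range(self.num_parts):
            y = int(cy[0, p].item() * H0); x = int(cx[0, p].item() * W0)
            y = max(half, min(H0 - half, y)); x = max(half, min(W0 - half, x))
            patches.append(F0[:, y - half:y + half, x - half:x + half])
        return torch.stack(patches)  # (num_parts, C0, patch_size, patch_size)

# usage, continuing the running example:
# apm = APM(2048).cuda(); patches = apm(freq_features, Z0)  -> (5, 16, 88, 88)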
patches = torch.randn((5, 16, 88, 88)).cuda()  # simulated stand-in for the APM output
👉 Each of these patch spectra is then fed into a weight-shared ResNet-50, producing local frequency features $Z_p$. Note that the spatial dimensions of $Z_p$ differ from those of $Z_0$ (see the table and the sanity check below):
Block | Global $Z_0$ (224×224 input) | Patch $Z_p$ (88×88 input)
---|---|---
Layer0 | 112×112 | 44×44
Layer1 | 56×56 | 22×22
Layer2 | 28×28 | 11×11
Layer3 | 14×14 | 5×5
Layer4 | 7×7 | 2×2
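As a quick sanity check, the simulated patches can be pushed through the same ResNet2d instance (reusing one instance is what weight sharing means here). Note that torchvision's default padding yields $3 \times 3$ at Layer4 for an $88 \times 88$ input, slightly different from the $2 \times 2$ in the table above:

# Weight sharing: reuse the same backbone instance for the patch branch.
Zp, zp_blocks = ResNet2d(patches)  # the 5 patches act as a batch
print(Zp.shape)  # torch.Size([5, 2048, 3, 3]) with torchvision padding (the paper's table reports 2×2)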
Step 1
The global frequency feature $Z_0$ and the local frequency features $Z_p$ each pass through a $1 \times 1$ convolution and are summed: $\mathrm{Conv}_{1\times1}(z_0^{(i)}) + \sum_p \mathrm{Conv}_{1\times1}(z_p^{(i)})$. The result then passes through two more $1 \times 1$ convolution layers with ReLU activation; the fusion convolutions are zero-initialized (in my implementation, only the last $1 \times 1$ layer) so the fused branch initially contributes nothing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvFuser(nn.Module):
    def __init__(self, channels, num_parts=5):
        super().__init__()
        self.C = channels
        self.num_parts = num_parts
        # 1×1 conv for z₀⁽ⁱ⁾
        self.conv0 = nn.Conv2d(self.C, self.C, kernel_size=1)
        # 1×1 conv for each zₚ⁽ⁱ⁾
        self.convp = nn.ModuleList([
            nn.Conv2d(self.C, self.C, kernel_size=1) for _ in range(self.num_parts)
        ])
        # Convf: two 1×1 convs, reduce channels then restore; second layer zero-initialized
        mid = self.C // 2
        self.convf1 = nn.Conv2d(self.C, mid, kernel_size=1)
        self.convf2 = nn.Conv2d(mid, self.C, kernel_size=1)
        nn.init.zeros_(self.convf2.weight)
        nn.init.zeros_(self.convf2.bias)

    def forward(self, z0_i, zp_i_list):
        """
        z0_i: tensor of shape (B, C, H, W)
        zp_i_list: list of 5 tensors, each of shape (B, C, h, w)
        """
        B, C, H, W = z0_i.shape
        assert len(zp_i_list) == self.num_parts
        # 1×1 conv on the global feature
        z0_conv = self.conv0(z0_i)
        # upsample each local feature to the global size, then 1×1 conv
        zp_conv_sum = 0
        for i in range(self.num_parts):
            zp_up = F.interpolate(zp_i_list[i], size=(H, W), mode='bilinear', align_corners=False)
            zp_conv_sum += self.convp[i](zp_up)
        fused = z0_conv + zp_conv_sum  # shape: (B, C, H, W)
        # apply Convf: two 1×1 convs with ReLU in between
        out = self.convf2(F.relu(self.convf1(fused)))
        return out  # shape: (B, C, H, W)

fuser = ConvFuser(channels=2048)
z0_i = torch.randn(1, 2048, 7, 7)  # global frequency feature
zp_i_list = [torch.randn(1, 2048, 2, 2) for _ in range(5)]  # 5 local features (F_p -> shared ResNet)
# fuse global and local features
z_fused = fuser(z0_i, zp_i_list)  # shape: (1, 2048, 7, 7)
print(z_fused.shape)
Step 2
The original video frames are fed into a 3D ResNet, and z_fused is injected into the outputs of its blocks, producing the blended feature $Z^+$.
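The open materials do not spell out how z_fused is injected, so the following is a hypothetical minimal sketch: project the fused 2D frequency feature to the 3D block's channel count with a $1 \times 1$ convolution, resize it to the block's spatial size, broadcast it along the temporal axis, and add it to the block output. The class name BlendInto3D and all shapes are assumptions of mine.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BlendInto3D(nn.Module):
    """Hypothetical sketch: inject the fused 2D frequency feature into a 3D ResNet block output."""
    def __init__(self, freq_channels, video_channels):
        super().__init__()
        self.proj = nn.Conv2d(freq_channels, video_channels, kernel_size=1)

    def forward(self, video_feat, z_fused):
        # video_feat: (B, C, T, H, W) 3D ResNet block output
        # z_fused:    (B, C_f, h, w) fused frequency feature
        B, C, T, H, W = video_feat.shape
        zf = self.proj(z_fused)                                            # (B, C, h, w)
        zf = F.interpolate(zf, size=(H, W), mode='bilinear', align_corners=False)
        return video_feat + zf.unsqueeze(2)                                # broadcast over T

# e.g. blend a simulated layer-3 output of a 3D ResNet-50 with a fused feature
blend = BlendInto3D(freq_channels=2048, video_channels=1024)
video_feat = torch.randn(2, 1024, 16, 14, 14)  # simulated 3D block output
Z_plus = blend(video_feat, torch.randn(2, 2048, 7, 7))
print(Z_plus.shape)  # torch.Size([2, 1024, 16, 14, 14])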
Step 3
To better combine global and local relations, the authors take the last-block outputs of the 2D and 3D networks and design a Spatial Transformer Encoder (STE) that combines the local frequency features with spatial artifact features, and a Temporal Transformer Encoder (TTE) that combines the temporal frequency features with temporal context, enabling a more comprehensive understanding of temporal artifacts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearProjector(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):  # x: (B, C, H, W)
        B, C, H, W = x.shape
        x = x.view(B, C, -1).permute(0, 2, 1)  # (B, H*W, C)
        return self.linear(x)  # (B, H*W, out_dim)

class TransformerEncoderBlock(nn.Module):
    def __init__(self, dim, num_heads=8, dropout=0.1):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim,
            nhead=num_heads,
            dim_feedforward=dim * 4,
            dropout=dropout,
            activation='relu',
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)

    def forward(self, x):  # (B, N, C)
        return self.encoder(x)

class FinalIntegrationHead(nn.Module):
    def __init__(self, dim_spatial=1024, dim_temporal=1024, num_classes=1):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(dim_spatial + dim_temporal, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes)
        )

    def forward(self, f_spatial, f_temporal):  # (B, C), (B, C)
        fused = torch.cat([f_spatial, f_temporal], dim=1)
        return self.classifier(fused)  # (B, num_classes)

class TransformerIntegrationModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj_z0 = LinearProjector(in_dim=2048, out_dim=1024)
        self.proj_zp = LinearProjector(in_dim=2048, out_dim=1024)
        self.ste = TransformerEncoderBlock(dim=1024)
        self.tte = TransformerEncoderBlock(dim=1024)
        self.head = FinalIntegrationHead()

    def forward(self, Z_plus4, z0_4, zp4_list):
        B = Z_plus4.size(0)
        # --- Temporal Transformer (TTE) ---
        z0_proj = self.proj_z0(z0_4)  # (B, 49, 1024)
        # tokens over (T, H, W) with channels last
        zplus_flat = Z_plus4.flatten(2).permute(0, 2, 1)  # (B, 16*14*14, 1024)
        tte_input = torch.cat([zplus_flat, z0_proj], dim=1)  # (B, N+49, 1024)
        f_temporal = self.tte(tte_input).mean(dim=1)  # (B, 1024)
        # --- Spatial Transformer (STE) ---
        zp_proj_list = [self.proj_zp(zp) for zp in zp4_list]  # each: (B, h*w, 1024)
        zp_cat = torch.cat(zp_proj_list, dim=1)  # (B, 5*h*w, 1024)
        ste_input = torch.cat([zplus_flat, zp_cat], dim=1)  # (B, N+5*h*w, 1024)
        f_spatial = self.ste(ste_input).mean(dim=1)  # (B, 1024)
        # --- Final Classifier ---
        y_hat = self.head(f_spatial, f_temporal)  # (B, 1)
        return y_hat

# ==== Simulated inputs ====
B = 2  # batch size
z0_4 = torch.randn(B, 2048, 7, 7)  # global frequency feature (Layer4)
zp4_list = [torch.randn(B, 2048, 2, 2) for _ in range(5)]  # 5 local features (patch-branch Layer4, 2×2 per the table)
Z_plus4 = torch.randn(B, 1024, 16, 14, 14)  # blended 3D ResNet feature
# ==== Build the model and run a forward pass ====
model = TransformerIntegrationModule()
y_hat = model(Z_plus4, z0_4, zp4_list)
print("y_hat shape:", y_hat.shape)  # should be (B, 1)