[Paper Deep-Read | Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection]

Original post by 九年义务漏网鲨鱼, published 2025-07-15 21:46:43, in the column "论文精读" (Paper Deep-Reads).

Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection

🔴 Traditional detectors simply stack spatial and frequency features over the spatio-temporal dimensions, and as a result fail to effectively detect pixel-level temporal artifacts.

🟢 Targeting temporal artifacts, the authors propose using the pixel-wise temporal spectrum, obtained by applying a 1D Fourier transform along the time axis. To better localize the regions where unnatural temporal changes frequently occur, they further propose a weakly supervised approach that identifies regions of interest with an Attention Proposal Module (APM).

Preliminary Analysis

The authors find that in forged videos the eye region exhibits noticeable fluctuations, and argue that temporal artifacts can be detected more effectively by focusing on the limited regions that are likely to contain inconsistencies.

In a preliminary experiment, the authors used a ResNet-18 classifier with the temporal spectrum as input and compared it against a spatial-temporal model; they also found that different forgery synthesis methods produce fluctuations in different regions.
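
The post does not give this baseline's exact configuration, so the following is only a minimal sketch of such a spectrum-based classifier, assuming a ResNet-18 whose first convolution takes the T/2 = 16 spectrum channels and a single real/fake logit; the class name and every hyperparameter here are my own assumptions.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class SpectrumResNet18(nn.Module):
    """Binary real/fake classifier that takes the (T/2, H, W) temporal spectrum as input."""
    def __init__(self, in_channels: int = 16):
        super().__init__()
        self.backbone = resnet18(weights=None)
        # accept the T/2 spectrum channels instead of 3 RGB channels
        self.backbone.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # single logit for real vs. fake
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)

    def forward(self, x):  # x: (B, T/2, H, W)
        return self.backbone(x)

clf = SpectrumResNet18(in_channels=16)
logits = clf(torch.randn(2, 16, 224, 224))
print(logits.shape)  # torch.Size([2, 1])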

Proposed Method

⚠️ The official code has not been released, so all the code in this post is my own implementation!

⛔ Please note: all of this code is still at the testing stage and has not been used in any real application!

The proposed architecture consists of a frequency feature extractor and a joint transformer module: the frequency feature extractor computes the temporal spectrum and localizes the regions of interest, while the joint transformer module fuses the temporal spectra of the multiple regions of interest.

3.1 Frequency Feature Extraction

Step 1: The authors first process the video clip with a median filter to remove the dominant component: $I=\text{gray}(I-\text{median}(I))$. They then apply a 1D Fourier transform along the time dimension at every pixel; since all values are real, the symmetric half of the magnitude spectrum is discarded, giving a temporal spectrum of size $\frac{T}{2} \times H \times W$.

import torch
import numpy as np
from kornia.filters import median_blur

def preprocess_video(video: torch.Tensor):
    """
    video: shape (T, H, W, C), C=3, RGB
    return: frequency representation of shape (T//2, H, W)
    """
    T, H, W, C = video.shape
    assert C == 3, "Input must be RGB"
    video = video.to(torch.float32)

    filtered_video = torch.zeros_like(video)
    for t in range(T):
        frame = video[t, :, :, :].permute(2, 0, 1)  # (C, H, W)
        filtered_frame = median_blur(frame.unsqueeze(0), kernel_size=(3, 3))  # 3x3 median filter
        filtered_video[t, :, :, :] = filtered_frame[0].permute(1, 2, 0)  # back to (H, W, C)

    video_sub = video - filtered_video # (T, H, W, C)

    # Step 3: grayscale
    gray_video = (
        0.299 * video_sub[..., 0] +
        0.587 * video_sub[..., 1] +
        0.114 * video_sub[..., 2]
    )  # shape: (T, H, W)

    # Step 4: FFT along time (dim=0)
    fft_result = torch.fft.fft(gray_video, dim=0)  # (T, H, W), complex
    freq_rep = torch.abs(fft_result[:T // 2])  # keep first half (real spectrum)
    return freq_rep
# ==== Test ====
video = torch.randn((32, 224, 224, 3), dtype=torch.float16).cuda()
video = video.to(torch.float32)

freq_features = preprocess_video(video)
print(freq_features.shape)  # torch.Size([16, 224, 224])

Step 2: The temporal spectrum is then fed into a 2D convolutional network (the paper uses ResNet-50), which outputs the feature map of every block as well as that of the final block.

import torch.nn as nn
from torchvision.models import resnet50

class ResNetBackbone(nn.Module):
    def __init__(self, input_channels=16):
        super().__init__()
        resnet = resnet50(pretrained=False)
        resnet.conv1 = nn.Conv2d(input_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.layer0 = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1 = resnet.layer1  # 56x56
        self.layer2 = resnet.layer2  # 28x28
        self.layer3 = resnet.layer3  # 14x14
        self.layer4 = resnet.layer4  # 7x7

    def forward(self, x):
        z0 = []
        x = self.layer0(x); z0.append(x)
        x = self.layer1(x); z0.append(x)
        x = self.layer2(x); z0.append(x)
        x = self.layer3(x); z0.append(x)
        x = self.layer4(x); z0.append(x)
        return x, z0  # final feature map and the list of per-block feature maps

ResNet2d = ResNetBackbone(16).cuda()
Z0, z = ResNet2d(freq_features.unsqueeze(0))  # add a batch dimension: (1, 16, 224, 224)
print(Z0.shape)  # torch.Size([1, 2048, 7, 7])

Step 3: The authors take the feature map from every ResNet block (the z in the previous code block) and design the APM module, which extracts five regions of interest from the temporal spectrum $F_0$ and each block's output feature map: $[A, B] = \text{APM}(F_0, z_0^{(i)})$.

APM generates coordinate masks for the five regions, and the regions of interest are cut out of the temporal spectrum as $F_p = F_0 \odot M_p$, where $M_p$ encodes the center location of each region.
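
The APM itself is not released either, and in the paper it is learned with weak supervision; purely to illustrate the cropping mechanics, here is a minimal hand-crafted sketch that channel-averages a block feature map into an attention map, upsamples it to the spectrum resolution, and cuts a fixed 88×88 window around each of the top-5 peaks. The class name, the peak-picking rule, and the absence of any learning are all my own assumptions, not the paper's method.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionProposalSketch(nn.Module):
    """Hand-crafted stand-in for APM: propose P fixed-size regions of interest from a block feature map."""
    def __init__(self, num_regions: int = 5, patch_size: int = 88):
        super().__init__()
        self.num_regions = num_regions
        self.patch_size = patch_size

    def forward(self, F0: torch.Tensor, z0_i: torch.Tensor):
        """
        F0:   (C_f, H, W) temporal spectrum, e.g. (16, 224, 224)
        z0_i: (C, h, w)   feature map of one ResNet block
        returns: patches (P, C_f, patch_size, patch_size) and the P center coordinates
        """
        _, H, W = F0.shape
        half = self.patch_size // 2
        # channel-averaged attention map, upsampled to the spectrum resolution
        attn = z0_i.mean(dim=0, keepdim=True).unsqueeze(0)  # (1, 1, h, w)
        attn = F.interpolate(attn, size=(H, W), mode='bilinear', align_corners=False)[0, 0]
        # restrict the peak search so the crop window stays inside the spectrum
        valid = attn[half:H - half, half:W - half]
        top_idx = valid.flatten().topk(self.num_regions).indices
        patches, centers = [], []
        for idx in top_idx:
            # note: a real APM would suppress near-duplicate peaks; omitted in this sketch
            cy = idx.item() // (W - 2 * half) + half
            cx = idx.item() % (W - 2 * half) + half
            centers.append((cy, cx))
            patches.append(F0[:, cy - half:cy + half, cx - half:cx + half])
        return torch.stack(patches), centers

apm = AttentionProposalSketch()
F0 = torch.randn(16, 224, 224)        # temporal spectrum from Step 1
z0_block = torch.randn(256, 56, 56)   # one intermediate ResNet feature map
patches_demo, centers = apm(F0, z0_block)
print(patches_demo.shape)  # torch.Size([5, 16, 88, 88])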

  • Since the patch features are the output of an intermediate module, we simply define them randomly for now. The paper sets the patch size to 88, so $F_p$ is randomly initialized as:
patches = torch.randn((5,16,88,88), dtype=torch.float16).cuda()

👉 Each of the patch features above is then fed into a weight-shared ResNet-50 to obtain the local frequency features $Z_p$. Note that the dimensions of $Z_p$ differ from those of $Z_0$, as summarized in the table below (a small weight-sharing sketch follows the table).

| Block | F₀ output size (224×224 input) | Fₚ output size (88×88 input) |
| --- | --- | --- |
| Layer0 | 112×112 | 44×44 |
| Layer1 | 56×56 | 22×22 |
| Layer2 | 28×28 | 11×11 |
| Layer3 | 14×14 | 5×5 |
| Layer4 | 7×7 | 2×2 |
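
As a quick check of the weight sharing (and of the patch-branch column above), one can simply reuse the same ResNetBackbone instance defined earlier on the randomly initialized patches; note that with torchvision's ResNet-50 the exact output sizes can differ slightly from the table because of rounding in the strided layers.

# reuse the same backbone instance so the convolution weights are shared across all patches
Zp, zp_blocks = ResNet2d(patches.float())   # cast the float16 demo patches to float32
print(Zp.shape)        # torch.Size([5, 2048, 3, 3]) here, vs. the 2x2 reported in the table
print(len(zp_blocks))  # 5 intermediate feature maps, one per row of the table above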

3.2 Joint Transformer

Step 1: The global frequency feature $Z_0$ and the local frequency features $Z_p$ each pass through a $1 \times 1$ convolution and are summed, with the convolution kernels initialized to zero: $\sum \text{Conv}_{1 \times 1}(z_{i}) + \text{Conv}(Z_p)$. The result is then passed through two more $1 \times 1$ convolution layers with ReLU activations.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvFuser(nn.Module):
    def __init__(self, channels, num_parts=5):
        super().__init__()
        self.C = channels
        self.num_parts = num_parts

        # 1×1 conv for z₀⁽ⁱ⁾
        self.conv0 = nn.Conv2d(self.C, self.C, kernel_size=1)

        # 1×1 conv for each zₚ⁽ⁱ⁾
        self.convp = nn.ModuleList([
            nn.Conv2d(self.C, self.C, kernel_size=1) for _ in range(self.num_parts)
        ])

        # Convf: two 1x1 conv layers; reduce channels in the middle, then restore; second layer zero-initialized
        mid = self.C // 2
        self.convf1 = nn.Conv2d(self.C, mid, kernel_size=1)
        self.convf2 = nn.Conv2d(mid, self.C, kernel_size=1)
        nn.init.zeros_(self.convf2.weight)
        nn.init.zeros_(self.convf2.bias)

    def forward(self, z0_i, zp_i_list):
        """
        z0_i: tensor of shape (B, C, H, W)
        zp_i_list: list of 5 tensors, each of shape (B, C, h, w)
        """
        B, C, H, W = z0_i.shape
        assert len(zp_i_list) == self.num_parts

        # Conv0 on global
        z0_conv = self.conv0(z0_i)

        # Interpolate + Conv on each zp
        zp_conv_sum = 0
        for i in range(self.num_parts):
            zp = zp_i_list[i]
            zp_up = F.interpolate(zp, size=(H, W), mode='bilinear', align_corners=False)
            zp_conv = self.convp[i](zp_up)
            zp_conv_sum += zp_conv

        fused = z0_conv + zp_conv_sum  # shape: (B, C, H, W)

        # Apply Convf: 2-layer conv + ReLU
        out = self.convf2(F.relu(self.convf1(fused)))

        return out  # shape: (B, C, H, W)
fuser = ConvFuser(channels=2048)

z0_i = torch.randn(1, 2048, 7, 7)  # global
zp_i_list = [torch.randn(1, 2048, 2, 2) for _ in range(5)]  # 5 local features (Fp -> shared ResNet)

# fuse global and local frequency features
z_fused = fuser(z0_i, zp_i_list)  # shape: (1, 2048, 7, 7)
print(z_fused.shape)

Step 2: The raw video frames are passed through a 3D ResNet, and the fused feature z_fused is added to the outputs of its different blocks, producing the blended feature $Z^+$.
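
The paper does not spell out how the 2D fused map enters the 3D stream; a minimal sketch, assuming z_fused is resized to the block's spatial resolution, broadcast along the temporal axis, and added element-wise (the function name and all shapes below are my assumptions):

import torch
import torch.nn.functional as F

def blend_features(z3d: torch.Tensor, z2d: torch.Tensor) -> torch.Tensor:
    """
    z3d: (B, C, T, H, W) output of one 3D-ResNet block
    z2d: (B, C, h, w)    fused 2D frequency feature from ConvFuser
    Returns the blended feature Z^+ with the same shape as z3d.
    """
    B, C, T, H, W = z3d.shape
    # match the spatial resolution of the 3D block output
    z2d = F.interpolate(z2d, size=(H, W), mode='bilinear', align_corners=False)
    # broadcast along the temporal axis and add element-wise
    return z3d + z2d.unsqueeze(2)

# toy example with the shapes used in the next code block (channels assumed to already match)
z3d = torch.randn(1, 1024, 16, 14, 14)
z2d = torch.randn(1, 1024, 7, 7)
Z_plus = blend_features(z3d, z2d)
print(Z_plus.shape)  # torch.Size([1, 1024, 16, 14, 14])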

Step 3: To better combine global and local relations, the authors build on the last-block outputs of the 2D and 3D networks: the STE combines the local frequency features with the spatial artifact features, while a Temporal Transformer Encoder (TTE) combines the temporal frequency features with temporal context, promoting a more comprehensive understanding of temporal artifacts.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearProjector(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):  # x: (B, C, H, W)
        B, C, H, W = x.shape
        x = x.view(B, C, -1).permute(0, 2, 1)  # (B, H*W, C)
        return self.linear(x)  # (B, H*W, out_dim)


class TransformerEncoderBlock(nn.Module):
    def __init__(self, dim, num_heads=8, dropout=0.1):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim,
            nhead=num_heads,
            dim_feedforward=dim * 4,
            dropout=dropout,
            activation='relu',
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)

    def forward(self, x):  # (B, N, C)
        return self.encoder(x)


class FinalIntegrationHead(nn.Module):
    def __init__(self, dim_spatial=1024, dim_temporal=1024, num_classes=1):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(dim_spatial + dim_temporal, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes)
        )

    def forward(self, f_spatial, f_temporal):  # (B, C), (B, C)
        fused = torch.cat([f_spatial, f_temporal], dim=1)
        return self.classifier(fused)  # (B, num_classes)


class TransformerIntegrationModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj_z0 = LinearProjector(in_dim=2048, out_dim=1024)
        self.proj_zp = LinearProjector(in_dim=2048, out_dim=1024)
        self.ste = TransformerEncoderBlock(dim=1024)
        self.tte = TransformerEncoderBlock(dim=1024)
        self.head = FinalIntegrationHead()

    def forward(self, Z_plus4, z0_4, zp4_list):
        B = Z_plus4.size(0)

        # --- Temporal Transformer ---
        z0_proj = self.proj_z0(z0_4)  # (B, 49, 1024)
        zplus_flat = Z_plus4.flatten(3).permute(0, 2, 3, 1).reshape(B, -1, 1024)  # (B, 16*14*14, 1024)
        tte_input = torch.cat([zplus_flat, z0_proj], dim=1)  # (B, N, 1024)
        f_temporal = self.tte(tte_input).mean(dim=1)  # (B, 1024)

        # --- Spatial Transformer ---
        zp_proj_list = [self.proj_zp(zp) for zp in zp4_list]  # each: (B, 49, 1024)
        zp_cat = torch.cat(zp_proj_list, dim=1)  # (B, 245, 1024)
        ste_input = torch.cat([zplus_flat, zp_cat], dim=1)  # (B, N+245, 1024)
        f_spatial = self.ste(ste_input).mean(dim=1)  # (B, 1024)

        # --- Final Classifier ---
        y_hat = self.head(f_spatial, f_temporal)  # (B, 1)
        return y_hat


# ==== dummy inputs ====
B = 2  # batch size
z0_4 = torch.randn(B, 2048, 7, 7)                     # global frequency feature
zp4_list = [torch.randn(B, 2048, 7, 7) for _ in range(5)]  # 5 local frequency features
Z_plus4 = torch.randn(B, 1024, 16, 14, 14)            # blended feature from the 3D ResNet

# ==== build the model and run a forward pass ====
model = TransformerIntegrationModule()
y_hat = model(Z_plus4, z0_4, zp4_list)

print("y_hat shape:", y_hat.shape)  # should be (B, 1)

Experiments

Original statement: This article was published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission. For infringement concerns, please contact cloudcommunity@tencent.com for removal.
