
💡💡💡 Problem with BRA: the key-value pairs selected by deformable points lack semantic relevance. The query-aware sparse attention in BiFormer aims to let each query focus on its top-k routed regions; however, when attention is computed, the selected key-value pairs are influenced by too many irrelevant queries, which weakens the attention paid to the more important ones.
💡💡💡 Solution: to address these issues, we propose the Deformable Bi-level Routing Attention (DBRA) module, which uses agent queries to optimize the selection of key-value pairs and improves the interpretability of queries in the attention map.
The improved structure diagram is shown below:

1) Used as an attention module: the Deformable Bi-level Routing Attention (DBRA) module;
Recommendation: five stars
DBRA | personally verified to improve accuracy on multiple datasets, benchmarked against BRA.
The "YOLOv13 Magician" column will innovate in the following directions:

Link:
[Original self-developed modules] [Multi-point combination optimization] [Attention mechanisms] [Convolution modifications] [Block & multi-scale fusion] [Loss & IoU optimization] [Up/down-sampling optimization] [Small-object performance improvement] [Frontier paper sharing] [Hands-on training]
Subscribers can add WeChat: AI_CV_0624 to join the group for discussion; customized services such as improved structure diagrams are provided.
Source-code projects are provided to subscribers regularly, to be used together with the blog posts.
Subscribers can request invoices for reimbursement.
💡💡💡 Subscribers to this column receive the improvement code and modified network structure diagrams, making paper writing easier!!!
💡💡💡 Applicable scenarios: infrared imagery, small-object detection, industrial defect detection, medical imaging, remote-sensing object detection, low-contrast scenes
💡💡💡 Applicable tasks: every improvement applies to [detection], [segmentation], [pose], [classification], and more
💡💡💡 Exclusive first-release innovations, with [multiple self-developed modules] and [multi-point combinations suitable for papers]!!!
☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️ ☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️
Covers attention modifications, convolution modifications, detection-head innovations, loss & IoU optimization, block optimization & multi-level feature fusion, lightweight network design, the latest 2025 top-conference improvement ideas, original paper-level innovations, and more.
🚀🚀🚀 This project is continuously updated | at least 80+ improvements guaranteed on completion, aiming for 100+ 🚀🚀🚀
🍉🍉🍉 Contact WeChat: AI_CV_0624, discussion welcome! 🍉🍉🍉
⭐⭐⭐ The column's original price is 299; the earlier you subscribe, the better the deal ⭐⭐⭐
💡💡💡 The 2025 computer-vision top-conference innovations apply to YOLOv12, YOLO11, YOLOv10, YOLOv8 and the other YOLO series; each article provides step-by-step instructions and source code so you can easily start modifying the network!!!
💡💡💡 Key point: after reading this column, you will be able to design your own modified networks, making changes at different positions (Backbone, head, detect, loss, etc.) to create innovations!!!
☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️ ☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️

Paper: [2506.17733] YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception
Abstract: The YOLO series has dominated real-time object detection thanks to its excellent accuracy and computational efficiency. However, both the convolutional architectures of YOLO11 and earlier versions and the area-based self-attention introduced in YOLOv12 are limited to local information aggregation and pairwise correlation modeling; they cannot capture global many-to-many high-order correlations, which limits detection performance in complex scenes. This paper presents YOLOv13, an accurate and lightweight object detector. To address the above challenges, we propose a Hypergraph-based Adaptive Correlation Enhancement (HyperACE) mechanism that adaptively exploits latent high-order correlations through hypergraph computation, overcoming the limitation of previous methods that model only pairwise correlations and enabling efficient global cross-location and cross-scale feature fusion and enhancement. Building on HyperACE, we then propose the Full-Pipeline Aggregation-and-Distribution (FullPAD) paradigm, which distributes correlation-enhanced features throughout the network to achieve fine-grained information flow and representation synergy across the whole pipeline. Finally, we replace conventional large-kernel convolutions with depthwise separable convolutions and design a series of blocks that significantly reduce parameters and computational complexity without sacrificing performance. We conduct extensive experiments on the widely used MS COCO benchmark, and the results show that our method achieves state-of-the-art performance with fewer parameters and FLOPs. Specifically, YOLOv13-N improves mAP by 3.0% over YOLO11-N and by 1.5% over YOLOv12-N.

Previous YOLO models follow a "backbone → neck → head" computational paradigm, which inherently limits how fully information can flow. In contrast, our model enhances the traditional YOLO architecture through the Hypergraph-based Adaptive Correlation Enhancement (HyperACE) mechanism and the Full-Pipeline Aggregation-and-Distribution (FullPAD) paradigm. As a result, the proposed method achieves fine-grained information flow and representation synergy throughout the network, improving gradient propagation and significantly boosting detection performance. Specifically, as shown in Figure 2, YOLOv13 first extracts multi-scale feature maps B1, B2, B3, B4, B5 with a backbone similar to previous work, except that the large-kernel convolutions are replaced by our lightweight DS-C3k2 blocks. Then, instead of feeding B3, B4, and B5 directly into the neck as traditional YOLO methods do, we collect these features and pass them into the proposed HyperACE module, which adaptively models high-order correlations across scales and locations and enhances the features. Afterwards, our FullPAD paradigm uses three separate channels to distribute the correlation-enhanced features to the backbone-neck connections, the internal layers of the neck, and the neck-head connections, optimizing the information flow. Finally, the output feature maps of the neck are passed to the detection head for multi-scale object detection.
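As a reading aid only, the sketch below summarizes this dataflow in a few lines of Python. It is a schematic, not the authors' implementation: backbone, hyper_ace, fullpad_tunnel, neck, and head are hypothetical callables standing in for the real modules in ultralytics/nn/modules.

# Schematic of the HyperACE + FullPAD dataflow described above (illustration only;
# backbone / hyper_ace / fullpad_tunnel / neck / head are hypothetical callables).
def yolov13_dataflow_sketch(x, backbone, hyper_ace, fullpad_tunnel, neck, head):
    b1, b2, b3, b4, b5 = backbone(x)               # multi-scale features from the DS-C3k2 backbone
    enhanced = hyper_ace([b3, b4, b5])             # high-order, cross-scale correlation enhancement
    # FullPAD: distribute the enhanced features to the backbone-neck connections ...
    p3, p4, p5 = (fullpad_tunnel([f, enhanced]) for f in (b3, b4, b5))
    n3, n4, n5 = neck([p3, p4, p5])                # ... and, conceptually, also inside the neck
    # ... and to the neck-head connections, before multi-scale detection.
    d3, d4, d5 = (fullpad_tunnel([f, enhanced]) for f in (n3, n4, n5))
    return head([d3, d4, d5])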

ultralytics/cfg/models/v13/yolov13.yaml
Hypergraph-based Adaptive Correlation Enhancement (HyperACE)

Code location: ultralytics/nn/modules/block.py
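To make "hypergraph computation" concrete, the snippet below shows a generic two-step hypergraph aggregation (vertex → hyperedge → vertex), which is the standard way many-to-many, high-order correlations are mixed. It is an illustrative example only, not the HyperACE code in block.py; the incidence matrix H is a hypothetical hand-built example.

import torch

def hypergraph_aggregation(X, H):
    """Generic vertex -> hyperedge -> vertex aggregation (illustrative; not the HyperACE implementation).
    X: (N, C) vertex features; H: (N, E) incidence matrix (H[i, e] = 1 if vertex i belongs to hyperedge e)."""
    deg_e = H.sum(dim=0).clamp(min=1.0)            # hyperedge degrees
    deg_v = H.sum(dim=1).clamp(min=1.0)            # vertex degrees
    edge_feat = (H.t() @ X) / deg_e[:, None]       # aggregate member vertices into each hyperedge
    return (H @ edge_feat) / deg_v[:, None]        # distribute hyperedge features back to vertices

# Toy example: 6 "pixels", 2 hyperedges with 3 members each (many-to-many correlation).
X = torch.randn(6, 16)
H = torch.tensor([[1., 0.], [1., 0.], [1., 0.], [0., 1.], [0., 1.], [0., 1.]])
print(hypergraph_aggregation(X, H).shape)  # torch.Size([6, 16])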
Full-Pipeline Aggregation-and-Distribution paradigm (FullPAD)

Code location: ultralytics/nn/modules/block.py
Lightweight blocks based on depthwise separable convolution

Code location: ultralytics/nn/modules/block.py
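For intuition, the sketch below shows the generic depthwise-separable pattern these lightweight blocks are built on: a k×k depthwise convolution followed by a 1×1 pointwise convolution, which is where the parameter and FLOP savings over a standard k×k convolution come from. This is a generic illustration, not the exact DSConv/DSC3k2 code in block.py.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Generic depthwise-separable conv: depthwise k x k + pointwise 1 x 1 (illustrative sketch)."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, k, s, k // 2, groups=c_in, bias=False)  # per-channel spatial conv
        self.pw = nn.Conv2d(c_in, c_out, 1, 1, 0, bias=False)                   # 1x1 channel mixing
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))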

Paper: https://arxiv.org/pdf/2410.08582
Abstract: Vision Transformers with various attention modules have demonstrated excellent performance on vision tasks. While sparse adaptive attention, as used in DAT, has achieved notable results in image classification, the key-value pairs selected by deformable points lack semantic relevance when fine-tuning for semantic segmentation. The query-aware sparse attention in BiFormer aims to let each query focus on its top-k routed regions; however, when attention is computed, the selected key-value pairs are influenced by too many irrelevant queries, which weakens the attention paid to the more important ones. To address these issues, we propose the Deformable Bi-level Routing Attention (DBRA) module, which uses agent queries to optimize the selection of key-value pairs and improves the interpretability of queries in the attention map. Building on this, we introduce the Deformable Bi-level Routing Attention Transformer (DeBiFormer), a novel general-purpose vision transformer built with DBRA modules. DeBiFormer has been validated on a variety of computer vision tasks, including image classification, object detection, and semantic segmentation, providing strong evidence of its effectiveness.

To improve the attention efficiency of queries, we propose Deformable Bi-level Routing Attention (DBRA), an attention-in-attention architecture for visual recognition. The first question in DBRA is how to locate the deformable points. We use the observation from [47] that the attention module has an offset network that takes query features as input and generates corresponding offsets for all reference points; the candidate deformable points therefore shift toward important regions with high flexibility and efficiency, capturing more informative features. The second question is how to aggregate information from semantically relevant key-value pairs and then broadcast it back to the queries. We therefore propose an attention-in-attention architecture in which, as described above, the deformable points act as agents for the queries. When selecting key-value pairs for the deformable points, we use the observation from [56] and attend only to the top-k routed regions, selecting the small subset of the most semantically relevant key-value pairs that a given region actually needs. After these semantically relevant key-value pairs are selected, we first apply a token-to-token attention conditioned on the deformable-point queries. We then apply a second token-to-token attention that broadcasts the information back to the queries, in which the deformable points serve as key-value pairs designed to represent the most important points within a subset of semantic regions.

This describes the detailed architecture of "Deformable Bi-level Routing Attention" (DBRA). The full implementation is given below:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
from typing import Tuple

from einops import rearrange
from timm.models.layers import to_2tuple, trunc_normal_
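# NOTE (added): LayerNormProxy is used by conv_offset_q further below but was missing from the
# original snippet. The definition here follows the standard DAT reference implementation
# (a LayerNorm applied to an NCHW tensor); treat it as an assumed dependency.
class LayerNormProxy(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = rearrange(x, 'b c h w -> b h w c')
        x = self.norm(x)
        return rearrange(x, 'b h w c -> b c h w')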
class DWConv(nn.Module):
def __init__(self, dim=768):
super(DWConv, self).__init__()
self.dwconv = nn.Conv2d(dim, dim, 3, 1, 1, bias=True, groups=dim)
def forward(self, x):
"""
x: NHWC tensor
"""
x = x.permute(0, 3, 1, 2) #NCHW
x = self.dwconv(x)
x = x.permute(0, 2, 3, 1) #NHWC
return x
class ConvFFN(nn.Module):
def __init__(self, dim=768):
        super(ConvFFN, self).__init__()
self.dwconv = nn.Conv2d(dim, dim, 1, 1, 0)
def forward(self, x):
"""
x: NHWC tensor
"""
x = x.permute(0, 3, 1, 2) #NCHW
x = self.dwconv(x)
x = x.permute(0, 2, 3, 1) #NHWC
return x
class Attention(nn.Module):
"""
vanilla attention
"""
def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
super().__init__()
self.num_heads = num_heads
head_dim = dim // num_heads
# NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
self.scale = qk_scale or head_dim ** -0.5
self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(dim, dim)
self.proj_drop = nn.Dropout(proj_drop)
def forward(self, x):
"""
args:
x: NHWC tensor
return:
NHWC tensor
"""
_, H, W, _ = x.size()
x = rearrange(x, 'n h w c -> n (h w) c')
#######################################
B, N, C = x.shape
qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
q, k, v = qkv[0], qkv[1], qkv[2] # make torchscript happy (cannot use tensor as tuple)
attn = (q @ k.transpose(-2, -1)) * self.scale
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
x = self.proj(x)
x = self.proj_drop(x)
#######################################
x = rearrange(x, 'n (h w) c -> n h w c', h=H, w=W)
return x
class AttentionLePE(nn.Module):
"""
    vanilla attention with LePE (locally-enhanced positional encoding)
"""
def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., side_dwconv=5):
super().__init__()
self.num_heads = num_heads
head_dim = dim // num_heads
# NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
self.scale = qk_scale or head_dim ** -0.5
self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(dim, dim)
self.proj_drop = nn.Dropout(proj_drop)
self.lepe = nn.Conv2d(dim, dim, kernel_size=side_dwconv, stride=1, padding=side_dwconv//2, groups=dim) if side_dwconv > 0 else \
lambda x: torch.zeros_like(x)
def forward(self, x):
"""
args:
x: NHWC tensor
return:
NHWC tensor
"""
_, H, W, _ = x.size()
x = rearrange(x, 'n h w c -> n (h w) c')
#######################################
B, N, C = x.shape
qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
q, k, v = qkv[0], qkv[1], qkv[2] # make torchscript happy (cannot use tensor as tuple)
lepe = self.lepe(rearrange(x, 'n (h w) c -> n c h w', h=H, w=W))
lepe = rearrange(lepe, 'n c h w -> n (h w) c')
attn = (q @ k.transpose(-2, -1)) * self.scale
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
x = x + lepe
x = self.proj(x)
x = self.proj_drop(x)
#######################################
x = rearrange(x, 'n (h w) c -> n h w c', h=H, w=W)
return x
class nchwAttentionLePE(nn.Module):
"""
Attention with LePE, takes nchw input
"""
def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., side_dwconv=5):
super().__init__()
self.num_heads = num_heads
self.head_dim = dim // num_heads
self.scale = qk_scale or self.head_dim ** -0.5
self.qkv = nn.Conv2d(dim, dim*3, kernel_size=1, bias=qkv_bias)
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Conv2d(dim, dim, kernel_size=1)
self.proj_drop = nn.Dropout(proj_drop)
self.lepe = nn.Conv2d(dim, dim, kernel_size=side_dwconv, stride=1, padding=side_dwconv//2, groups=dim) if side_dwconv > 0 else \
lambda x: torch.zeros_like(x)
def forward(self, x:torch.Tensor):
"""
args:
x: NCHW tensor
return:
NCHW tensor
"""
B, C, H, W = x.size()
q, k, v = self.qkv.forward(x).chunk(3, dim=1) # B, C, H, W
attn = q.view(B, self.num_heads, self.head_dim, H*W).transpose(-1, -2) @ \
k.view(B, self.num_heads, self.head_dim, H*W)
attn = torch.softmax(attn*self.scale, dim=-1)
attn = self.attn_drop(attn)
# (B, nhead, HW, HW) @ (B, nhead, HW, head_dim) -> (B, nhead, HW, head_dim)
output:torch.Tensor = attn @ v.view(B, self.num_heads, self.head_dim, H*W).transpose(-1, -2)
output = output.permute(0, 1, 3, 2).reshape(B, C, H, W)
output = output + self.lepe(v)
output = self.proj_drop(self.proj(output))
return output
class TopkRouting(nn.Module):
"""
differentiable topk routing with scaling
Args:
qk_dim: int, feature dimension of query and key
topk: int, the 'topk'
qk_scale: int or None, temperature (multiply) of softmax activation
        with_param: bool, whether to incorporate learnable params in routing unit
        diff_routing: bool, whether to make routing differentiable
        soft_routing: bool, whether to multiply output values by routing weights
"""
def __init__(self, qk_dim, topk=4, qk_scale=None, param_routing=False, diff_routing=False):
super().__init__()
self.topk = topk
self.qk_dim = qk_dim
self.scale = qk_scale or qk_dim ** -0.5
self.diff_routing = diff_routing
# TODO: norm layer before/after linear?
self.emb = nn.Linear(qk_dim, qk_dim) if param_routing else nn.Identity()
# routing activation
self.routing_act = nn.Softmax(dim=-1)
def forward(self, query:Tensor, key:Tensor)->Tuple[Tensor]:
"""
Args:
q, k: (n, p^2, c) tensor
Return:
r_weight, topk_index: (n, p^2, topk) tensor
"""
if not self.diff_routing:
query, key = query.detach(), key.detach()
query_hat, key_hat = self.emb(query), self.emb(key) # per-window pooling -> (n, p^2, c)
attn_logit = (query_hat*self.scale) @ key_hat.transpose(-2, -1) # (n, p^2, p^2)
topk_attn_logit, topk_index = torch.topk(attn_logit, k=self.topk, dim=-1) # (n, p^2, k), (n, p^2, k)
r_weight = self.routing_act(topk_attn_logit) # (n, p^2, k)
return r_weight, topk_index
class KVGather(nn.Module):
def __init__(self, mul_weight='none'):
super().__init__()
assert mul_weight in ['none', 'soft', 'hard']
self.mul_weight = mul_weight
def forward(self, r_idx:Tensor, r_weight:Tensor, kv:Tensor):
"""
r_idx: (n, p^2, topk) tensor
r_weight: (n, p^2, topk) tensor
kv: (n, p^2, w^2, c_kq+c_v)
Return:
(n, p^2, topk, w^2, c_kq+c_v) tensor
"""
# select kv according to routing index
n, p2, w2, c_kv = kv.size()
topk = r_idx.size(-1)
# print(r_idx.size(), r_weight.size())
# FIXME: gather consumes much memory (topk times redundancy), write cuda kernel?
topk_kv = torch.gather(kv.view(n, 1, p2, w2, c_kv).expand(-1, p2, -1, -1, -1), # (n, p^2, p^2, w^2, c_kv) without mem cpy
dim=2,
index=r_idx.view(n, p2, topk, 1, 1).expand(-1, -1, -1, w2, c_kv) # (n, p^2, k, w^2, c_kv)
)
if self.mul_weight == 'soft':
topk_kv = r_weight.view(n, p2, topk, 1, 1) * topk_kv # (n, p^2, k, w^2, c_kv)
elif self.mul_weight == 'hard':
raise NotImplementedError('differentiable hard routing TBA')
# else: #'none'
# topk_kv = topk_kv # do nothing
return topk_kv
class QKVLinear(nn.Module):
def __init__(self, dim, qk_dim, bias=True):
super().__init__()
self.dim = dim
self.qk_dim = qk_dim
self.qkv = nn.Linear(dim, qk_dim + qk_dim + dim, bias=bias)
def forward(self, x):
q, kv = self.qkv(x).split([self.qk_dim, self.qk_dim+self.dim], dim=-1)
return q, kv
# q, k, v = self.qkv(x).split([self.qk_dim, self.qk_dim, self.dim], dim=-1)
# return q, k, v
class QKVConv(nn.Module):
def __init__(self, dim, qk_dim, bias=True):
super().__init__()
self.dim = dim
self.qk_dim = qk_dim
self.qkv = nn.Conv2d(dim, qk_dim + qk_dim + dim, 1, 1, 0)
def forward(self, x):
q, kv = self.qkv(x).split([self.qk_dim, self.qk_dim+self.dim], dim=1)
return q, kv
class BiLevelRoutingAttention(nn.Module):
"""
n_win: number of windows in one side (so the actual number of windows is n_win*n_win)
kv_per_win: for kv_downsample_mode='ada_xxxpool' only, number of key/values per window. Similar to n_win, the actual number is kv_per_win*kv_per_win.
topk: topk for window filtering
param_attention: 'qkvo'-linear for q,k,v and o, 'none': param free attention
param_routing: extra linear for routing
    diff_routing: whether to make routing differentiable
    soft_routing: whether to multiply soft routing weights
"""
def __init__(self, dim, num_heads=8, n_win=7, qk_dim=None, qk_scale=None,
kv_per_win=4, kv_downsample_ratio=4, kv_downsample_kernel=None, kv_downsample_mode='identity',
topk=4, param_attention="qkvo", param_routing=False, diff_routing=False, soft_routing=False, side_dwconv=3,
auto_pad=False):
super().__init__()
# local attention setting
self.dim = dim
self.n_win = n_win # Wh, Ww
self.num_heads = num_heads
self.qk_dim = qk_dim or dim
assert self.qk_dim % num_heads == 0 and self.dim % num_heads==0, 'qk_dim and dim must be divisible by num_heads!'
self.scale = qk_scale or self.qk_dim ** -0.5
################side_dwconv (i.e. LCE in ShuntedTransformer)###########
self.lepe = nn.Conv2d(dim, dim, kernel_size=side_dwconv, stride=1, padding=side_dwconv//2, groups=dim) if side_dwconv > 0 else \
lambda x: torch.zeros_like(x)
################ global routing setting #################
self.topk = topk
self.param_routing = param_routing
self.diff_routing = diff_routing
self.soft_routing = soft_routing
# router
assert not (self.param_routing and not self.diff_routing) # cannot be with_param=True and diff_routing=False
self.router = TopkRouting(qk_dim=self.qk_dim,
qk_scale=self.scale,
topk=self.topk,
diff_routing=self.diff_routing,
param_routing=self.param_routing)
        if self.soft_routing: # soft routing, always differentiable (if no detach)
mul_weight = 'soft'
elif self.diff_routing: # hard differentiable routing
mul_weight = 'hard'
else: # hard non-differentiable routing
mul_weight = 'none'
self.kv_gather = KVGather(mul_weight=mul_weight)
# qkv mapping (shared by both global routing and local attention)
self.param_attention = param_attention
if self.param_attention == 'qkvo':
self.qkv = QKVLinear(self.dim, self.qk_dim)
self.wo = nn.Linear(dim, dim)
elif self.param_attention == 'qkv':
self.qkv = QKVLinear(self.dim, self.qk_dim)
self.wo = nn.Identity()
else:
            raise ValueError(f'param_attention mode {self.param_attention} is not supported!')
self.kv_downsample_mode = kv_downsample_mode
self.kv_per_win = kv_per_win
self.kv_downsample_ratio = kv_downsample_ratio
self.kv_downsample_kenel = kv_downsample_kernel
if self.kv_downsample_mode == 'ada_avgpool':
assert self.kv_per_win is not None
self.kv_down = nn.AdaptiveAvgPool2d(self.kv_per_win)
elif self.kv_downsample_mode == 'ada_maxpool':
assert self.kv_per_win is not None
self.kv_down = nn.AdaptiveMaxPool2d(self.kv_per_win)
elif self.kv_downsample_mode == 'maxpool':
assert self.kv_downsample_ratio is not None
self.kv_down = nn.MaxPool2d(self.kv_downsample_ratio) if self.kv_downsample_ratio > 1 else nn.Identity()
elif self.kv_downsample_mode == 'avgpool':
assert self.kv_downsample_ratio is not None
self.kv_down = nn.AvgPool2d(self.kv_downsample_ratio) if self.kv_downsample_ratio > 1 else nn.Identity()
elif self.kv_downsample_mode == 'identity': # no kv downsampling
self.kv_down = nn.Identity()
elif self.kv_downsample_mode == 'fracpool':
# assert self.kv_downsample_ratio is not None
# assert self.kv_downsample_kenel is not None
# TODO: fracpool
# 1. kernel size should be input size dependent
# 2. there is a random factor, need to avoid independent sampling for k and v
raise NotImplementedError('fracpool policy is not implemented yet!')
elif kv_downsample_mode == 'conv':
# TODO: need to consider the case where k != v so that need two downsample modules
raise NotImplementedError('conv policy is not implemented yet!')
else:
            raise ValueError(f'kv_downsample_mode {self.kv_downsample_mode} is not supported!')
# softmax for local attention
self.attn_act = nn.Softmax(dim=-1)
self.auto_pad=auto_pad
def forward(self, x, ret_attn_mask=False):
"""
x: NHWC tensor
Return:
NHWC tensor
"""
# NOTE: use padding for semantic segmentation
###################################################
if self.auto_pad:
N, H_in, W_in, C = x.size()
pad_l = pad_t = 0
pad_r = (self.n_win - W_in % self.n_win) % self.n_win
pad_b = (self.n_win - H_in % self.n_win) % self.n_win
x = F.pad(x, (0, 0, # dim=-1
pad_l, pad_r, # dim=-2
pad_t, pad_b)) # dim=-3
_, H, W, _ = x.size() # padded size
else:
N, H, W, C = x.size()
#assert H%self.n_win == 0 and W%self.n_win == 0 #
###################################################
# patchify, (n, p^2, w, w, c), keep 2d window as we need 2d pooling to reduce kv size
x = rearrange(x, "n (j h) (i w) c -> n (j i) h w c", j=self.n_win, i=self.n_win)
#################qkv projection###################
# q: (n, p^2, w, w, c_qk)
# kv: (n, p^2, w, w, c_qk+c_v)
# NOTE: separte kv if there were memory leak issue caused by gather
q, kv = self.qkv(x)
# pixel-wise qkv
# q_pix: (n, p^2, w^2, c_qk)
# kv_pix: (n, p^2, h_kv*w_kv, c_qk+c_v)
q_pix = rearrange(q, 'n p2 h w c -> n p2 (h w) c')
kv_pix = self.kv_down(rearrange(kv, 'n p2 h w c -> (n p2) c h w'))
kv_pix = rearrange(kv_pix, '(n j i) c h w -> n (j i) (h w) c', j=self.n_win, i=self.n_win)
q_win, k_win = q.mean([2, 3]), kv[..., 0:self.qk_dim].mean([2, 3]) # window-wise qk, (n, p^2, c_qk), (n, p^2, c_qk)
##################side_dwconv(lepe)##################
# NOTE: call contiguous to avoid gradient warning when using ddp
lepe = self.lepe(rearrange(kv[..., self.qk_dim:], 'n (j i) h w c -> n c (j h) (i w)', j=self.n_win, i=self.n_win).contiguous())
lepe = rearrange(lepe, 'n c (j h) (i w) -> n (j h) (i w) c', j=self.n_win, i=self.n_win)
############ gather q dependent k/v #################
r_weight, r_idx = self.router(q_win, k_win) # both are (n, p^2, topk) tensors
kv_pix_sel = self.kv_gather(r_idx=r_idx, r_weight=r_weight, kv=kv_pix) #(n, p^2, topk, h_kv*w_kv, c_qk+c_v)
k_pix_sel, v_pix_sel = kv_pix_sel.split([self.qk_dim, self.dim], dim=-1)
# kv_pix_sel: (n, p^2, topk, h_kv*w_kv, c_qk)
# v_pix_sel: (n, p^2, topk, h_kv*w_kv, c_v)
######### do attention as normal ####################
k_pix_sel = rearrange(k_pix_sel, 'n p2 k w2 (m c) -> (n p2) m c (k w2)', m=self.num_heads) # flatten to BMLC, (n*p^2, m, topk*h_kv*w_kv, c_kq//m) transpose here?
v_pix_sel = rearrange(v_pix_sel, 'n p2 k w2 (m c) -> (n p2) m (k w2) c', m=self.num_heads) # flatten to BMLC, (n*p^2, m, topk*h_kv*w_kv, c_v//m)
q_pix = rearrange(q_pix, 'n p2 w2 (m c) -> (n p2) m w2 c', m=self.num_heads) # to BMLC tensor (n*p^2, m, w^2, c_qk//m)
# param-free multihead attention
attn_weight = (q_pix * self.scale) @ k_pix_sel # (n*p^2, m, w^2, c) @ (n*p^2, m, c, topk*h_kv*w_kv) -> (n*p^2, m, w^2, topk*h_kv*w_kv)
attn_weight = self.attn_act(attn_weight)
out = attn_weight @ v_pix_sel # (n*p^2, m, w^2, topk*h_kv*w_kv) @ (n*p^2, m, topk*h_kv*w_kv, c) -> (n*p^2, m, w^2, c)
out = rearrange(out, '(n j i) m (h w) c -> n (j h) (i w) (m c)', j=self.n_win, i=self.n_win,
h=H//self.n_win, w=W//self.n_win)
out = out + lepe
# output linear
out = self.wo(out)
# NOTE: use padding for semantic segmentation
# crop padded region
if self.auto_pad and (pad_r > 0 or pad_b > 0):
out = out[:, :H_in, :W_in, :].contiguous()
if ret_attn_mask:
return out, r_weight, r_idx, attn_weight
else:
return out
class TransformerMLPWithConv(nn.Module):
def __init__(self, channels, expansion, drop):
super().__init__()
self.dim1 = channels
self.dim2 = channels * expansion
self.linear1 = nn.Sequential(
nn.Conv2d(self.dim1, self.dim2, 1, 1, 0),
# nn.GELU(),
# nn.BatchNorm2d(self.dim2, eps=1e-5)
)
self.drop1 = nn.Dropout(drop, inplace=True)
self.act = nn.GELU()
# self.bn = nn.BatchNorm2d(self.dim2, eps=1e-5)
self.linear2 = nn.Sequential(
nn.Conv2d(self.dim2, self.dim1, 1, 1, 0),
# nn.BatchNorm2d(self.dim1, eps=1e-5)
)
self.drop2 = nn.Dropout(drop, inplace=True)
self.dwc = nn.Conv2d(self.dim2, self.dim2, 3, 1, 1, groups=self.dim2)
def forward(self, x):
x = self.linear1(x)
x = self.drop1(x)
x = x + self.dwc(x)
x = self.act(x)
# x = self.bn(x)
x = self.linear2(x)
x = self.drop2(x)
return x
class DeBiLevelRoutingAttentionblcok(nn.Module):
"""
n_win: number of windows in one side (so the actual number of windows is n_win*n_win)
kv_per_win: for kv_downsample_mode='ada_xxxpool' only, number of key/values per window. Similar to n_win, the actual number is kv_per_win*kv_per_win.
topk: topk for window filtering
param_attention: 'qkvo'-linear for q,k,v and o, 'none': param free attention
param_routing: extra linear for routing
    diff_routing: whether to make routing differentiable
    soft_routing: whether to multiply soft routing weights
"""
def __init__(self, dim, num_heads=8, n_win=7, qk_dim=None, qk_scale=None,
kv_per_win=4, kv_downsample_ratio=4, kv_downsample_kernel=None, kv_downsample_mode='identity',
topk=4, param_attention="qkvo", param_routing=False, diff_routing=False, soft_routing=False, side_dwconv=3,
auto_pad=True, param_size='small'):
super().__init__()
# local attention setting
self.dim = dim
self.n_win = n_win # Wh, Ww
self.num_heads = num_heads
self.qk_dim = qk_dim or dim
#############################################################
if param_size=='tiny':
if self.dim == 64 :
self.n_groups = 1
self.top_k_def = 16 # 2 128
self.kk = 9
self.stride_def = 8
self.expain_ratio = 3
self.q_size=to_2tuple(56)
if self.dim == 128 :
self.n_groups = 2
self.top_k_def = 16 # 4 256
self.kk = 7
self.stride_def = 4
self.expain_ratio = 3
self.q_size=to_2tuple(28)
if self.dim == 256 :
self.n_groups = 4
self.top_k_def = 4 # 8 512
self.kk = 5
self.stride_def = 2
self.expain_ratio = 3
self.q_size=to_2tuple(14)
if self.dim == 512 :
self.n_groups = 8
self.top_k_def = 49 # 8 512
self.kk = 3
self.stride_def = 1
self.expain_ratio = 3
self.q_size=to_2tuple(7)
#############################################################
if param_size=='small':
if self.dim == 64 :
self.n_groups = 1
self.top_k_def = 16 # 2 128
self.kk = 9
self.stride_def = 8
self.expain_ratio = 3
self.q_size=to_2tuple(56)
if self.dim == 128 :
self.n_groups = 2
self.top_k_def = 16 # 4 256
self.kk = 7
self.stride_def = 4
self.expain_ratio = 3
self.q_size=to_2tuple(28)
if self.dim == 256 :
self.n_groups = 4
self.top_k_def = 4 # 8 512
self.kk = 5
self.stride_def = 2
self.expain_ratio = 3
self.q_size=to_2tuple(14)
if self.dim == 512 :
self.n_groups = 8
self.top_k_def = 49 # 8 512
self.kk = 3
self.stride_def = 1
self.expain_ratio = 1
self.q_size=to_2tuple(7)
#############################################################
if param_size=='base':
if self.dim == 96 :
self.n_groups = 1
self.top_k_def = 16 # 2 128
self.kk = 9
self.stride_def = 8
self.expain_ratio = 3
self.q_size=to_2tuple(56)
if self.dim == 192 :
self.n_groups = 2
self.top_k_def = 16 # 4 256
self.kk = 7
self.stride_def = 4
self.expain_ratio = 3
self.q_size=to_2tuple(28)
if self.dim == 384 :
self.n_groups = 3
self.top_k_def = 4 # 8 512
self.kk = 5
self.stride_def = 2
self.expain_ratio = 3
self.q_size=to_2tuple(14)
if self.dim == 768 :
self.n_groups = 6
self.top_k_def = 49 # 8 512
self.kk = 3
self.stride_def = 1
self.expain_ratio = 3
self.q_size=to_2tuple(7)
self.q_h, self.q_w = self.q_size
self.kv_h, self.kv_w = self.q_h // self.stride_def, self.q_w // self.stride_def
self.n_group_channels = self.dim // self.n_groups
self.n_group_heads = self.num_heads // self.n_groups
self.n_group_channels = self.dim // self.n_groups
self.offset_range_factor = -1
self.head_channels = dim // num_heads
self.n_group_heads = self.num_heads // self.n_groups
#assert self.qk_dim % num_heads == 0 and self.dim % num_heads==0, 'qk_dim and dim must be divisible by num_heads!'
self.scale = qk_scale or self.qk_dim ** -0.5
self.rpe_table = nn.Parameter(
torch.zeros(self.num_heads, self.q_h * 2 - 1, self.q_w * 2 - 1)
)
trunc_normal_(self.rpe_table, std=0.01)
################side_dwconv (i.e. LCE in ShuntedTransformer)###########
self.lepe1 = nn.Conv2d(dim, dim, kernel_size=side_dwconv, stride=self.stride_def, padding=side_dwconv//2, groups=dim) if side_dwconv > 0 else \
lambda x: torch.zeros_like(x)
################ global routing setting #################
self.topk = topk
self.param_routing = param_routing
self.diff_routing = diff_routing
self.soft_routing = soft_routing
# router
#assert not (self.param_routing and not self.diff_routing) # cannot be with_param=True and diff_routing=False
self.router = TopkRouting(qk_dim=self.qk_dim,
qk_scale=self.scale,
topk=self.topk,
diff_routing=self.diff_routing,
param_routing=self.param_routing)
        if self.soft_routing: # soft routing, always differentiable (if no detach)
mul_weight = 'soft'
elif self.diff_routing: # hard differentiable routing
mul_weight = 'hard'
else: # hard non-differentiable routing
mul_weight = 'none'
self.kv_gather = KVGather(mul_weight=mul_weight)
# qkv mapping (shared by both global routing and local attention)
self.param_attention = param_attention
if self.param_attention == 'qkvo':
#self.qkv = QKVLinear(self.dim, self.qk_dim)
self.qkv_conv = QKVConv(self.dim, self.qk_dim)
#self.wo = nn.Linear(dim, dim)
elif self.param_attention == 'qkv':
#self.qkv = QKVLinear(self.dim, self.qk_dim)
self.qkv_conv = QKVConv(self.dim, self.qk_dim)
#self.wo = nn.Identity()
else:
            raise ValueError(f'param_attention mode {self.param_attention} is not supported!')
self.kv_downsample_mode = kv_downsample_mode
self.kv_per_win = kv_per_win
self.kv_downsample_ratio = kv_downsample_ratio
self.kv_downsample_kenel = kv_downsample_kernel
if self.kv_downsample_mode == 'ada_avgpool':
assert self.kv_per_win is not None
self.kv_down = nn.AdaptiveAvgPool2d(self.kv_per_win)
elif self.kv_downsample_mode == 'ada_maxpool':
assert self.kv_per_win is not None
self.kv_down = nn.AdaptiveMaxPool2d(self.kv_per_win)
elif self.kv_downsample_mode == 'maxpool':
assert self.kv_downsample_ratio is not None
self.kv_down = nn.MaxPool2d(self.kv_downsample_ratio) if self.kv_downsample_ratio > 1 else nn.Identity()
elif self.kv_downsample_mode == 'avgpool':
assert self.kv_downsample_ratio is not None
self.kv_down = nn.AvgPool2d(self.kv_downsample_ratio) if self.kv_downsample_ratio > 1 else nn.Identity()
elif self.kv_downsample_mode == 'identity': # no kv downsampling
self.kv_down = nn.Identity()
elif self.kv_downsample_mode == 'fracpool':
raise NotImplementedError('fracpool policy is not implemented yet!')
elif kv_downsample_mode == 'conv':
raise NotImplementedError('conv policy is not implemented yet!')
else:
            raise ValueError(f'kv_downsample_mode {self.kv_downsample_mode} is not supported!')
self.attn_act = nn.Softmax(dim=-1)
self.auto_pad=auto_pad
##########################################################################################
self.proj_q = nn.Conv2d(
dim, dim,
kernel_size=1, stride=1, padding=0
)
self.proj_k = nn.Conv2d(
dim, dim,
kernel_size=1, stride=1, padding=0
)
self.proj_v = nn.Conv2d(
dim, dim,
kernel_size=1, stride=1, padding=0
)
self.proj_out = nn.Conv2d(
dim, dim,
kernel_size=1, stride=1, padding=0
)
self.unifyheads1 = nn.Conv2d(
dim, dim,
kernel_size=1, stride=1, padding=0
)
self.conv_offset_q = nn.Sequential(
nn.Conv2d(self.n_group_channels, self.n_group_channels, (self.kk,self.kk), (self.stride_def,self.stride_def), (self.kk//2,self.kk//2), groups=self.n_group_channels, bias=False),
LayerNormProxy(self.n_group_channels),
nn.GELU(),
nn.Conv2d(self.n_group_channels, 1, 1, 1, 0, bias=False),
)
### FFN
self.norm = nn.LayerNorm(dim, eps=1e-6)
self.norm2 = nn.LayerNorm(dim, eps=1e-6)
self.mlp =TransformerMLPWithConv(dim, self.expain_ratio, 0.)
@torch.no_grad()
def _get_ref_points(self, H_key, W_key, B, dtype, device):
ref_y, ref_x = torch.meshgrid(
torch.linspace(0.5, H_key - 0.5, H_key, dtype=dtype, device=device),
torch.linspace(0.5, W_key - 0.5, W_key, dtype=dtype, device=device)
)
ref = torch.stack((ref_y, ref_x), -1)
ref[..., 1].div_(W_key).mul_(2).sub_(1)
ref[..., 0].div_(H_key).mul_(2).sub_(1)
ref = ref[None, ...].expand(B * self.n_groups, -1, -1, -1) # B * g H W 2
return ref
@torch.no_grad()
def _get_q_grid(self, H, W, B, dtype, device):
ref_y, ref_x = torch.meshgrid(
torch.arange(0, H, dtype=dtype, device=device),
torch.arange(0, W, dtype=dtype, device=device),
indexing='ij'
)
ref = torch.stack((ref_y, ref_x), -1)
ref[..., 1].div_(W - 1.0).mul_(2.0).sub_(1.0)
ref[..., 0].div_(H - 1.0).mul_(2.0).sub_(1.0)
ref = ref[None, ...].expand(B * self.n_groups, -1, -1, -1) # B * g H W 2
return ref
def forward(self, x, ret_attn_mask=False):
dtype, device = x.dtype, x.device
"""
x: NHWC tensor
Return:
NHWC tensor
"""
# NOTE: use padding for semantic segmentation
###################################################
if self.auto_pad:
N, H_in, W_in, C = x.size()
pad_l = pad_t = 0
pad_r = (self.n_win - W_in % self.n_win) % self.n_win
pad_b = (self.n_win - H_in % self.n_win) % self.n_win
x = F.pad(x, (0, 0, # dim=-1
pad_l, pad_r, # dim=-2
pad_t, pad_b)) # dim=-3
_, H, W, _ = x.size() # padded size
else:
N, H, W, C = x.size()
assert H%self.n_win == 0 and W%self.n_win == 0 #
#print("X_in")
#print(x.shape)
###################################################
#q=self.proj_q_def(x)
x_res = rearrange(x, "n h w c -> n c h w")
#################qkv projection###################
q,kv = self.qkv_conv(x.permute(0, 3, 1, 2))
q_bi = rearrange(q, "n c (j h) (i w) -> n (j i) h w c", j=self.n_win, i=self.n_win)
kv = rearrange(kv, "n c (j h) (i w) -> n (j i) h w c", j=self.n_win, i=self.n_win)
q_pix = rearrange(q_bi, 'n p2 h w c -> n p2 (h w) c')
kv_pix = self.kv_down(rearrange(kv, 'n p2 h w c -> (n p2) c h w'))
kv_pix = rearrange(kv_pix, '(n j i) c h w -> n (j i) (h w) c', j=self.n_win, i=self.n_win)
##################side_dwconv(lepe)##################
# NOTE: call contiguous to avoid gradient warning when using ddp
lepe1 = self.lepe1(rearrange(kv[..., self.qk_dim:], 'n (j i) h w c -> n c (j h) (i w)', j=self.n_win, i=self.n_win).contiguous())
################################################################# Offset Q
q_off = rearrange(q, 'b (g c) h w -> (b g) c h w', g=self.n_groups, c=self.n_group_channels)
offset_q = self.conv_offset_q(q_off).contiguous() # B * g 2 Sg HWg
Hk, Wk = offset_q.size(2), offset_q.size(3)
n_sample = Hk * Wk
if self.offset_range_factor > 0:
offset_range = torch.tensor([1.0 / Hk, 1.0 / Wk], device=device).reshape(1, 2, 1, 1)
offset_q = offset_q.tanh().mul(offset_range).mul(self.offset_range_factor)
offset_q = rearrange(offset_q, 'b p h w -> b h w p') # B * g 2 Hg Wg -> B*g Hg Wg 2
reference = self._get_ref_points(Hk, Wk, N, dtype, device)
if self.offset_range_factor >= 0:
pos_k = offset_q + reference
else:
pos_k = (offset_q + reference).clamp(-1., +1.)
x_sampled_q = F.grid_sample(
input=x_res.reshape(N * self.n_groups, self.n_group_channels, H, W),
grid=pos_k[..., (1, 0)], # y, x -> x, y
mode='bilinear', align_corners=True) # B * g, Cg, Hg, Wg
q_sampled = x_sampled_q.reshape(N, C, Hk, Wk)
######## Bi-LEVEL Gathering
if self.auto_pad:
q_sampled=q_sampled.permute(0, 2, 3, 1)
Ng, Hg, Wg, Cg = q_sampled.size()
pad_l = pad_t = 0
pad_rg = (self.n_win - Wg % self.n_win) % self.n_win
pad_bg = (self.n_win - Hg % self.n_win) % self.n_win
q_sampled = F.pad(q_sampled, (0, 0, # dim=-1
pad_l, pad_rg, # dim=-2
pad_t, pad_bg)) # dim=-3
_, Hg, Wg, _ = q_sampled.size() # padded size
q_sampled=q_sampled.permute(0, 3, 1, 2)
lepe1 = F.pad(lepe1.permute(0, 2, 3, 1), (0, 0, # dim=-1
pad_l, pad_rg, # dim=-2
pad_t, pad_bg)) # dim=-3
lepe1=lepe1.permute(0, 3, 1, 2)
pos_k = F.pad(pos_k, (0, 0, # dim=-1
pad_l, pad_rg, # dim=-2
pad_t, pad_bg)) # dim=-3
queries_def = self.proj_q(q_sampled) #Linnear projection
queries_def = rearrange(queries_def, "n c (j h) (i w) -> n (j i) h w c", j=self.n_win, i=self.n_win).contiguous()
q_win, k_win = queries_def.mean([2, 3]), kv[..., 0:(self.qk_dim)].mean([2, 3])
r_weight, r_idx = self.router(q_win, k_win)
kv_gather = self.kv_gather(r_idx=r_idx, r_weight=r_weight, kv=kv_pix) # (n, p^2, topk, h_kv*w_kv, c )
k_gather, v_gather = kv_gather.split([self.qk_dim, self.dim], dim=-1)
### Bi-level Routing MHA
k = rearrange(k_gather, 'n p2 k hw (m c) -> (n p2) m c (k hw)', m=self.num_heads)
v = rearrange(v_gather, 'n p2 k hw (m c) -> (n p2) m (k hw) c', m=self.num_heads)
q_def = rearrange(queries_def, 'n p2 h w (m c)-> (n p2) m (h w) c',m=self.num_heads)
attn_weight = (q_def * self.scale) @ k
attn_weight = self.attn_act(attn_weight)
out = attn_weight @ v
out_def = rearrange(out, '(n j i) m (h w) c -> n (m c) (j h) (i w)', j=self.n_win, i=self.n_win, h=Hg//self.n_win, w=Wg//self.n_win).contiguous()
out_def = out_def + lepe1
out_def = self.unifyheads1(out_def)
out_def = q_sampled + out_def
out_def = out_def + self.mlp(self.norm2(out_def.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)) # (N, C, H, W)
#############################################################################################
######## Deformable Gathering
#############################################################################################
out_def = self.norm(out_def.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
k = self.proj_k(out_def)
v = self.proj_v(out_def)
k_pix_sel = rearrange(k, 'n (m c) h w -> (n m) c (h w)', m=self.num_heads)
v_pix_sel = rearrange(v, 'n (m c) h w -> (n m) c (h w)', m=self.num_heads)
q_pix = rearrange(q, 'n (m c) h w -> (n m) c (h w)', m=self.num_heads)
attn = torch.einsum('b c m, b c n -> b m n', q_pix, k_pix_sel) # B * h, HW, Ns
attn = attn.mul(self.scale)
### Bias
rpe_table = self.rpe_table
rpe_bias = rpe_table[None, ...].expand(N, -1, -1, -1)
q_grid = self._get_q_grid(H, W, N, dtype, device)
displacement = (q_grid.reshape(N * self.n_groups, H * W, 2).unsqueeze(2) - pos_k.reshape(N * self.n_groups, Hg*Wg, 2).unsqueeze(1)).mul(0.5)
attn_bias = F.grid_sample(
input=rearrange(rpe_bias, 'b (g c) h w -> (b g) c h w', c=self.n_group_heads, g=self.n_groups),
grid=displacement[..., (1, 0)],
mode='bilinear', align_corners=True) # B * g, h_g, HW, Ns
attn_bias = attn_bias.reshape(N * self.num_heads, H * W, Hg*Wg)
attn = attn + attn_bias
###
attn = F.softmax(attn, dim=2)
out = torch.einsum('b m n, b c n -> b c m', attn, v_pix_sel)
out = out.reshape(N,C,H,W).contiguous()
out = self.proj_out(out).permute(0,2,3,1)
#############################################################################################
# NOTE: use padding for semantic segmentation
# crop padded region
if self.auto_pad and (pad_r > 0 or pad_b > 0):
out = out[:, :H_in, :W_in, :].contiguous()
if ret_attn_mask:
return out, r_weight, r_idx, attn_weight
else:
return out
def get_pe_layer(emb_dim, pe_dim=None, name='none'):
if name == 'none':
return nn.Identity()
else:
        raise ValueError(f'PE name {name} is not supported!')
class DeBiLevelRoutingAttention(nn.Module):
def __init__(self, dim,
num_heads=8, n_win=7, qk_dim=None, qk_scale=None,
kv_per_win=4, kv_downsample_ratio=4,
kv_downsample_kernel=None, kv_downsample_mode='ada_avgpool',
topk=4, param_attention="qkvo", param_routing=False,
diff_routing=False, soft_routing=False, mlp_ratio=4, param_size='small',mlp_dwconv=False,
side_dwconv=5, before_attn_dwconv=3, pre_norm=True, auto_pad=True):
super().__init__()
qk_dim = qk_dim or dim
# modules
if before_attn_dwconv > 0:
self.pos_embed1 = nn.Conv2d(dim, dim, kernel_size=before_attn_dwconv, padding=1, groups=dim)
self.pos_embed2 = nn.Conv2d(dim, dim, kernel_size=before_attn_dwconv, padding=1, groups=dim)
else:
self.pos_embed = lambda x: 0
self.norm1 = nn.LayerNorm(dim, eps=1e-6) # important to avoid attention collapsing
self.attn1 = BiLevelRoutingAttention(dim=dim, num_heads=num_heads, n_win=n_win, qk_dim=qk_dim,
qk_scale=qk_scale, kv_per_win=kv_per_win, kv_downsample_ratio=kv_downsample_ratio,
kv_downsample_kernel=kv_downsample_kernel, kv_downsample_mode=kv_downsample_mode,
topk=1, param_attention=param_attention, param_routing=param_routing,
diff_routing=diff_routing, soft_routing=soft_routing, side_dwconv=side_dwconv,
auto_pad=auto_pad)
self.attn2 = DeBiLevelRoutingAttentionblcok(dim=dim, num_heads=num_heads, n_win=n_win, qk_dim=qk_dim,
qk_scale=qk_scale, kv_per_win=kv_per_win, kv_downsample_ratio=kv_downsample_ratio,
kv_downsample_kernel=kv_downsample_kernel, kv_downsample_mode=kv_downsample_mode,
topk=topk, param_attention=param_attention, param_routing=param_routing,
diff_routing=diff_routing, soft_routing=soft_routing, side_dwconv=side_dwconv,
auto_pad=auto_pad,param_size=param_size)
def forward(self, x):
"""
x: NCHW tensor
"""
# permute to NHWC tensor for attention & mlp
x = x.permute(0, 2, 3, 1) # (N, C, H, W) -> (N, H, W, C)
x = self.attn2(x)
# permute back
x = x.permute(0, 3, 1, 2) # (N, H, W, C) -> (N, C, H, W)
        return x
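A minimal smoke test (my own addition, not part of the original code) for the wrapper above. Note that with param_size='small' the channel dimension must be one of 64/128/256/512, because the per-stage hyper-parameters (n_groups, top_k_def, kk, stride_def, ...) are only defined for those widths:

# Quick sanity check (assumes the classes above, including the added LayerNormProxy, are in scope).
if __name__ == '__main__':
    attn = DeBiLevelRoutingAttention(dim=512, num_heads=8, n_win=7, topk=4, param_size='small')
    x = torch.randn(1, 512, 20, 20)   # NCHW feature map; auto_pad handles H, W not divisible by n_win
    y = attn(x)
    print(y.shape)                    # expected: torch.Size([1, 512, 20, 20])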
For details, see: https://blog.csdn.net/m0_63774211/article/details/149496682

The modified YOLOv13 configuration (yolov13.yaml with a DeBiLevelRoutingAttention layer appended to the backbone):

nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov13n.yaml' will call yolov13.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # Nano
  s: [0.50, 0.50, 1024] # Small
  l: [1.00, 1.00, 512] # Large
  x: [1.00, 1.50, 512] # Extra Large

backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2, 1, 2]] # 1-P2/4
  - [-1, 2, DSC3k2, [256, False, 0.25]]
  - [-1, 1, Conv, [256, 3, 2, 1, 4]] # 3-P3/8
  - [-1, 2, DSC3k2, [512, False, 0.25]]
  - [-1, 1, DSConv, [512, 3, 2]] # 5-P4/16
  - [-1, 4, A2C2f, [512, True, 4]]
  - [-1, 1, DSConv, [1024, 3, 2]] # 7-P5/32
  - [-1, 4, A2C2f, [1024, True, 1]] # 8
  - [-1, 1, DeBiLevelRoutingAttention, [1024]] # 9

head:
  - [[4, 6, 8], 2, HyperACE, [512, 8, True, True, 0.5, 1, "both"]]
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [10, 1, DownsampleConv, []]
  - [[6, 10], 1, FullPAD_Tunnel, []] # 13
  - [[4, 11], 1, FullPAD_Tunnel, []] # 14
  - [[9, 12], 1, FullPAD_Tunnel, []] # 15
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 13], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, DSC3k2, [512, True]] # 18
  - [[-1, 10], 1, FullPAD_Tunnel, []] # 19
  - [18, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 14], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, DSC3k2, [256, True]] # 22
  - [11, 1, Conv, [256, 1, 1]]
  - [[22, 23], 1, FullPAD_Tunnel, []] # 24
  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 19], 1, Concat, [1]] # cat head P4
  - [-1, 2, DSC3k2, [512, True]] # 27
  - [[-1, 10], 1, FullPAD_Tunnel, []]
  - [27, 1, Conv, [512, 3, 2]]
  - [[-1, 15], 1, Concat, [1]] # cat head P5
  - [-1, 2, DSC3k2, [1024, True]] # 31 (P5/32-large)
  - [[-1, 12], 1, FullPAD_Tunnel, []]
  - [[24, 28, 32], 1, Detect, [nc]] # Detect(P3, P4, P5)
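To train with this configuration, a typical Ultralytics-style call looks like the sketch below. It assumes that DeBiLevelRoutingAttention (and its helper classes) have been added to ultralytics/nn/modules/block.py and registered in ultralytics/nn/tasks.py, and that the YAML above has been saved under a name of your choice (yolov13-DBRA.yaml is used here as a hypothetical example):

# Minimal training sketch; adjust data / imgsz / epochs / batch to your own setup.
from ultralytics import YOLO

model = YOLO('yolov13-DBRA.yaml')   # build the modified network from the config above
model.train(data='coco128.yaml', imgsz=640, epochs=100, batch=16)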
Original-work statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
For infringement concerns, please contact cloudcommunity@tencent.com for removal.