CVPR 2020 (Oral) | Facebook proposes X3D: a new ultra-lightweight model for video understanding / action recognition

Amusi

Published on 2020-04-23 10:15:09

This article is included in the column: CVer

Author: 木石 https://zhuanlan.zhihu.com/p/129279351 Republished with the original author's permission; do not repost without authorization.

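At CVPR 2020, Facebook AI Research (FAIR) proposed X3D, a lightweight action recognition model that matches the previous SOTA while using 4.8× fewer multiply-adds (FLOPs) and 5.5× fewer parameters.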

Paper: https://arxiv.org/abs/2004.04730

Code link (not yet open-sourced):

https://github.com/facebookresearch/SlowFast

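PS: This work has a single author, Christoph Feichtenhofer, which is seriously impressive! The acknowledgements (the support crew), however, list the following people (trembling.jpg):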

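The paper expands a 2D conv network along several axes. The basic idea extends EfficientNet to 3D convolutional networks by adjusting the following factors: video clip length, frame rate, spatial feature resolution, width, and depth.
Inspired by feature selection methods in ML, it designs a stepwise network expansion approach: in each step, every axis is expanded on its own and one model is trained per axis, and the axis whose expansion works best is selected. This greatly reduces the complexity of the search.
As in coordinate descent, only a single axis is expanded at a time.
The final model is very thin (especially since it uses channel-wise separable convolution), and the width of its blocks is very small.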
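
To make the "very thin" remark concrete, the short sketch below counts the weights of a dense 3×3×3 convolution against a channel-wise (depthwise) one. The channel count of 54 is a hypothetical value chosen only for illustration, and the helper `conv3d_weight_count` is not from the paper or its code.

```python
def conv3d_weight_count(c_in, c_out, kernel=(3, 3, 3), groups=1):
    """Number of weights (no bias) in a 3D convolution with the given grouping."""
    kt, kh, kw = kernel
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * kt * kh * kw

channels = 54  # hypothetical block width, for illustration only
dense = conv3d_weight_count(channels, channels)
channel_wise = conv3d_weight_count(channels, channels, groups=channels)
print(dense, channel_wise, dense // channel_wise)  # 78732 1458 54
```

With `groups` equal to the number of channels, the weight count drops by a factor equal to the channel count, which is why such a block stays cheap even at high spatiotemporal resolution.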

Abstract

This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth. Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that good accuracy to complexity trade-off is achieved. To expand X3D to a specific target complexity, we perform progressive forward expansion followed by backward contraction. X3D achieves state-of-the-art performance while requiring 4.8× and 5.5× fewer multiply-adds and parameters for similar accuracy as previous work.

In short: the model is expanded from the 2D spatial domain into the 3D spacetime domain.

Coordinate Descent

A few points about coordinate descent are worth noting:

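The order of the coordinate axes is arbitrary; any permutation of {1,2,...,n} can be used.
Unlike gradient descent, coordinate descent does not need the gradient of the objective function; each iteration only performs a line search along a single dimension. However, coordinate descent is only suited to smooth functions: on a non-smooth function it can get stuck at a non-stationary point and stop updating.
Strict coordinate descent searches for the minimum along a single coordinate axis at a time. Its counterpart is block coordinate descent, which minimizes along several coordinate axes (a hyperplane) at a time: it optimizes a subset of the variables simultaneously and so decomposes the original problem into several subproblems. The update order during descent can be deterministic or random; with a deterministic order, the subset to update can be chosen by some designed rule, such as a cyclic or a greedy scheme.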
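
The points above can be sketched in a few lines of Python. This is a minimal illustration, not anything from the X3D paper: the quadratic `objective`, the search `radius`, and the helper names are all assumptions chosen for the example. It uses a cyclic axis order and a derivative-free golden-section line search along one axis at a time.

```python
import math

def golden_section_min(f, lo, hi, tol=1e-8):
    """Minimize a unimodal 1-D function on [lo, hi] without derivatives."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2

def coordinate_descent(f, x0, n_sweeps=50, radius=10.0):
    """Cyclic coordinate descent: one line search per axis, no gradients."""
    x = list(x0)
    for _ in range(n_sweeps):
        for i in range(len(x)):  # cyclic order; any permutation would do
            def along_axis(t, i=i):
                y = x.copy()
                y[i] = t
                return f(y)
            x[i] = golden_section_min(along_axis, x[i] - radius, x[i] + radius)
    return x

def objective(x):
    # Smooth, strictly convex toy quadratic with its minimum at (1, -2).
    return (x[0] - 1) ** 2 + (x[1] + 2) ** 2 + 0.5 * (x[0] - 1) * (x[1] + 2)

x_min = coordinate_descent(objective, [5.0, 5.0])
print(x_min)  # close to [1.0, -2.0]
```

Because the toy objective is smooth and convex, each sweep shrinks the error; on a non-smooth objective the same loop could stall, as noted above.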

Consider an optimization task:

A general framework for block coordinate descent is shown in the figure below:

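Feature selection

Forward selection: greedily find the relevant features that improve performance, adding one feature per step.
Backward elimination: start from the full feature set and repeatedly delete the feature whose removal reduces performance the least.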

find relevant features to improve in a greedy fashion by including (forward selection) a single feature in each step, or start with a full set of features and aim to find irrelevant ones that are excluded by repeatedly deleting the feature that reduces performance the least (backward elimination).
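
As a rough illustration of the two greedy strategies quoted above, the sketch below runs both on a toy problem. The feature names, the `GAIN` table, and the additive `toy_score` are hypothetical; in practice the score would come from training and validating a model on each candidate subset.

```python
def forward_selection(features, score, k):
    """Start empty; in each step add the single feature that helps most."""
    selected = []
    while len(selected) < k:
        remaining = [f for f in features if f not in selected]
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected

def backward_elimination(features, score, k):
    """Start full; in each step drop the feature whose removal hurts least."""
    kept = list(features)
    while len(kept) > k:
        drop = max(kept, key=lambda f: score([g for g in kept if g != f]))
        kept.remove(drop)
    return kept

# Hypothetical per-feature usefulness and a toy additive score.
GAIN = {"a": 0.30, "b": 0.05, "c": 0.20, "d": 0.01}

def toy_score(subset):
    return sum(GAIN[f] for f in subset)

fwd = forward_selection(list(GAIN), toy_score, k=2)
bwd = backward_elimination(list(GAIN), toy_score, k=2)
print(fwd, bwd)  # ['a', 'c'] ['a', 'c']
```

With an additive score both directions agree; with interacting features they can return different subsets, which is one reason the choice of direction matters.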

X2D baseline

Expansion operations

The paper defines the following expansion operations:

Progressive Network Expansion

Forward expansion

Definition

The cost of expansion is very small:

expansion is simple and cheap e.g. our low-compute model is completed after only training 30 tiny models that accumulatively require over 25× fewer multiply-add operations for training than one large state-of-the-art network
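
One forward-expansion loop can be sketched as follows. Everything numeric here is an assumption for illustration: the four axis names, the toy cost model (cost ~ t · s² · w² · d), and `toy_accuracy`, which stands in for "train the candidate and measure its accuracy". Only the overall shape, one candidate per axis per step with the best one kept, follows the description in this article.

```python
import math

# Exponent with which each axis enters a toy cost model (an assumption,
# not the paper's FLOP count): cost ~ t * s^2 * w^2 * d.
COST_EXPONENT = {"temporal": 1, "spatial": 2, "width": 2, "depth": 1}

def toy_cost(cfg):
    return math.prod(cfg[a] ** e for a, e in COST_EXPONENT.items())

def toy_accuracy(cfg):
    # Hypothetical stand-in for training a candidate model:
    # every axis helps, with saturating returns.
    weight = {"temporal": 1.0, "spatial": 1.5, "width": 0.8, "depth": 1.2}
    return sum(w * (1 - math.exp(-0.5 * cfg[a])) for a, w in weight.items())

def forward_expansion(cfg, n_steps, rate=2.0):
    """Each step: expand every axis separately, keep the best candidate."""
    cfg = dict(cfg)
    history, trained = [], 0
    for _ in range(n_steps):
        candidates = {}
        for axis, exponent in COST_EXPONENT.items():
            cand = dict(cfg)
            # Scale this axis so that the total cost grows by `rate`.
            cand[axis] *= rate ** (1.0 / exponent)
            candidates[axis] = cand
            trained += 1  # one small model "trained" per axis
        best = max(candidates, key=lambda a: toy_accuracy(candidates[a]))
        cfg = candidates[best]
        history.append(best)
    return cfg, history, trained

base = {a: 1.0 for a in COST_EXPONENT}
final_cfg, history, trained = forward_expansion(base, n_steps=5)
print(history, trained, round(toy_cost(final_cfg), 3))
```

The number of models trained grows only as steps × axes, rather than with the size of a full grid over all axes, which is where the low search cost comes from.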

Backward contraction

If the expanded model exceeds the target complexity (GFLOPs), the expansion rate of the last step is reduced slightly, for example to a little less than 2.
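
A minimal sketch of this contraction, assuming the same kind of toy cost model in which an axis enters the cost with a known exponent: the helper `contract_last_step` and the GFLOPs values are hypothetical, and the paper's actual procedure may differ in detail.

```python
def contract_last_step(cost_before, exponent, target_cost, rate=2.0):
    """Per-axis factor for the last step so the cost does not exceed the target.

    cost_before: cost of the model before the last expansion step.
    exponent:    power with which the expanded axis enters the cost model.
    """
    # Use the full rate if it fits, otherwise shrink it to hit the target.
    allowed_rate = min(rate, target_cost / cost_before)
    return allowed_rate ** (1.0 / exponent)

# Hypothetical numbers: doubling 3.0 GFLOPs would overshoot a 5.0 GFLOPs target.
factor = contract_last_step(cost_before=3.0, exponent=2, target_cost=5.0)
final_cost = 3.0 * factor ** 2
print(round(factor, 4), round(final_cost, 4))  # 1.291 5.0
```

When the full doubling still fits under the target, the function simply returns the unmodified per-axis factor.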

Results of the expansion

An expansion of the depth after increasing input resolution is intuitive, as it allows to grow the filter receptive field resolution and size within each residual stage.
A surprising finding of our progressive expansion is that networks with thin channel dimension and high spatiotemporal resolution can be effective for video recognition.