[Source of Living Water] AlphaGo Zero: A Technical Walkthrough

马上科普尚尚

Published 2021-03-17 10:56:23

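"How does the pond stay so clear? Because living water keeps flowing in from its source." Studying frontier fields and drawing inspiration from other research areas gives us a clearer understanding of the essence of a research problem, and it is an inexhaustible source of self-improvement. To that end, we curate paper reading notes in the "Source of Living Water" column to help you read the research literature broadly and deeply. Stay tuned.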

Author: sansi2014 (Zhihu)

Link: https://www.zhihu.com/people/sansi2014

"I foresaw all the sorrow, yet I am still willing to go." (Arrival)

This is the first Go algorithm that can beat humans without using any knowledge from human game records.

What

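With no human knowledge available, knowledge can only accumulate through self-play. The AlphaGo paper already noted that a policy-gradient agent can improve itself through self-play; however, perhaps because that was not efficient enough or for other reasons, DeepMind chose (or discovered) to combine MCTS with a neural network for both self-play and policy improvement.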

This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. We introduce a new reinforcement learning algorithm that incorporates look ahead search inside the training loop, resulting in rapid improvement and precise and stable learning.

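The overall algorithm is much simpler: the rollout policy is gone, the policy network and value network now share one backbone, and the main MCTS steps are kept (with some changes in detail). For comparison, my walkthrough of AlphaGo is here: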

https://zhuanlan.zhihu.com/p/351108250

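Although they are still called the policy network and the value network, their training no longer has much to do with conventional policy gradients or the Bellman equation. Judging from the loss function, training has become supervised learning.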
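To make the supervised-learning point concrete, here is a minimal sketch of the loss given in the AlphaGo Zero paper, l = (z - v)^2 - π^T log p + c‖θ‖^2, where π (the MCTS search probabilities) and z (the self-play game outcome) act as labels. The function name `alphago_zero_loss`, its argument names, and the default value of `c` are illustrative assumptions, not the authors' code.

```python
import math

def alphago_zero_loss(p, v, pi, z, theta, c=1e-4):
    """Loss for a single position (illustrative sketch).

    p     : network move probabilities (a list summing to 1)
    v     : network value estimate in [-1, 1]
    pi    : MCTS search probabilities, used as the policy label
    z     : self-play game outcome (+1 or -1), used as the value label
    theta : flattened network parameters (hypothetical stand-in)
    c     : L2 regularisation weight (assumed default)
    """
    value_loss = (z - v) ** 2                       # squared error on the outcome
    policy_loss = -sum(t * math.log(q + 1e-12)      # cross-entropy against pi
                       for t, q in zip(pi, p))
    l2 = c * sum(w * w for w in theta)              # weight decay term
    return value_loss + policy_loss + l2
```

Both targets are fixed by the time the loss is computed, so neither term needs a policy-gradient estimator or a bootstrapped Bellman target.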

Why

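Improvement: MCTS is a look-ahead search algorithm. The more accurate the information it is given, the more efficient its search and the more accurate its result. Even when guided by a random policy, MCTS still obtains useful information, since being able to see the future is always an advantage; a random policy simply makes the search less efficient.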

Data: MCTS can therefore always produce self-play data that is better than what the underlying policy would produce on its own.

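Positive feedback: the neural network can improve itself by supervised learning on the better data from MCTS self-play. In the next iteration it supplies MCTS with a better policy, which closes the loop and creates positive feedback.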
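The three points above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the "game" has a single move with three options and hidden win probabilities (`win_prob`), the "network" is just a list of logits, and `search` is a PUCT-style stand-in for MCTS rather than a real tree search.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def search(prior, win_prob, n_sim, rng, c_puct=1.0):
    """Toy stand-in for MCTS: PUCT-style selection guided by the prior.
    Returns normalised visit counts, i.e. the search policy."""
    k = len(prior)
    visits = [0] * k
    total_value = [0.0] * k
    for _ in range(n_sim):
        total = sum(visits)

        def score(a):
            q = total_value[a] / visits[a] if visits[a] else 0.0
            u = c_puct * prior[a] * math.sqrt(total + 1) / (1 + visits[a])
            return q + u

        a = max(range(k), key=score)
        # Simulated game outcome for the chosen move: +1 win, -1 loss.
        outcome = 1.0 if rng.random() < win_prob[a] else -1.0
        visits[a] += 1
        total_value[a] += outcome
    return [n / n_sim for n in visits]

def train_step(logits, target, lr=0.5):
    """One cross-entropy gradient step pulling softmax(logits) toward target."""
    p = softmax(logits)
    return [x + lr * (t - q) for x, t, q in zip(logits, target, p)]

def run(iterations=30, seed=0):
    rng = random.Random(seed)
    win_prob = [0.2, 0.5, 0.8]      # hidden from the "network"
    logits = [0.0, 0.0, 0.0]        # starts as a uniform random policy
    history = []
    for _ in range(iterations):
        prior = softmax(logits)                              # current policy
        pi = search(prior, win_prob, n_sim=200, rng=rng)     # improved policy
        logits = train_step(logits, pi)                      # imitate the search
        history.append(softmax(logits)[2])                   # prob. of best move
    return history
```

Calling `run()` returns the probability the toy policy assigns to the best move after each iteration; in this setup it should climb from the uniform starting point as the search output is fed back into the policy.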

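The cycle repeats and the agent keeps improving. The excerpt below, taken from the paper, explains how MCTS maps onto policy improvement and policy evaluation; note that the search policy is described as "much stronger".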

The AlphaGo Zero self-play algorithm can similarly be understood as an approximate policy iteration scheme in which MCTS is used for both policy improvement and policy evaluation. Policy improvement starts with a neural network policy, executes an MCTS based on that policy's recommendations, and then projects the (much stronger) search policy back into the function space of the neural network. Policy evaluation is applied to the (much stronger) search policy: the outcomes of self-play games are also projected back into the function space of the neural network. These projection steps are achieved by training the neural network parameters to match the search probabilities and self-play game outcome respectively.
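As a rough sketch of the two projection targets described in the excerpt: the paper defines the search probabilities as proportional to exponentiated visit counts, π_a ∝ N(s, a)^(1/τ), and labels each position with the final outcome from the perspective of the player to move. The function names, the list-based data layout, and the convention that the first player moves at even steps are assumptions made for this example.

```python
def search_probabilities(visit_counts, temperature=1.0):
    """Turn root visit counts N(s, a) into pi_a proportional to N^(1/temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    powered = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(powered)
    return [x / total for x in powered]

def make_training_targets(visit_counts_per_move, winner):
    """Build (pi, z) targets for every position of one self-play game.

    visit_counts_per_move : list of root visit counts, one entry per move
    winner                : +1 if the first player won, -1 if the second did
    Assumes players alternate and the first player moves at t = 0.
    """
    targets = []
    for t, counts in enumerate(visit_counts_per_move):
        player = 1 if t % 2 == 0 else -1
        pi = search_probabilities(counts)   # policy target
        z = winner * player                 # value target for the player to move
        targets.append((pi, z))
    return targets
```

For example, `make_training_targets([[10, 30, 60], [5, 5]], winner=1)` labels the first position with z = +1 and the second with z = -1, since the second position is seen from the losing player's side.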

How

Figure 1 in the original paper is too simple, so I redrew it based on my own understanding: