[Paper Recommendations] Six Recent Reinforcement Learning Papers: Sublinear Reinforcement Learning, Machine Reading Comprehension, Accelerated Reinforcement Learning, Adversarial Reward Learning, and Human-Robot Interaction

By WZEARW · Published 2018-06-05 · From the 专知 (Zhuanzhi) column

[Overview] The Zhuanzhi content team has compiled six recent papers related to reinforcement learning and introduces them below. Enjoy!

1. Multiagent Soft Q-Learning



Authors: Ermo Wei, Drew Wicke, David Freelan, Sean Luke

Affiliation: George Mason University

Abstract: Policy gradient methods are often applied to reinforcement learning in continuous multiagent games. These methods perform local search in the joint-action space, and as we show, they are susceptible to a game-theoretic pathology known as relative overgeneralization. To resolve this issue, we propose Multiagent Soft Q-learning, which can be seen as the analogue of applying Q-learning to continuous controls. We compare our method to MADDPG, a state-of-the-art approach, and show that our method achieves better coordination in multiagent cooperative tasks, converging to better local optima in the joint action space.

Source: arXiv, 26 April 2018

Link:

http://www.zhuanzhi.ai/document/8f0ac64488ab8fa83554b4da6cc2f69d
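
The "soft" in Multiagent Soft Q-learning refers to the maximum-entropy (energy-based) Bellman backup that the method builds on. As a rough orientation only, here is a minimal tabular, discrete-action sketch of that backup in Python; the paper itself handles continuous actions and multiple agents, and all names and constants below are illustrative, not the authors' code.

```python
import numpy as np

ALPHA = 0.1   # temperature: higher values give a softer, more exploratory policy
GAMMA = 0.99  # discount factor
LR = 0.1      # learning rate

def soft_value(q_row, alpha=ALPHA):
    # Soft state value V(s) = alpha * logsumexp(Q(s, .) / alpha), computed stably.
    z = q_row / alpha
    m = z.max()
    return alpha * (m + np.log(np.exp(z - m).sum()))

def soft_policy(q_row, alpha=ALPHA):
    # Energy-based policy pi(a|s) proportional to exp(Q(s, a) / alpha).
    z = q_row / alpha
    p = np.exp(z - z.max())
    return p / p.sum()

def soft_q_update(Q, s, a, r, s_next, done, gamma=GAMMA, lr=LR):
    # One-step soft Bellman backup of Q(s, a) toward r + gamma * V_soft(s').
    target = r + (0.0 if done else gamma * soft_value(Q[s_next]))
    Q[s, a] += lr * (target - Q[s, a])
```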

2. Variance Reduction Methods for Sublinear Reinforcement Learning



Authors: Sham Kakade, Mengdi Wang, Lin F. Yang

Affiliations: Princeton University, University of Washington

Abstract: This work considers the problem of provably optimal reinforcement learning for episodic finite horizon MDPs, i.e. how an agent learns to maximize his/her long term reward in an uncertain environment. The main contribution is in providing a novel algorithm --- Variance-reduced Upper Confidence Q-learning (vUCQ) --- which enjoys a regret bound of $\widetilde{O}(\sqrt{HSAT} + H^5SA)$, where $T$ is the number of time steps the agent acts in the MDP, $S$ is the number of states, $A$ is the number of actions, and $H$ is the (episodic) horizon time. This is the first regret bound that is both sub-linear in the model size and asymptotically optimal. The algorithm is sub-linear in that the time to achieve $\epsilon$-average regret for any constant $\epsilon$ is $O(SA)$, which is a number of samples that is far less than that required to learn any non-trivial estimate of the transition model (the transition model is specified by $O(S^2A)$ parameters). The importance of sub-linear algorithms is largely the motivation for algorithms such as $Q$-learning and other "model free" approaches. The vUCQ algorithm also enjoys minimax optimal regret in the long run, matching the $\Omega(\sqrt{HSAT})$ lower bound. Variance-reduced Upper Confidence Q-learning (vUCQ) is a successive refinement method in which the algorithm reduces the variance in $Q$-value estimates and couples this estimation scheme with an upper confidence based algorithm. Technically, the coupling of both of these techniques is what leads to the algorithm enjoying both the sub-linear regret property and the asymptotically optimal regret.

Source: arXiv, 26 April 2018

Link:

http://www.zhuanzhi.ai/document/298d70f33245af2313394e0f6de96a73
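
To give a feel for the "upper confidence" ingredient that vUCQ couples with variance reduction, the sketch below shows a plain optimistic Q-learning backup for an episodic tabular MDP. The variance-reduction machinery that yields the sharper $\widetilde{O}(\sqrt{HSAT} + H^5SA)$ bound is deliberately omitted; the step size and the form of the bonus term are illustrative assumptions, not the paper's exact quantities.

```python
import numpy as np

def ucb_q_update(Q, V, counts, h, s, a, r, s_next, H, c=1.0, delta=0.05):
    # Q: (H, S, A) optimistic action values; V: (H + 1, S) optimistic state values
    # with V[H] fixed at zero; counts: (H, S, A) visit counts. h, s, a, s_next are
    # integer indices for the current step, state, action and next state.
    counts[h, s, a] += 1
    t = counts[h, s, a]
    lr = (H + 1) / (H + t)                                 # step size used in UCB Q-learning analyses
    bonus = c * np.sqrt(H ** 3 * np.log(1.0 / delta) / t)  # optimism bonus (illustrative form)
    target = r + V[h + 1, s_next] + bonus
    Q[h, s, a] = (1 - lr) * Q[h, s, a] + lr * target
    V[h, s] = min(H, Q[h, s].max())                        # optimistic values never exceed the horizon
```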

3. Reinforced Mnemonic Reader for Machine Reading Comprehension



Authors: Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, Ming Zhou

Affiliations: Fudan University, National University of Defense Technology

Abstract: In this paper, we introduce the Reinforced Mnemonic Reader for machine reading comprehension tasks, which enhances previous attentive readers in two aspects. First, a reattention mechanism is proposed to refine current attentions by directly accessing past attentions that are temporally memorized in a multi-round alignment architecture, so as to avoid the problems of attention redundancy and attention deficiency. Second, a new optimization approach, called dynamic-critical reinforcement learning, is introduced to extend the standard supervised method. It always encourages the model to predict a more acceptable answer, so as to address the convergence suppression problem that occurs in traditional reinforcement learning algorithms. Extensive experiments on the Stanford Question Answering Dataset (SQuAD) show that our model achieves state-of-the-art results. Meanwhile, our model outperforms previous systems by over 6% in terms of both Exact Match and F1 metrics on two adversarial SQuAD datasets.

Source: arXiv, 25 April 2018

Link:

http://www.zhuanzhi.ai/document/37c93c1bdb68c3559d4f5f1740093d7d
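
The reattention idea is that similarity scores computed in the current alignment round are refined with the attention distributions memorized from the previous round, which counters both attention redundancy and attention deficiency. The snippet below is a loose, hypothetical rendering of that refinement; the shapes and the mixing rule are assumptions and do not reproduce the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def reattention(E_cur, E_prev, B_prev, gamma=0.5):
    # E_cur, E_prev: query-to-context similarity scores from the current and the
    # previous alignment round, shape (m, n). B_prev: context self-alignment
    # scores from the previous round, shape (n, n). gamma weights the indirect
    # evidence and its value here is arbitrary.
    indirect = softmax(E_prev) @ softmax(B_prev)   # attend to what previously attended words attended to
    refined = E_cur + gamma * indirect
    return softmax(refined)                        # refined attention over the context, shape (m, n)
```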

4. Accelerated Reinforcement Learning



Authors: K. Lakshmanan

Abstract: Policy gradient methods are widely used in reinforcement learning algorithms to search for better policies in the parameterized policy space. They perform gradient search in the policy space and are known to converge very slowly. Nesterov developed an accelerated gradient search algorithm for convex optimization problems, which has recently been extended to non-convex and stochastic optimization. We use Nesterov's acceleration for policy gradient search in the well-known actor-critic algorithm and show convergence using the ODE method. We tested this algorithm on a scheduling problem in which an incoming job is scheduled into one of four queues based on the queue lengths. Experimental results show that the algorithm using Nesterov's acceleration performs significantly better than the algorithm that does not use acceleration. To the best of our knowledge, this is the first time Nesterov's acceleration has been used with an actor-critic algorithm.

Source: arXiv, 25 April 2018

Link:

http://www.zhuanzhi.ai/document/23e73fe759219ab8d58317acce28dc5f
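
The core proposal is to replace the plain gradient step on the actor's parameters with a Nesterov-accelerated step while the critic supplies the gradient estimate. Below is a minimal sketch of such an accelerated actor update, with the critic and the policy parameterization left abstract; the names, constants and the toy objective in the usage lines are illustrative assumptions.

```python
import numpy as np

def nesterov_actor_step(theta, velocity, grad_fn, lr=1e-2, momentum=0.9):
    # grad_fn(theta) should return an estimate of the policy gradient at theta,
    # e.g. an advantage-weighted score-function estimate supplied by the critic.
    lookahead = theta + momentum * velocity   # evaluate the gradient where momentum would carry us
    g = grad_fn(lookahead)
    velocity = momentum * velocity + lr * g   # gradient *ascent* on the expected return
    theta = theta + velocity
    return theta, velocity

# Usage on a toy concave objective J(theta) = -||theta - 3||^2 / 2, whose gradient is 3 - theta:
theta, velocity = np.zeros(4), np.zeros(4)
for _ in range(200):
    theta, velocity = nesterov_actor_step(theta, velocity, lambda th: 3.0 - th)
```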

5. No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling



Authors: Xin Wang, Wenhu Chen, Yuan-Fang Wang, William Yang Wang

Affiliation: University of California

Abstract: Though impressive results have been achieved in visual captioning, the task of generating abstract stories from photo streams is still a little-tapped problem. Different from captions, stories have more expressive language styles and contain many imaginary concepts that do not appear in the images; this poses challenges for behavioral cloning algorithms. Furthermore, due to the limitations of automatic metrics in evaluating story quality, reinforcement learning methods with hand-crafted rewards also face difficulties in gaining an overall performance boost. Therefore, we propose an Adversarial REward Learning (AREL) framework to learn an implicit reward function from human demonstrations, and then optimize policy search with the learned reward function. Though automatic evaluation indicates a slight performance boost over state-of-the-art (SOTA) methods in cloning expert behaviors, human evaluation shows that our approach achieves significant improvement in generating more human-like stories than SOTA systems.

Source: arXiv, 25 April 2018

Link:

http://www.zhuanzhi.ai/document/5dc12a9a8438755a167e2aa4a12f3fff
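
AREL alternates between fitting a reward model that separates human demonstrations from policy samples and optimizing the policy against that learned reward. The self-contained toy loop below is only a cartoon of that alternation: it uses made-up feature vectors, a linear reward model and a softmax policy over a handful of fixed candidates to show the training pattern, and bears no resemblance to the paper's neural storytelling model.

```python
import numpy as np

rng = np.random.default_rng(0)

D, K = 8, 5                                  # feature dimension, number of candidate outputs
candidates = rng.normal(size=(K, D))         # features of K machine-generated "stories"
human = rng.normal(loc=1.0, size=(32, D))    # features of 32 "human demonstrations"

w = np.zeros(D)       # linear reward model: r(x) = w . x
theta = np.zeros(K)   # policy logits over the K candidates

for _ in range(500):
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    picked = rng.choice(K, p=probs)

    # Reward-learning step: raise the learned reward on human demonstrations and
    # lower it on the policy's own sample (a discriminator-like update).
    w += 0.01 * (human.mean(axis=0) - candidates[picked])

    # Policy-search step: REINFORCE with the learned reward as the return and the
    # expected reward under the current policy as a baseline.
    r = candidates @ w
    advantage = r[picked] - probs @ r
    grad_logp = -probs
    grad_logp[picked] += 1.0
    theta += 0.05 * advantage * grad_logp
```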

6. Neural Network Based Reinforcement Learning for Audio-Visual Gaze Control in Human-Robot Interaction



Authors: Stéphane Lathuilière, Benoit Massé, Pablo Mesejo, Radu Horaud

Abstract: This paper introduces a novel neural network-based reinforcement learning approach for robot gaze control. Our approach enables a robot to learn and to adapt its gaze control strategy for human-robot interaction with neither external sensors nor human supervision. The robot learns to focus its attention onto groups of people from its own audio-visual experiences, independently of the number of people, their positions and their physical appearances. In particular, we use a recurrent neural network architecture in combination with Q-learning to find an optimal action-selection policy; we pre-train the network using a simulated environment that mimics realistic scenarios involving speaking/silent participants, thus avoiding the need for tedious sessions of a robot interacting with people. Our experimental evaluation suggests that the proposed method is robust against parameter estimation, i.e. the parameter values yielded by the method do not have a decisive impact on the performance. The best results are obtained when both audio and visual information are jointly used. Experiments with the Nao robot indicate that our framework is a step forward towards the autonomous learning of socially acceptable gaze behavior.

Source: arXiv, 23 April 2018

Link:

http://www.zhuanzhi.ai/document/1a62223dcd16d5d4af934daaba9c11b6
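
The method couples a recurrent network, which summarizes the audio-visual observation history, with Q-learning over discrete gaze actions. The sketch below shows that combination at its simplest: a hand-rolled recurrent state update, a linear Q-value head, epsilon-greedy action selection and the one-step Q-learning target. The sizes, the tanh cell and all parameter names are assumptions for illustration, not the paper's architecture or training setup.

```python
import numpy as np

OBS_DIM, HID_DIM, N_ACTIONS = 16, 32, 5   # e.g. a few discrete pan/tilt gaze commands
GAMMA, EPS = 0.95, 0.1

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(HID_DIM, OBS_DIM))
W_rec = rng.normal(scale=0.1, size=(HID_DIM, HID_DIM))
W_q = rng.normal(scale=0.1, size=(N_ACTIONS, HID_DIM))

def step(h, obs):
    # Fold one audio-visual observation into the recurrent state.
    return np.tanh(W_in @ obs + W_rec @ h)

def q_values(h):
    return W_q @ h

def select_action(h, eps=EPS):
    # Epsilon-greedy action selection over the gaze commands.
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_values(h)))

def td_target(reward, h_next, done):
    # One-step Q-learning target r + gamma * max_a Q(h', a).
    return reward if done else reward + GAMMA * q_values(h_next).max()

# Usage: advance the state with a (random) observation and pick a gaze action.
h = step(np.zeros(HID_DIM), rng.normal(size=OBS_DIM))
action = select_action(h)
```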

-END-

Originally published 2018-04-28 by the 专知 (Zhuanzhi) WeChat public account and shared through the Tencent Cloud self-media syndication programme. For takedown requests, contact cloudcommunity@tencent.com.
