
Solving the Forgetting Problem after Curiosity-Driven Exploration in RL

CreateAMind · Published 2019-03-06

Amplifying the Imitation Effect for Reinforcement Learning of UCAV’s Mission Execution

Gyeong Taek Lee, Chang Ouk Kim

Abstract

This paper proposes a new reinforcement learning (RL) algorithm that enhances exploration by amplifying the imitation effect (AIE). This algorithm consists of self-imitation learning and random network distillation algorithms. We argue that these two algorithms complement each other and that combining these two algorithms can amplify the imitation effect for exploration. In addition, by adding an intrinsic penalty reward to the state that the RL agent frequently visits and using replay memory for learning the feature state when using an exploration bonus, the proposed approach leads to deep exploration and deviates from the current converged policy. We verified the exploration performance of the algorithm through experiments in a two-dimensional grid environment. In addition, we applied the algorithm to a simulated environment of unmanned combat aerial vehicle (UCAV) mission execution, and the empirical results show that AIE is very effective for finding the UCAV’s shortest flight path to avoid an enemy’s missiles.

4. AIE

4.1. Combining SIL and RND

In this section, we explain why combining RND and SIL can amplify the imitation effect and lead to deep exploration. SIL updates the networks only when the past return R is greater than the current value estimate V_θ, thereby imitating past good decisions. Intuitively, if we combine SIL and RND, the (R − V_θ) value is larger than with SIL alone because of the exploration bonus. In the process of optimizing the actor-critic network to maximize R_t = Σ_{k=t}^{∞} γ^{k−t}(i_k + e_k), where i_k is the intrinsic reward and e_k is the extrinsic reward, the increase in the intrinsic reward produced by the predictor network causes R to increase. That is, learning progresses by weighting the good decisions of the past, and this type of learning thoroughly reviews the learning history. If the policy starts to converge as learning progresses, the intrinsic reward i_t will be lower for states that were frequently visited. One might think that learning could slow down because (R_t − V_θ) > (R_{t+k} − V_θ), where k > 0, for the same state as i_t decreases. However, SIL exploits past good decisions and leads to deep exploration, and adding an exploration bonus lets the agent further explore novel states; consequently, the exploration bonus is likely to continue to occur. In addition, when using prioritized experience replay (Schaul et al., 2015), the sampling probability is determined by (R − V_θ); thus, there is a high probability that SIL will exploit a previous transition even if i_t decreases. In other words, the two algorithms are complementary to each other, and SIL is immune to the phenomenon in which the prediction error (i_t) no longer occurs.
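To make the combination concrete, here is a minimal sketch (PyTorch, not the authors' released code; `policy_net`, `value_net`, the joint `optimizer`, and the distribution-style policy head are illustrative assumptions) of how a return R_t = Σ_{k=t}^{∞} γ^{k−t}(i_k + e_k) that already contains the RND bonus can feed the SIL update, so that only transitions with R > V_θ contribute and the clipped advantage (R − V_θ)+ can double as the prioritized-replay priority mentioned above.

```python
# Sketch only: SIL update driven by a return that mixes extrinsic reward and the RND bonus.
import torch
import torch.nn.functional as F

def discounted_return(ext_rewards, int_rewards, gamma=0.99):
    """R_t = sum_{k>=t} gamma^(k-t) * (i_k + e_k), computed backwards over one episode."""
    R, returns = 0.0, []
    for e, i in zip(reversed(ext_rewards), reversed(int_rewards)):
        R = (e + i) + gamma * R
        returns.append(R)
    return list(reversed(returns))

def sil_update(policy_net, value_net, optimizer, states, actions, returns, beta=0.01):
    """Self-imitation step: only transitions with R > V_theta contribute to the loss."""
    R = torch.as_tensor(returns, dtype=torch.float32)
    v = value_net(states).squeeze(-1)
    advantage = torch.clamp(R - v, min=0.0)            # max(R - V, 0)
    log_prob = policy_net(states).log_prob(actions)    # assumes a torch.distributions policy head
    policy_loss = -(log_prob * advantage.detach()).mean()
    value_loss = 0.5 * (advantage ** 2).mean()
    loss = policy_loss + beta * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return advantage.detach()   # (R - V)+ can be reused as the replay priority
```

Because the advantage is clamped at zero, transitions whose return falls below the current value estimate simply drop out of the self-imitation update instead of pushing the policy in the wrong direction.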

4.2. Intrinsic Penalty Reward

Adding an exploration bonus to a novel state that the agent visits is clearly an effective exploration method. However, when the policy and predictor networks converge, there is no longer an exploration bonus for the novel state. In other words, the exploration bonus method provides a reward when the agent itself performs an unexpected action, not when the agent is induced to take the unexpected action. Therefore, an exploration method that entices the agent to take unexpected behavior is necessary. We propose a method that provides an intrinsic penalty reward for an action when the agent frequently visits the same state, rather than rewarding the agent when it makes an unexpected action. The intrinsic penalty reward allows the agent to escape from the converged local policy and helps it experience diverse policies. Specifically, we provide a penalty by transforming the current intrinsic reward into λ·log(i_t), where λ is a penalty weight parameter, if the current intrinsic reward is less than the quantile α of the past N intrinsic rewards. This reward mechanism prevents the agent from staying with the same policy. In addition, adding a penalty to the intrinsic reward indirectly amplifies the imitation effect: since (R_t − V_θ) becomes smaller due to the penalty, the probability of sampling a penalized transition from replay memory is relatively smaller than that of a non-penalized transition, so SIL updates are more likely to exploit non-penalized transitions. Even if (R_t − V_θ) < 0 due to a penalty, this does not affect SIL, because such transitions are not updated under the SIL objective in Equation 4. In other words, the intrinsic penalty reward allows the policy network to deviate from states the agent constantly visits and indirectly amplifies the imitation effect for SIL.
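A minimal sketch of this penalty rule follows; it is not the authors' implementation, and the names (`IntrinsicPenalty`, `window`, `alpha`, `lam`) as well as the small epsilon inside the log (added for numerical safety) are assumptions.

```python
# Sketch only: penalize the intrinsic reward when it falls below the alpha-quantile
# of the last N intrinsic rewards; log of a small prediction error is negative.
from collections import deque
import numpy as np

class IntrinsicPenalty:
    def __init__(self, window=10000, alpha=0.25, lam=0.1):
        self.history = deque(maxlen=window)  # past N intrinsic rewards
        self.alpha = alpha                   # quantile threshold
        self.lam = lam                       # penalty weight lambda

    def __call__(self, i_t):
        self.history.append(i_t)
        if len(self.history) > 1:
            threshold = np.quantile(self.history, self.alpha)
            if i_t < threshold:
                # frequently visited state -> small i_t -> lam * log(i_t) < 0
                return self.lam * np.log(i_t + 1e-8)
        return i_t
```

The shaped value returned here would stand in for the raw i_t when computing R_t, so penalized transitions enter replay memory with a lower (R_t − V_θ) and are sampled less often by SIL.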

4.3. Catastrophic Forgetting in RND

The predictor network in RND mainly learns about the states that the agent has recently visited, which is similar to the catastrophic forgetting in continual task learning, where knowledge of previous tasks is forgotten. If the prediction error increases for a state that the agent has visited before, the agent may recognize that previous state as a novel state; consequently, the agent cannot explore effectively. The method to mitigate this phenomenon is simple but effective. We store the output of the target network and the state feature as a memory for the predictor network, just like using a replay memory to reduce the correlation between samples (Mnih et al., 2013), and train the predictor network in batch mode. Using the predictor memory reduces the prediction error for states that the agent previously visited, which is why the agent is more likely to explore novel states. Even if the agent returns to a past policy, the prediction error of the states visited under that policy is low, an intrinsic penalty is given to those states, and the probability of escaping from them is high.
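Below is a minimal sketch of such a predictor memory (an illustration rather than the released code; `PredictorMemory`, `capacity`, and `batch_size` are made-up names): target-network outputs are stored together with the state features, and the RND predictor is trained on random mini-batches drawn from this buffer instead of only on the most recent states.

```python
# Sketch only: replay-style memory for the RND predictor network.
import random
from collections import deque
import torch
import torch.nn.functional as F

class PredictorMemory:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state_feature, target_output):
        # The target network is fixed and random, so its output can be stored detached.
        self.buffer.append((state_feature.detach(), target_output.detach()))

    def sample(self, batch_size=256):
        batch = random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
        feats, targets = zip(*batch)
        return torch.stack(feats), torch.stack(targets)

def train_predictor(predictor, memory, optimizer, batch_size=256):
    """One batch-mode update of the RND predictor from the predictor memory."""
    feats, targets = memory.sample(batch_size)
    loss = F.mse_loss(predictor(feats), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Sampling uniformly from the whole buffer keeps previously visited states in the predictor's training distribution, which is what keeps their prediction error, and hence their spurious "novelty", low.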

https://github.com/junhyukoh/self-imitation-learning

https://github.com/openai/random-network-distillation

The code is all there...
