Amplifying the Imitation Effect for Reinforcement Learning of
UCAV’s Mission Execution
Gyeong Taek Lee 1 Chang Ouk Kim 1
Abstract
This paper proposes a new reinforcement learning (RL) algorithm that enhances exploration by amplifying the imitation effect (AIE). The algorithm combines self-imitation learning (SIL) and random network distillation (RND). We argue that these two algorithms complement each other and that their combination amplifies the imitation effect for exploration. In addition, by adding an intrinsic penalty reward for states that the RL agent visits frequently and by using a replay memory to learn the state features used for the exploration bonus, the proposed approach achieves deep exploration and escapes from the currently converged policy. We verified the exploration performance of the algorithm through experiments in a two-dimensional grid environment. We also applied the algorithm to a simulated environment of unmanned combat aerial vehicle (UCAV) mission execution, and the empirical results show that AIE is very effective for finding the UCAV's shortest flight path while avoiding an enemy's missiles.
4. AIE
4.1. Combining SIL and RND
In this section, we explain why combining RND and SIL can amplify the imitation effect and lead to deep exploration. SIL updates only when the past return $R$ is greater than the current value estimate $V_\theta$, and it imitates the corresponding past decisions. Intuitively, when SIL and RND are combined, the $(R - V_\theta)$ value is larger than with SIL alone because of the exploration bonus. In the process of optimizing the actor-critic network to maximize $R_t = \sum_{k=t}^{\infty} \gamma^{k-t} (i_k + e_k)$, where $i$ is the intrinsic reward and $e$ is the extrinsic reward, the increase in $i_t$ produced by the predictor network causes $R$ to increase. That is, learning progresses by weighting the good decisions of the past, and this type of learning thoroughly reviews the learning history. If the policy starts to converge as learning progresses, $i_t$ becomes lower for states that have been visited frequently. One might think that learning could become slower because, for the same state, $(R_t - V_\theta) > (R_{t+k} - V_\theta)$ for $k > 0$ as $i_t$ decreases. However, SIL exploits past good decisions and leads to deep exploration, and the exploration bonus lets the agent explore novel states further; consequently, the exploration bonus is likely to keep occurring. In addition, with prioritized experience replay (Schaul et al., 2015), the sampling probability is determined by $(R - V_\theta)$; thus, there is a high probability that SIL will exploit previous transitions even as $i_t$ decreases. In other words, the two algorithms complement each other, and SIL is not hampered by the phenomenon in which the prediction error ($i_t$) no longer occurs.
4.2. Intrinsic Penalty Reward
Adding an exploration bonus to a novel state that the agent visits is clearly an effective exploration method. However, once the policy and predictor networks converge, there is no longer an exploration bonus for novel states. In other words, the exploration bonus method rewards the agent when it happens to perform an unexpected action; it does not induce the agent to take unexpected actions. Therefore, an exploration method that entices the agent into unexpected behavior is necessary. We propose providing an intrinsic penalty reward when the agent frequently visits the same state, rather than rewarding the agent only when it makes an unexpected action. The intrinsic penalty reward allows the agent to escape from the converged local policy and helps it experience diverse policies. Specifically, if the current intrinsic reward is less than the $\alpha$-quantile of the past $N$ intrinsic rewards, we provide a penalty by transforming the current intrinsic reward into $\lambda \log(i_t)$, where $\lambda$ is a penalty weight parameter. This reward mechanism prevents the agent from staying with the same policy. In addition, adding a penalty to the intrinsic reward indirectly amplifies the imitation effect. Since $(R_t - V_\theta)$ becomes smaller due to the penalty, the sampling probability of a penalized transition in the replay memory is relatively smaller than that of a non-penalized transition, so SIL updates are more likely to exploit non-penalized transitions. Even if $(R_t - V_\theta) < 0$ due to a penalty, this does not affect SIL, because such transitions are not used for updates under the SIL objective in Equation 4. In other words, the intrinsic penalty reward allows the policy network to deviate from states that the agent constantly visits and indirectly amplifies the imitation effect for SIL.
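A minimal sketch of this penalty rule is given below, assuming the raw intrinsic reward $i_t$ is the RND prediction error; the window size $N$, quantile $\alpha$, and weight $\lambda$ are illustrative values, not the ones used in our experiments.

```python
from collections import deque

import numpy as np


class IntrinsicPenalty:
    """Replace the RND bonus with a penalty when the agent lingers in familiar states:
    if i_t falls below the alpha-quantile of the last N intrinsic rewards, it is
    transformed into lambda * log(i_t), which is negative for a small prediction error."""

    def __init__(self, n=1000, alpha=0.25, lam=0.1):
        self.history = deque(maxlen=n)  # sliding window of the past N intrinsic rewards
        self.alpha = alpha
        self.lam = lam

    def __call__(self, i_t):
        threshold = np.quantile(self.history, self.alpha) if self.history else -np.inf
        self.history.append(i_t)
        if i_t < threshold:
            return self.lam * np.log(i_t + 1e-8)  # penalty for a frequently visited state
        return i_t                                # otherwise keep the ordinary RND bonus
```

In the agent loop, the shaped value returned here would replace $i_t$ when the combined reward $i_t + e_t$ is formed.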
4.3. Catastrophic Forgetting in RND
The predictor network in RND mainly learns about the states that the agent has recently visited, which is similar to catastrophic forgetting in continual task learning, where knowledge of previous tasks is forgotten. If the prediction error increases for a state that the agent has visited before, the agent may recognize that state as novel again; consequently, the agent cannot explore effectively. Our method to mitigate this phenomenon is simple but effective. We store the state features together with the outputs of the target network as a memory for the predictor network, just as a replay memory is used to reduce the correlation between samples (Mnih et al., 2013), and train the predictor network in batch mode. Using the predictor memory keeps the prediction error low for states that the agent previously visited, which makes the agent more likely to explore novel states. Even if the agent returns to a past policy, the prediction error of the states visited under that policy is low, so the intrinsic penalty is applied to those states and the agent is likely to escape from them.
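One possible form of such a predictor memory is sketched below; the capacity, batch size, and number of gradient steps are illustrative, and `predictor` is assumed to be the trainable RND network with the frozen target network's outputs already computed at push time.

```python
import random

import torch
import torch.nn.functional as F


class PredictorMemory:
    """Replay memory for the RND predictor: store state features together with the frozen
    target network's outputs, and fit the predictor on batches drawn from the whole history
    so that previously visited states keep a low prediction error."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.buffer = []

    def push(self, state_feature, target_output):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(random.randrange(len(self.buffer)))  # evict a random old entry
        self.buffer.append((state_feature.detach(), target_output.detach()))

    def train_predictor(self, predictor, optimizer, batch_size=128, steps=4):
        if not self.buffer:
            return
        for _ in range(steps):
            batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
            feats = torch.stack([f for f, _ in batch])
            targets = torch.stack([t for _, t in batch])
            loss = F.mse_loss(predictor(feats), targets)  # same MSE that defines the intrinsic reward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Sampling batches over the whole stored history, rather than fitting only the latest rollout, is what counteracts the forgetting behavior described above.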
https://github.com/junhyukoh/self-imitation-learning
https://github.com/openai/random-network-distillation