Related articles:
[1] PaddlePaddle (GPU/CPU) installation, environment configuration, and a Python introduction
Code links: Gitee: https://gitee.com/dingding962285595/parl_work ; GitHub: https://github.com/PaddlePaddle/PARL
In reinforcement learning there are two broad families of methods: value-based and policy-based.
Typical value-based algorithms are Q-learning and SARSA: they optimize the Q function to its optimum and then derive the optimal policy from the Q function. The typical policy-based algorithm is Policy Gradient, which optimizes the policy function directly; a softmax layer turns the network outputs into action probabilities.
Since the policy function is approximated by a neural network, we need to compute the policy gradient in order to optimize this policy network.
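To make the "softmax turns scores into probabilities" step concrete, here is a minimal NumPy sketch (my illustration; the softmax helper is not part of the repo):

import numpy as np

def softmax(logits):
    # illustration only: subtract the max for numerical stability, then normalize
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

probs = softmax(np.array([1.0, 2.0]))            # ~[0.27, 0.73], sums to 1
action = np.random.choice(len(probs), p=probs)   # sample an action from the distribution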
A trajectory τ = (s1, a1, s2, a2, …) is one episode of interaction between the policy π(s, a) and the environment.
Expected return of π(s, a): the weighted sum of the return R obtained on every trajectory and the probability p of that trajectory occurring. When N is large enough, it can be approximated by sampling N episodes and averaging. Differentiating this objective with respect to the policy parameters θ gives the policy gradient. The optimizers built into most frameworks perform gradient descent, so we simply negate the objective and use it as the loss, which turns the update into gradient ascent.
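In standard REINFORCE notation (a reconstruction of the formulas described above, with R(τ) the return of trajectory τ and p_θ(τ) its probability under the policy):

\bar{R}_\theta = \sum_{\tau} R(\tau)\, p_\theta(\tau) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n})

\nabla_\theta \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^{n})\, \nabla_\theta \log \pi_\theta\!\left(a_t^{n} \mid s_t^{n}\right),
\qquad \text{Loss} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^{n}) \log \pi_\theta\!\left(a_t^{n} \mid s_t^{n}\right)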
After collecting one full trajectory we call learn: the per-step rewards of the episode are turned into returns so that the expectation can be estimated, i.e. the total future reward from each step onward. This is a per-episode (Monte-Carlo) update. G_t denotes the return that can still be collected from time step t onward:
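Written out, it matches the recursion used in calc_reward_to_go below:

G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = r_t + \gamma G_{t+1}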
import numpy as np

def calc_reward_to_go(reward_list, gamma=1.0):
    """Convert the per-step rewards of one episode into reward-to-go returns G_t."""
    for i in range(len(reward_list) - 2, -1, -1):
        # G_i = r_i + γ·G_{i+1}, computed backwards from the last step
        reward_list[i] += gamma * reward_list[i + 1]  # G_t
    return np.array(reward_list)
This converts the reward obtained at each step into the total future return from that step onward; rewards collected before step t are left out, since the action taken at step t cannot influence them.
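A quick sanity check (hypothetical calls, not in the original script):

calc_reward_to_go([1.0, 1.0, 1.0, 1.0])        # -> array([4., 3., 2., 1.])
calc_reward_to_go([1.0, 1.0, 1.0], gamma=0.9)  # -> array([2.71, 1.9 , 1.  ])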
REINFORCE pseudocode (shown as a figure in the original post):
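Since the figure is not reproduced here, a rough sketch of the loop it describes (my paraphrase, not the original figure):

# Initialize the policy parameters θ at random.
# Repeat for each episode:
#     1. Run the current policy π_θ in the environment, recording (s_t, a_t, r_t) until done.
#     2. For every step t, compute the return G_t = r_t + γ·G_{t+1}.
#     3. Update θ with one gradient step on the loss  -Σ_t G_t · log π_θ(a_t | s_t)
#        (i.e. gradient ascent on the expected return).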
How the loss is computed in code:
def learn(self, obs, action, reward):
    """ Update the policy model with the policy gradient algorithm """
    act_prob = self.model(obs)  # predicted action probabilities
    # log_prob = layers.cross_entropy(act_prob, action)  # cross entropy, equivalent to the lines below
    log_prob = layers.reduce_sum(
        -1.0 * layers.log(act_prob) * layers.one_hot(
            action, act_prob.shape[1]),
        dim=1)
    cost = log_prob * reward
    cost = layers.reduce_mean(cost)
    optimizer = fluid.optimizer.Adam(self.lr)
    optimizer.minimize(cost)
    return cost
obs, action and reward are S, A and R. The predicted action probabilities are multiplied element-wise by one_hot(actual action), which picks out -log π(a|s) for the action that was actually taken; multiplying by the reward then gives the loss, i.e. exactly the formula above.
reduce_mean averages the loss over the batch, and this mean loss is handed to the Adam optimizer (every sample contributes one loss term; the average over all of them is what Adam minimizes).
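The same computation in plain NumPy, to illustrate what the fluid ops above do (pg_loss is a hypothetical helper, not part of the PARL code):

import numpy as np

def pg_loss(act_prob, action, reward):
    # illustration only
    # act_prob: (N, act_dim) action probabilities; action: (N,) int labels; reward: (N,) returns G_t
    one_hot = np.eye(act_prob.shape[1])[action]               # one-hot encode the taken actions
    log_prob = np.sum(-np.log(act_prob) * one_hot, axis=1)    # -log π(a_t | s_t) per sample
    return np.mean(log_prob * reward)                         # weight by G_t and average over the batch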
The Agent passes the data it generates to the Algorithm; the Algorithm computes the loss based on the Model's network structure and keeps optimizing it with SGD or another optimizer. This PARL architecture can be applied conveniently to all kinds of deep reinforcement learning problems. The corresponding wrapper classes are:
Model defines the forward network; users are free to customize their own network structure.

class Model(parl.Model):
    def __init__(self, act_dim):
        act_dim = act_dim
        hid1_size = act_dim * 10

        self.fc1 = layers.fc(size=hid1_size, act='tanh')
        self.fc2 = layers.fc(size=act_dim, act='softmax')

    def forward(self, obs):  # can be called directly: model = Model(5); model(obs)
        out = self.fc1(obs)
        out = self.fc2(out)
        return out
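For the CartPole-v0 environment used below, act_dim = 2, so this builds a hidden layer of 2 × 10 = 20 tanh units followed by a 2-way softmax output, one probability per action.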
Algorithm defines the concrete algorithm used to update the forward network (Model), i.e. it updates the Model by defining the loss function; all algorithm-related computation lives in the Algorithm.

class PolicyGradient(parl.Algorithm):
    def __init__(self, model, lr=None):
        """ Policy Gradient algorithm

        Args:
            model (parl.Model): forward network of the policy.
            lr (float): learning rate.
        """
        self.model = model
        assert isinstance(lr, float)
        self.lr = lr

    def predict(self, obs):
        """ Predict the action probabilities with the policy model """
        return self.model(obs)

    def learn(self, obs, action, reward):
        """ Update the policy model with the policy gradient algorithm """
        act_prob = self.model(obs)  # predicted action probabilities
        # log_prob = layers.cross_entropy(act_prob, action)  # cross entropy; same effect as the formula below
        log_prob = layers.reduce_sum(
            -1.0 * layers.log(act_prob) * layers.one_hot(
                action, act_prob.shape[1]),
            dim=1)
        cost = log_prob * reward
        cost = layers.reduce_mean(cost)
        optimizer = fluid.optimizer.Adam(self.lr)
        optimizer.minimize(cost)
        return cost
Agent is responsible for the interaction between the algorithm and the environment. During the interaction it feeds the generated data to the Algorithm so that the Model can be updated, and the data preprocessing is usually also defined here.
class Agent(parl.Agent):
    def __init__(self, algorithm, obs_dim, act_dim):
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        super(Agent, self).__init__(algorithm)

    def build_program(self):
        self.pred_program = fluid.Program()
        self.learn_program = fluid.Program()

        with fluid.program_guard(self.pred_program):  # build the graph for predicting actions; define input/output variables
            obs = layers.data(
                name='obs', shape=[self.obs_dim], dtype='float32')
            self.act_prob = self.alg.predict(obs)

        with fluid.program_guard(
                self.learn_program):  # build the graph for updating the policy network; define input/output variables
            obs = layers.data(
                name='obs', shape=[self.obs_dim], dtype='float32')
            act = layers.data(name='act', shape=[1], dtype='int64')
            reward = layers.data(name='reward', shape=[], dtype='float32')
            self.cost = self.alg.learn(obs, act, reward)

    def sample(self, obs):
        obs = np.expand_dims(obs, axis=0)  # add a batch dimension: (obs_dim,) -> (1, obs_dim)
        act_prob = self.fluid_executor.run(
            self.pred_program,
            feed={'obs': obs.astype('float32')},
            fetch_list=[self.act_prob])[0]
        act_prob = np.squeeze(act_prob, axis=0)  # remove the batch dimension again
        act = np.random.choice(range(self.act_dim), p=act_prob)  # sample an action from the action probabilities
        return act
    # self.fluid_executor.run works on batches, so the input gains one dimension and the output loses one

    def predict(self, obs):
        obs = np.expand_dims(obs, axis=0)
        act_prob = self.fluid_executor.run(
            self.pred_program,
            feed={'obs': obs.astype('float32')},
            fetch_list=[self.act_prob])[0]
        act_prob = np.squeeze(act_prob, axis=0)
        act = np.argmax(act_prob)  # pick the action with the highest probability
        return act

    def learn(self, obs, act, reward):
        act = np.expand_dims(act, axis=-1)
        feed = {
            'obs': obs.astype('float32'),
            'act': act.astype('int64'),
            'reward': reward.astype('float32')
        }
        cost = self.fluid_executor.run(
            self.learn_program, feed=feed, fetch_list=[self.cost])[0]
        return cost
def main():
    env = gym.make('CartPole-v0')
    # env = env.unwrapped  # remove the 200-step episode cap
    obs_dim = env.observation_space.shape[0]
    act_dim = env.action_space.n
    logger.info('obs_dim {}, act_dim {}'.format(obs_dim, act_dim))

    # build the agent with the PARL framework
    model = Model(act_dim=act_dim)
    alg = PolicyGradient(model, lr=LEARNING_RATE)
    agent = Agent(alg, obs_dim=obs_dim, act_dim=act_dim)

    # load a saved model
    # if os.path.exists('./model.ckpt'):
    #     agent.restore('./model.ckpt')
    #     run_episode(env, agent, train_or_test='test', render=True)
    #     exit()

    for i in range(1000):
        obs_list, action_list, reward_list = run_episode(env, agent)
        if i % 10 == 0:
            logger.info("Episode {}, Reward Sum {}.".format(
                i, sum(reward_list)))

        batch_obs = np.array(obs_list)
        batch_action = np.array(action_list)
        batch_reward = calc_reward_to_go(reward_list)

        agent.learn(batch_obs, batch_action, batch_reward)
        if (i + 1) % 100 == 0:
            total_reward = evaluate(env, agent, render=True)
            logger.info('Test reward: {}'.format(total_reward))

    # save the parameters to ./model.ckpt
    agent.save('./model.ckpt')
batch_reward = calc_reward_to_go(reward_list) converts the step rewards into the returns G_t; once all of an episode's data has been collected, learn is called once to estimate the expectation. run_episode runs a single episode until done is True, while evaluate runs 5 episodes and averages their rewards.
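run_episode and evaluate themselves are defined elsewhere in the script (see the linked repo); a sketch of what run_episode does, consistent with how it is called in main() (a reconstruction, not the verbatim source):

def run_episode(env, agent):
    # reconstruction for illustration: roll out one episode with the stochastic policy
    obs_list, action_list, reward_list = [], [], []
    obs = env.reset()
    while True:
        obs_list.append(obs)
        action = agent.sample(obs)              # sample an action from the predicted probabilities
        action_list.append(action)
        obs, reward, done, _ = env.step(action)
        reward_list.append(reward)
        if done:
            break
    return obs_list, action_list, reward_list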
[03-22 15:38:26 MainThread @train.py:95] Episode 800, Reward Sum 118.0.
[03-22 15:38:29 MainThread @train.py:95] Episode 810, Reward Sum 177.0.
[03-22 15:38:32 MainThread @train.py:95] Episode 820, Reward Sum 86.0.
[03-22 15:38:37 MainThread @train.py:95] Episode 830, Reward Sum 200.0.
[03-22 15:38:40 MainThread @train.py:95] Episode 840, Reward Sum 200.0.
[03-22 15:38:44 MainThread @train.py:95] Episode 850, Reward Sum 59.0.
[03-22 15:38:48 MainThread @train.py:95] Episode 860, Reward Sum 61.0.
[03-22 15:38:52 MainThread @train.py:95] Episode 870, Reward Sum 156.0.
[03-22 15:38:57 MainThread @train.py:95] Episode 880, Reward Sum 99.0.
[03-22 15:39:00 MainThread @train.py:95] Episode 890, Reward Sum 16.0.
[03-22 15:39:22 MainThread @train.py:104] Test reward: 200.0
[03-22 15:39:23 MainThread @train.py:95] Episode 900, Reward Sum 165.0.
[03-22 15:39:27 MainThread @train.py:95] Episode 910, Reward Sum 141.0.
[03-22 15:39:32 MainThread @train.py:95] Episode 920, Reward Sum 200.0.
[03-22 15:39:37 MainThread @train.py:95] Episode 930, Reward Sum 200.0.
[03-22 15:39:42 MainThread @train.py:95] Episode 940, Reward Sum 200.0.
[03-22 15:39:47 MainThread @train.py:95] Episode 950, Reward Sum 200.0.
[03-22 15:39:52 MainThread @train.py:95] Episode 960, Reward Sum 113.0.
[03-22 15:39:58 MainThread @train.py:95] Episode 970, Reward Sum 152.0.
[03-22 15:40:03 MainThread @train.py:95] Episode 980, Reward Sum 200.0.
[03-22 15:40:09 MainThread @train.py:95] Episode 990, Reward Sum 180.0.
[03-22 15:40:30 MainThread @train.py:104] Test reward: 200.0
The reward obtained during training is not always high because actions are selected stochastically during training (sample), whereas at test time the action with the highest probability is chosen (predict), so the test reward is higher. As the parameters are updated, the probability assigned to the best action keeps growing, so the stochastic training rollouts also improve over time.