TensorFlow 2.2.0 reinforcement learning: gradients of model parameters are None

Stack Overflow user
Asked 2020-07-03 03:45:23
1 answer, 98 views, 0 followers, 0 votes

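I am trying to create a basic deep Q-learning neural network to play heads-up Texas hold 'em.

For a given state, the model must produce a probability distribution over a set of possible actions, which in this case has been simplified to fold, check, or an aggressive option (call/bet/raise). The details of how I plan to train the model are not relevant to this question.

Because the model weights are randomly initialized, the model will occasionally attempt an illegal move (for example, trying to "check" when the opponent has raised, instead of calling or folding). When this happens, I want to punish the network and have it update its weights so that illegal moves are "filtered out" of the model's repertoire.

In this situation, I assign the model a loss of 1.0 and try to compute the gradient of the loss with respect to the model parameters. But for some reason, whenever I try this, the gradients always end up as None. What does this mean, and what am I doing wrong?

I am relatively naive when it comes to many things in TensorFlow, so it would be very helpful if you could avoid assuming much TensorFlow knowledge and simplify the more advanced concepts.

Here is the code for the example I am describing: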

from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras import Sequential, Input
from tensorflow.keras.optimizers import Adam
import tensorflow as tf
import numpy as np


# Creating a linear model with a simple structure. The input size represents a combination of vectors that
## will be used to create the input state.
model = Sequential()
model.add(Input(shape=(52 + 52 + 2 + 3 + 3,), batch_size=1))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(3))
# The model will produce probability values for choosing from three output actions (fold, check, call/bet/raise.) 
model.add(Activation('softmax'))
model.compile(loss="mse", optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])

# Bot hand is a 1-D boolean array of size 52. Each index represents a unique card, and a value of 1 in that index
## indicates the card is present in the hand.
bot_hand = np.zeros((52))

# For this example, we'll assume cards with ids 0 and 1 are in the bot's hand.
bot_hand[:2] = 1

# The same applies to the cards on the table; a 1-D boolean array of size 52.
table_cards = np.zeros((52))

# In this example, the bot has bet $2.
bot_bet_size = np.array([2])

# The opponent has bet $4.
opponent_bet_size = np.array([4])

# The last action the bot has ever made is represented as a 1-D boolean vector of size 3. A value of 1 in an 
## index represents which action has been taken. 0=fold, 1=check, 2=bet/raise.
bot_previous_action = np.zeros((3))

# The opponent's previous action is represented in the same way.
opponent_previous_action = np.zeros((3))

# In this example, the opponent's last action was to raise to $4.
opponent_previous_action[2] = 1

# The 'state' is represented by all of the above vectors.
bot_state = np.concatenate((bot_hand, table_cards, bot_bet_size, opponent_bet_size, bot_previous_action, 
                           opponent_previous_action), axis=0)

# The action taken is determined by which of the Softmax output nodes produces the highest output.
model_output = model.predict(bot_state.reshape(1, 52 + 52 + 2 + 3 + 3))
bot_decision = np.argmax(model_output)
                         
# Let's assume in this case the bot chose action '1' (check), which is an illegal move, since the opponent has
## raised. I am attempting to punish the network by assigning it a loss value of 1, and backpropagating to
## update the weights.
with tf.GradientTape() as t:
    # Creating dummy output 
    correct_output = model_output - 1
    # Ensuring that the loss is equal to 1.
    loss = tf.keras.losses.mse(correct_output, model_output)

gradients = t.gradient(loss, model.trainable_variables)

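The `gradients` variable always ends up looking like [None, None, None, None, None, None].

I would really appreciate some advice on how to fix this, or on how to handle the illegal-move problem in a different way.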
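One commonly used alternative to penalizing illegal moves, which is not part of the original post, is to mask them out of the softmax output before choosing an action. The sketch below assumes a hypothetical helper `choose_legal_action` and a boolean `legal_mask` supplied by the game logic; neither name comes from the question.

```python
import numpy as np


def choose_legal_action(probabilities, legal_mask):
    """Zero out illegal actions, renormalize, and pick the most likely legal one.

    `probabilities` is the softmax output for one state (hypothetical input).
    `legal_mask` is True where an action is legal (assumed to come from the game rules).
    """
    probabilities = np.asarray(probabilities, dtype=float)
    legal_mask = np.asarray(legal_mask, dtype=bool)

    # Keep probability mass only on legal actions.
    masked = np.where(legal_mask, probabilities, 0.0)
    total = masked.sum()
    if total == 0.0:
        # Degenerate case: fall back to a uniform distribution over legal actions.
        masked = legal_mask.astype(float)
        total = masked.sum()

    renormalized = masked / total
    return int(np.argmax(renormalized)), renormalized


# Actions: 0=fold, 1=check, 2=call/bet/raise. Checking is illegal after a raise.
action, probs = choose_legal_action([0.2, 0.5, 0.3], [True, False, True])
print(action)  # 2
```

With this approach the network never has to learn the rules of the game; the environment enforces them, and the loss can focus on how good the legal actions are.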

1 Answer

Stack Overflow user

Answered 2020-07-03 15:57:43

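Solved. Several problems were causing this issue. I will list all of them here, along with the code solution.

  1. The watch() method needed to be added to the GradientTape block, to explicitly tell the GradientTape to track the model's model.trainable_variables.
  2. The output used for the gradient needs to be the tensor returned by model(input), not the NumPy array returned by model.predict(input).
  3. The model output also needs to be computed inside the GradientTape block, not outside it.
  4. The target object (correct_output in the code below) needs to be a tensor rather than a NumPy array.
  5. The MSE loss function is not well suited to this model, because it has a Softmax output layer and is architecturally closer to a classification network. The categorical_crossentropy loss function was used instead.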
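Points 2 and 3 can be illustrated with a minimal sketch. The tiny model and input here are hypothetical stand-ins, not the poker network from the question. The sketch contrasts a forward pass done outside the tape with `model.predict` against one done inside the tape with `model(...)`.

```python
import numpy as np
import tensorflow as tf

# Hypothetical tiny model standing in for the poker network.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
state = np.ones((1, 4), dtype=np.float32)

# Case 1: the forward pass happens outside the tape, and predict() returns
# a NumPy array, so the tape never sees an operation involving the weights.
outside_output = model.predict(state, verbose=0)
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(outside_output - 1.0))
grads_outside = tape.gradient(loss, model.trainable_variables)

# Case 2: the forward pass happens inside the tape via model(...), which
# returns a tensor connected to the weights.
with tf.GradientTape() as tape:
    inside_output = model(state)
    loss = tf.reduce_mean(tf.square(inside_output - 1.0))
grads_inside = tape.gradient(loss, model.trainable_variables)

print([g is None for g in grads_outside])  # [True, True]
print([g is None for g in grads_inside])   # [False, False]
```

In the first case the loss is effectively a constant from the tape's point of view, which is why every gradient comes back as None.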

Full code:

with tf.GradientTape(watch_accessed_variables=False) as t:
    t.watch(model.trainable_variables)
    model_output = model(bot_state.reshape(1, 52 + 52 + 2 + 3 + 3))
    # Creating dummy output - action 1 is illegal
    correct_output = tf.convert_to_tensor([[0.5, 0, 0.5]])
    # Calculating loss
    loss = tf.keras.losses.categorical_crossentropy(correct_output, model_output)

gradients = t.gradient(loss, model.trainable_variables)
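
The code above stops once the gradients are computed. A minimal sketch of the remaining step, applying them with an optimizer, might look like the following. The small model, `state`, and `target` are hypothetical stand-ins rather than the poker network. Note that `tf.GradientTape` watches trainable variables automatically by default (`watch_accessed_variables=True`), so this sketch omits the explicit `watch()` call; it is only required when automatic watching is turned off, as in the code above.

```python
import numpy as np
import tensorflow as tf

# Hypothetical tiny model standing in for the poker network.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

state = np.ones((1, 4), dtype=np.float32)
# Target distribution that puts no mass on the illegal action (index 1).
target = tf.constant([[0.5, 0.0, 0.5]])

# Snapshot of the weights so the update can be checked afterwards.
before = [v.numpy().copy() for v in model.trainable_variables]

with tf.GradientTape() as tape:
    output = model(state)
    loss = tf.keras.losses.categorical_crossentropy(target, output)

gradients = tape.gradient(loss, model.trainable_variables)
# Pair each gradient with its variable and take one optimizer step.
optimizer.apply_gradients(zip(gradients, model.trainable_variables))

changed = any(
    not np.array_equal(b, v.numpy())
    for b, v in zip(before, model.trainable_variables)
)
print(changed)
```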
Votes: 0
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/62703974
