I am trying to build a basic deep Q-learning neural network to play heads-up (two-player) Texas hold 'em.
For a given state, the model must produce a probability distribution over a set of possible actions, which in this case has been simplified to fold, check, or an aggressive option (call/bet/raise). The details of how I plan to train the model are not relevant to this question.
Because the model's weights are randomly initialized, it will occasionally attempt an illegal move (for example, trying to 'check' when the opponent has raised, instead of calling or folding). When that happens, I want to punish the network and have it update its weights so that illegal moves get filtered out of its repertoire.
In such a case, I assign the model a loss of 1.0 and try to compute the gradient of that loss with respect to the model's parameters. But for some reason, whenever I try this, the gradients always come out as None. What does that mean, and what am I doing wrong?
I'm fairly new to many things in TensorFlow, so it would be very helpful if you could assume little TensorFlow knowledge and simplify the more advanced concepts.
Here is the code for the example I'm describing:
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras import Sequential, Input
from tensorflow.keras.optimizers import Adam
import tensorflow as tf
import numpy as np
# Creating a simple feed-forward model. The input size represents a combination of vectors that
## will be used to create the input state.
model = Sequential()
model.add(Input(shape=(52 + 52 + 2 + 3 + 3,), batch_size=1))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(3))
# The model will produce probability values for choosing from three output actions (fold, check, call/bet/raise).
model.add(Activation('softmax'))
model.compile(loss="mse", optimizer=Adam(lr=0.001), metrics=['accuracy'])
# Bot hand is a 1-D boolean array of size 52. Each index represents a unique card, and a value of 1 in that index
## indicates the card is present in the hand.
bot_hand = np.zeros((52))
# For this example, we'll assume cards with ids 0 and 1 are in the bot's hand.
bot_hand[:2] = 1
# The same applies to the cards on the table; a 1-D boolean array of size 52.
table_cards = np.zeros((52))
# In this example, the bot has bet $2.
bot_bet_size = np.array([2])
# The opponent has bet $4.
opponent_bet_size = np.array([4])
# The bot's previous action is represented as a 1-D boolean vector of size 3. A value of 1 in an
## index indicates which action was taken: 0=fold, 1=check, 2=bet/raise.
bot_previous_action = np.zeros((3))
# The opponent's previous action is represented in the same way.
opponent_previous_action = np.zeros((3))
# In this example, the opponent's last action was to raise to $4.
opponent_previous_action[2] = 1
# The 'state' is represented by all of the above vectors.
bot_state = np.concatenate((bot_hand, table_cards, bot_bet_size, opponent_bet_size, bot_previous_action, 
                           opponent_previous_action), axis=0)
# The action taken is determined by which of the softmax output nodes produces the highest value.
model_output = model.predict(bot_state.reshape(1, 52 + 52 + 2 + 3 + 3))
bot_decision = np.argmax(model_output)
                         
# Let's assume in this case the bot chose action '1' (check), which is an illegal move, since the opponent has
## raised. I am attempting to punish the network by assigning it a loss value of 1, and backpropagating to
## update the weights.
with tf.GradientTape() as t:
    # Creating dummy output 
    correct_output = model_output - 1
    # Ensuring that the loss is equal to 1.
    loss = tf.keras.losses.mse(correct_output, model_output)
gradients = t.gradient(loss, model.trainable_variables)
The 'gradients' variable always ends up looking like [None, None, None, None, None, None].
I would really appreciate some advice on how to fix this, or on a different way to approach the illegal-move problem.
Posted on 2020-07-03 15:57:43
Solved it. There were several problems causing this, and I'll list all of them here, along with the code that fixes them.

1. Added the watch() method inside the GradientTape block to explicitly tell the GradientTape to track the model's trainable variables (model.trainable_variables).
2. The model output has to be the tensor obtained by calling model(input), not the NumPy array returned by model.predict(input).
3. The loss has to be computed inside the GradientTape block, not outside of it.
4. The correct_output object needs to be a tensor rather than a NumPy array.
5. The MSE loss function is not a good fit for this model, since it has a softmax output layer and is architecturally closer to a classification network; the categorical_crossentropy loss function was used instead.

Full code:
with tf.GradientTape(watch_accessed_variables=False) as t:
    t.watch(model.trainable_variables)
    model_output = model(bot_state.reshape(1, 52 + 52 + 2 + 3 + 3))
    # Creating dummy output - action 1 is illegal
    correct_output = tf.convert_to_tensor([[0.5, 0, 0.5]])
    # Calculating loss
    loss = tf.keras.losses.categorical_crossentropy(correct_output, model_output)
gradients = t.gradient(loss, model.trainable_variables)
Source: https://stackoverflow.com/questions/62703974
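For completeness: the fix above stops at computing the gradients. To actually update the weights, as the question set out to do, the gradients still need to be fed to an optimizer. Below is a minimal sketch of that final step, assuming the gradients and model from the snippet above; the optimizer variable is my own addition, reusing the Adam settings from the question's compile call.

import tensorflow as tf

# Assumed optimizer, mirroring the Adam(learning_rate=0.001) used at compile time.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
# Standard TF2 custom-training step: pair each gradient with its variable and apply.
# This performs a single optimizer step in place on the model's weights.
optimizer.apply_gradients(zip(gradients, model.trainable_variables))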