I am trying to build a basic deep Q-learning neural network to play heads-up (two-player) Texas hold 'em.
For a given state, the model must produce a probability distribution over a set of possible actions, which in this case has been simplified to fold, check, or an aggressive option (call/bet/raise). The details of how I plan to train the model are not relevant to this question.
Because the model's weights are randomly initialized, it will occasionally attempt an illegal move (for example, trying to 'check' when the opponent has raised, instead of calling or folding). When that happens, I want to punish the network and have it update its weights so that illegal moves get filtered out of its repertoire.
In such a case, I assign the model a loss of 1.0 and try to compute the gradient of that loss with respect to the model's parameters. But for some reason, whenever I try this, the gradients always come out as None. What does that mean, and what am I doing wrong?
I'm fairly new to many things in TensorFlow, so it would be very helpful if you could assume little TensorFlow knowledge and simplify the more advanced concepts.
Here is the code for the example I'm describing:
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras import Sequential, Input
from tensorflow.keras.optimizers import Adam
import tensorflow as tf
import numpy as np
# Creating a simple feed-forward model. The input size represents a combination of vectors that
## will be used to create the input state.
model = Sequential()
model.add(Input(shape=(52 + 52 + 2 + 3 + 3,), batch_size=1))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(3))
# The model will produce probability values for choosing from three output actions (fold, check, call/bet/raise).
model.add(Activation('softmax'))
model.compile(loss="mse", optimizer=Adam(lr=0.001), metrics=['accuracy'])
# Bot hand is a 1-D boolean array of size 52. Each index represents a unique card, and a value of 1 in that index
## indicates the card is present in the hand.
bot_hand = np.zeros((52))
# For this example, we'll assume cards with ids 0 and 1 are in the bot's hand.
bot_hand[:2] = 1
# The same applies to the cards on the table; a 1-D boolean array of size 52.
table_cards = np.zeros((52))
# In this example, the bot has bet $2.
bot_bet_size = np.array([2])
# The opponent has bet $4.
opponent_bet_size = np.array([4])
# The bot's previous action is represented as a 1-D boolean vector of size 3. A value of 1 in an
## index indicates which action was taken: 0=fold, 1=check, 2=bet/raise.
bot_previous_action = np.zeros((3))
# The opponent's previous action is represented in the same way.
opponent_previous_action = np.zeros((3))
# In this example, the opponent's last action was to raise to $4.
opponent_previous_action[2] = 1
# The 'state' is represented by all of the above vectors.
bot_state = np.concatenate((bot_hand, table_cards, bot_bet_size, opponent_bet_size, bot_previous_action, 
                           opponent_previous_action), axis=0)
# The action taken is determined by which of the softmax output nodes produces the highest value.
model_output = model.predict(bot_state.reshape(1, 52 + 52 + 2 + 3 + 3))
bot_decision = np.argmax(model_output)
                         
# Let's assume in this case the bot chose action '1' (check), which is an illegal move, since the opponent has
## raised. I am attempting to punish the network by assigning it a loss value of 1, and backpropagating to
## update the weights.
with tf.GradientTape() as t:
    # Creating dummy output 
    correct_output = model_output - 1
    # Ensuring that the loss is equal to 1.
    loss = tf.keras.losses.mse(correct_output, model_output)
gradients = t.gradient(loss, model.trainable_variables)
The 'gradients' variable always ends up looking like [None, None, None, None, None, None].
I would really appreciate some advice on how to fix this, or on a different way to approach the illegal-move problem.
Posted on 2020-07-03 15:57:43
Solved it. There were several problems causing this, and I'll list all of them here, along with the code that fixes them.

1. Added the watch() method inside the GradientTape block to explicitly tell the GradientTape to track the model's trainable variables (model.trainable_variables).
2. The model output has to be the tensor obtained by calling model(input), not the NumPy array returned by model.predict(input).
3. The loss has to be computed inside the GradientTape block, not outside of it.
4. The correct_output object needs to be a tensor rather than a NumPy array.
5. The MSE loss function is not a good fit for this model, since it has a softmax output layer and is architecturally closer to a classification network; the categorical_crossentropy loss function was used instead.

Full code:
with tf.GradientTape(watch_accessed_variables=False) as t:
    t.watch(model.trainable_variables)
    model_output = model(bot_state.reshape(1, 52 + 52 + 2 + 3 + 3))
    # Creating dummy output - action 1 is illegal
    correct_output = tf.convert_to_tensor([[0.5, 0, 0.5]])
    # Calculating loss
    loss = tf.keras.losses.categorical_crossentropy(correct_output, model_output)
gradients = t.gradient(loss, model.trainable_variables)
Source: https://stackoverflow.com/questions/62703974
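For completeness: the fix above stops at computing the gradients. To actually update the weights, as the question set out to do, the gradients still need to be fed to an optimizer. Below is a minimal sketch of that final step, assuming the gradients and model from the snippet above; the optimizer variable is my own addition, reusing the Adam settings from the question's compile call.

import tensorflow as tf

# Assumed optimizer, mirroring the Adam(learning_rate=0.001) used at compile time.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
# Standard TF2 custom-training step: pair each gradient with its variable and apply.
# This performs a single optimizer step in place on the model's weights.
optimizer.apply_gradients(zip(gradients, model.trainable_variables))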