out-of-memory issues if you use the same hyperparameters described in the paper, adjust the following parameters:
- max_seq_length: the released model was trained with ...
- Optimizer: the released model was trained with Adam, which requires a lot of extra memory for the m and v vectors.

- ... with a masked language modeling head and a next sentence prediction classifier on top (fully pre-trained),
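Adam's extra memory cost is easy to estimate: the optimizer keeps two fp32 moment tensors (m and v) per parameter, so optimizer state alone adds roughly twice the parameter memory. A minimal back-of-the-envelope sketch (the ~110M parameter count for BERT-base is an approximate figure, not from this document):

```python
def adam_state_bytes(num_params, bytes_per_param=4):
    """Extra memory Adam needs beyond the weights themselves.

    Adam stores two fp32 values per parameter: the first moment (m)
    and the second moment (v).
    """
    return 2 * num_params * bytes_per_param

# BERT-base has roughly 110M parameters (approximate).
params_bert_base = 110_000_000
extra_gib = adam_state_bytes(params_bert_base) / 1024**3
print(f"Adam optimizer state: ~{extra_gib:.2f} GiB on top of the weights")
```

This is why switching to a stateless optimizer (or to gradient accumulation with a smaller batch) is a common way to dodge out-of-memory errors.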
- BertForSequenceClassification ... is pre-trained, the multiple choice classification head is only initialized and has to be trained),
- BertForTokenClassification ...

In the source code:
```python
__init__(params, lr=required, warmup=-1, t_total=-1, schedule='warmup_linear', betas=(0.9, 0.999), ...)
```
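For intuition, the `'warmup_linear'` schedule named in that signature is commonly implemented along these lines. This is a sketch under stated assumptions, not the library's exact code; the decay shape after warmup can differ between versions:

```python
def warmup_linear(progress, warmup):
    """Learning-rate multiplier for a linear warmup + linear decay schedule.

    progress: fraction of total training steps completed (step / t_total)
    warmup:   fraction of steps spent ramping up (the `warmup` argument;
              a negative value in the signature means "no warmup")
    """
    if warmup > 0 and progress < warmup:
        # Ramp linearly from 0 up to the base learning rate.
        return progress / warmup
    # Then decay linearly from the base learning rate down to 0.
    return max(0.0, 1.0 - progress)

# Halfway through warmup (warmup=0.1), the multiplier is 0.5;
# at the end of training it has decayed to 0.
for p in (0.05, 0.1, 0.5, 1.0):
    print(p, warmup_linear(p, warmup=0.1))
```

The `t_total` argument exists precisely so the optimizer can compute `progress` and apply this schedule internally at each step.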