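Reposted from: the Synced (机器之心) editorial team
Being able to call a variety of language models flexibly has long been on the wish list of NLP researchers. HuggingFace recently open-sourced its latest Transformers 2.0 library, which lets users fine-tune and apply 8 of today's most popular language-model architectures and is compatible with both TensorFlow 2.0 and PyTorch.
HuggingFace, a startup focused on natural language processing (NLP), recently released a major update to its widely used Transformers library, bringing tight interoperability between the two major deep learning frameworks, PyTorch and TensorFlow 2.0.
The updated Transformers 2.0 combines the ease of use of PyTorch with the production-grade ecosystem of TensorFlow. With it, researchers and practitioners can more easily pick a different framework for the training, evaluation, and production stages of the same language model.
So what are the notable features of Transformers 2.0, and what does it improve for NLP researchers and practitioners? Synced has put together a summary.
Project: https://github.com/huggingface/transformers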
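New features in Transformers 2.0
State-of-the-art NLP for everyone
Lower compute costs and a smaller carbon footprint
Choose the right framework for every stage of a model's lifetime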
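Supported models
The project provides a list of supported models, covering well-known pretrained language models and their variants, including DistilBERT, a distilled version of BERT implemented by HuggingFace itself: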
1. BERT (https://github.com/google-research/bert)
2. GPT (https://github.com/openai/finetune-transformer-lm)
3. GPT-2 (https://blog.openai.com/better-language-models/)
4. Transformer-XL (https://github.com/kimiyoung/transformer-xl)
5. XLNet (https://github.com/zihangdai/xlnet/)
6. XLM (https://github.com/facebookresearch/XLM/)
7. RoBERTa (https://github.com/pytorch/fairseq/tree/master/examples/roberta)
8. DistilBERT (https://github.com/huggingface/transformers/tree/master/examples/distillation)
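Quick start
How do you use the Transformers toolkit? The project provides many code examples. The following one loads each supported model and inspects its internal outputs: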
import torch
from transformers import *

# Transformers has a unified API
# for 8 transformer architectures and 30 pretrained weights.
# Model | Tokenizer | Pretrained weights shortcut
MODELS = [(BertModel,       BertTokenizer,       'bert-base-uncased'),
          (OpenAIGPTModel,  OpenAIGPTTokenizer,  'openai-gpt'),
          (GPT2Model,       GPT2Tokenizer,       'gpt2'),
          (TransfoXLModel,  TransfoXLTokenizer,  'transfo-xl-wt103'),
          (XLNetModel,      XLNetTokenizer,      'xlnet-base-cased'),
          (XLMModel,        XLMTokenizer,        'xlm-mlm-enfr-1024'),
          (DistilBertModel, DistilBertTokenizer, 'distilbert-base-uncased'),
          (RobertaModel,    RobertaTokenizer,    'roberta-base')]

# To use TensorFlow 2.0 versions of the models, simply prefix the class names with 'TF',
# e.g. TFRobertaModel is the TF 2.0 counterpart of the PyTorch model RobertaModel

# Let's encode some text in a sequence of hidden-states using each model:
for model_class, tokenizer_class, pretrained_weights in MODELS:
    # Load pretrained model/tokenizer
    tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
    model = model_class.from_pretrained(pretrained_weights)

    # Encode text
    # add_special_tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
    input_ids = torch.tensor([tokenizer.encode("Here is some text to encode", add_special_tokens=True)])
    with torch.no_grad():
        last_hidden_states = model(input_ids)[0]  # Model outputs are now tuples
# Each architecture is provided with several classes for fine-tuning on down-stream tasks, e.g.
BERT_MODEL_CLASSES = [BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
                      BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
                      BertForQuestionAnswering]

# All the classes for an architecture can be initiated from pretrained weights for this architecture.
# Note that additional weights added for fine-tuning are only initialized
# and need to be trained on the down-stream task.
pretrained_weights = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(pretrained_weights)
for model_class in BERT_MODEL_CLASSES:
    # Load pretrained model
    model = model_class.from_pretrained(pretrained_weights)

    # Models can return the full list of hidden-states & attention weights at each layer
    model = model_class.from_pretrained(pretrained_weights,
                                        output_hidden_states=True,
                                        output_attentions=True)
    input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
    all_hidden_states, all_attentions = model(input_ids)[-2:]

    # Models are compatible with Torchscript
    model = model_class.from_pretrained(pretrained_weights, torchscript=True)
    traced_model = torch.jit.trace(model, (input_ids,))

    # Simple serialization for models and tokenizers
    model.save_pretrained('./directory/to/save/')  # save
    model = model_class.from_pretrained('./directory/to/save/')  # re-load
    tokenizer.save_pretrained('./directory/to/save/')  # save
    tokenizer = BertTokenizer.from_pretrained('./directory/to/save/')  # re-load

# SOTA examples for GLUE, SQUAD, text generation...
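The `[0]` and `[-2:]` indexing in the code above follows from models returning plain tuples. Here is a minimal, library-free sketch of that convention, assuming a BERT-style output order of last hidden state, pooled output, all hidden states, all attentions. The strings stand in for tensors, and `fake_model_outputs` is a hypothetical helper, not part of Transformers:

```python
def fake_model_outputs(output_hidden_states=False, output_attentions=False):
    """Hypothetical stand-in for a model call; strings take the place of tensors."""
    outputs = ("last_hidden_state", "pooled_output")
    if output_hidden_states:
        outputs += ("all_hidden_states",)
    if output_attentions:
        outputs += ("all_attentions",)
    return outputs


# Default call: the first element is the last layer's hidden states.
last_hidden_states = fake_model_outputs()[0]

# With both flags enabled, the two extra entries sit at the end of the tuple,
# which is why slicing with [-2:] retrieves them.
all_hidden_states, all_attentions = fake_model_outputs(True, True)[-2:]
print(last_hidden_states, all_hidden_states, all_attentions)
```

If only one of the two flags were enabled, `[-2:]` would pick up an unrelated element, so the slice is only safe when both are requested together.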
Transformers supports both PyTorch and TensorFlow 2.0, and the two can be used together. The following code uses TensorFlow 2.0 with Transformers:
import tensorflow as tf
import tensorflow_datasets
from transformers import *
#Load dataset, tokenizer, model from pretrained model/vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
data = tensorflow_datasets.load('glue/mrpc')
#Prepare dataset for GLUE as a tf.data.Dataset instance
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, max_length=128, task='mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
valid_dataset = valid_dataset.batch(64)
#Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
#Train and evaluate using tf.keras.Model.fit()
history = model.fit(train_dataset, epochs=2, steps_per_epoch=115,
                    validation_data=valid_dataset, validation_steps=7)
#Load the TensorFlow model in PyTorch for inspection
model.save_pretrained('./save/')
pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf=True)
#Quickly test a few predictions - MRPC is a paraphrasing task, let's see if our model learned the task
sentence_0 = "This research was consistent with his findings."
sentence_1 = "His findings were compatible with this research."
sentence_2 = "His findings were not compatible with this research."
inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')
pred_1 = pytorch_model(**inputs_1)[0].argmax().item()
pred_2 = pytorch_model(**inputs_2)[0].argmax().item()
print("sentence_1 is", "a paraphrase" if pred_1 else "not a paraphrase", "of sentence_0")
print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sentence_0")
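The `steps_per_epoch=115` and `validation_steps=7` values passed to `model.fit()` are consistent with the number of batches in one pass over each split. A short check, assuming MRPC has 3,668 training and 408 validation examples (the article does not state these sizes, and `batches_per_pass` is a hypothetical helper):

```python
import math

# Assumed MRPC split sizes; the article does not state them.
TRAIN_EXAMPLES = 3668
VALID_EXAMPLES = 408


def batches_per_pass(num_examples, batch_size):
    """Number of batches needed to see every example once (the last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


steps_per_epoch = batches_per_pass(TRAIN_EXAMPLES, 32)   # training batch size from the example
validation_steps = batches_per_pass(VALID_EXAMPLES, 64)  # validation batch size from the example
print(steps_per_epoch, validation_steps)
```

Under the same assumption, repeating the training dataset twice with `.repeat(2)` supplies enough batches for two epochs of 115 steps each.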
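Fine-tuning with the example scripts
Sometimes you need to fine-tune a model on a specific dataset, and the Transformers 2.0 project ships a number of Python scripts that can be run directly. For example:
Fine-tuning on GLUE tasks
The following example fine-tunes a model on a GLUE task so that it can be used for sequence classification. It uses the script run_glue.py.
First download the GLUE dataset and install the additional dependencies: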
pip install -r ./examples/requirements.txt
Then run the fine-tuning:
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC
python ./examples/run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=8 \
    --per_gpu_train_batch_size=8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/$TASK_NAME/
When running from the command line, you can choose the model and the relevant training parameters through these flags.
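When the same script has to be launched for several tasks or models, it can help to assemble the argument list programmatically. The following is a rough sketch that only reuses the flags from the command above. The helper name `build_glue_command` and its default values are hypothetical, and the command is printed rather than executed:

```python
import shlex


def build_glue_command(task_name, glue_dir, model_type="bert",
                       model_name="bert-base-uncased", learning_rate="2e-5",
                       epochs="3.0", output_root="/tmp"):
    """Assemble the run_glue.py argument list from the article for one task."""
    return [
        "python", "./examples/run_glue.py",
        "--model_type", model_type,
        "--model_name_or_path", model_name,
        "--task_name", task_name,
        "--do_train", "--do_eval", "--do_lower_case",
        "--data_dir", "{}/{}".format(glue_dir, task_name),
        "--max_seq_length", "128",
        "--per_gpu_eval_batch_size=8",
        "--per_gpu_train_batch_size=8",
        "--learning_rate", learning_rate,
        "--num_train_epochs", epochs,
        "--output_dir", "{}/{}/".format(output_root, task_name),
    ]


command = build_glue_command("MRPC", "/path/to/glue")
# Print a shell-safe version; pass `command` to subprocess.run() to actually launch it.
print(" ".join(shlex.quote(part) for part in command))
```

Keeping the arguments in a list avoids shell-quoting mistakes and makes it straightforward to vary a single parameter, such as the learning rate, between runs.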
Fine-tuning on the SQuAD dataset
You can also use run_squad.py to fine-tune on the SQuAD dataset. The command is as follows:
python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
    --do_eval \
    --do_lower_case \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ../models/wwm_uncased_finetuned_squad/ \
    --per_gpu_eval_batch_size=3 \
    --per_gpu_train_batch_size=3
This command fine-tunes the BERT whole-word-masking model on 8 V100 GPUs and brings its F1 score on SQuAD above 93.
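The `--max_seq_length 384` and `--doc_stride 128` flags matter when a passage is longer than the model's input window. A common approach, sketched below under stated assumptions, is to cover the passage with overlapping windows whose start positions advance by the stride. The function `sliding_windows` and the token counts are hypothetical and are not taken from run_squad.py itself:

```python
def sliding_windows(num_doc_tokens, max_tokens_per_window, doc_stride):
    """Return (start, length) spans that cover a long document with overlap."""
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(max_tokens_per_window, num_doc_tokens - start)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the document
        start += min(length, doc_stride)
    return spans


# Hypothetical numbers: a 600-token passage and a 20-token question.
# 384 - 20 - 3 leaves 361 positions for the passage, assuming three special tokens.
spans = sliding_windows(num_doc_tokens=600, max_tokens_per_window=361, doc_stride=128)
print(spans)
```

With these numbers the passage is split into three overlapping windows, so an answer that straddles one window boundary is still fully contained in a neighboring window.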
Generating text with a model
You can also use run_generation.py to have a pretrained language model generate text:
python ./examples/run_generation.py \
    --model_type=gpt2 \
    --length=20 \
    --model_name_or_path=gpt2
Installation
How do you install such a convenient tool? You need Python 3.5 or later, together with either PyTorch 1.0.0 or later or TensorFlow 2.0.0-rc1.
Then install the library with pip:
pip install transformers
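Before installing, the version requirements above can be checked with a few lines of standard-library Python. This is a minimal sketch: `parse_version` and `meets_minimum` are hypothetical helpers that compare only the leading numeric components, and the version string used in the demonstration is made up:

```python
import sys


def parse_version(text):
    """Turn '1.0.0' or '2.0.0-rc1' into a tuple of leading integers, e.g. (2, 0, 0)."""
    numbers = []
    for part in text.split("-")[0].split("."):
        if not part.isdigit():
            break
        numbers.append(int(part))
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, ignoring pre-release suffixes."""
    return parse_version(installed) >= parse_version(minimum)


python_ok = sys.version_info >= (3, 5)
print("Python OK:", python_ok)
# "1.2.0" is an illustrative value; a real check would read torch.__version__.
print("PyTorch 1.2.0 OK:", meets_minimum("1.2.0", "1.0.0"))
```

In practice the installed version string would come from `torch.__version__` or `tf.__version__` once the corresponding framework is importable.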
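Mobile deployment is coming soon
HuggingFace says on GitHub that it intends to bring these models to mobile devices, and it already provides a repo with code that converts the GPT-2 model to a CoreML model for on-device use.
Going forward, the team plans to continue this work so that users can convert large models to CoreML seamlessly, without any extra scripts.
Repo: https://github.com/huggingface/swift-coreml-transformers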