At the time of writing, Hugging Face's transformers library has 39.5k GitHub stars and may well be the most popular deep learning library around. The same organization also maintains datasets, a library for quickly fetching and processing data. Together, this suite makes the entire machine learning workflow with BERT-style models simpler than ever.
That said, I haven't found a straightforward tutorial online that covers the whole suite end to end, so I'm writing this one in the hope of helping more people get started quickly.
Here we take the AGNews news classification task as an example and walk through the whole pipeline.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2" # pin training to GPU 2; adjust for your machine
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import Trainer, TrainingArguments
from transformers import pipeline
from datasets import load_dataset
The code below takes the first 40,000 examples of the original train split as our training set and examples 40,000-50,000 as the dev set (this subset is already enough to train a decent model, and it keeps training time short); the original test split serves as our test set.
train_dataset = load_dataset("ag_news", split="train[:40000]")
dev_dataset = load_dataset("ag_news", split="train[40000:50000]")
test_dataset = load_dataset("ag_news", split="test")
print(train_dataset)
print(dev_dataset)
print(test_dataset)
Dataset({
features: ['text', 'label'],
num_rows: 40000
})
Dataset({
features: ['text', 'label'],
num_rows: 10000
})
Dataset({
features: ['text', 'label'],
num_rows: 7600
})
The original dataset contains two fields, text and label.
train_dataset.features
{'text': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=4, names=['World', 'Sports', 'Business', 'Sci/Tech'], names_file=None, id=None)}
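As a side note, a ClassLabel feature can convert between integer ids and class names, which is handy when inspecting examples. An optional check, not required for training:

label_feature = train_dataset.features["label"]
print(label_feature.int2str(2))         # 'Business'
print(label_feature.str2int("Sports"))  # 1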
The BERT model expects the label field to be named labels, but in the original dataset it is called label, so we make a small adjustment. The code below copies the label field into labels.
train_dataset = train_dataset.map(lambda examples: {'labels': examples['label']}, batched=True)
train_dataset[0]
{'label': 2,
'labels': 2,
'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again."}
dev_dataset = dev_dataset.map(lambda examples: {'labels': examples['label']}, batched=True)
test_dataset = test_dataset.map(lambda examples: {'labels': examples['label']}, batched=True)
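Alternatively, datasets provides rename_column, which achieves the same end. Note that it drops the original label column, so the features["label"].num_classes lookup used later would need to read features["labels"] instead. A quick sketch on a throwaway slice:

# Alternative (not used below): rename the column instead of copying it.
renamed = load_dataset("ag_news", split="train[:10]").rename_column("label", "labels")
print(renamed.features.keys())  # dict_keys(['text', 'labels'])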
For quick experimentation we pick the small bert-tiny model. Load the model and its tokenizer:
model_id = 'prajjwal1/bert-tiny'
# note that we need to specify the number of classes for this task
# we can directly use the metadata (num_classes) stored in the dataset
model = AutoModelForSequenceClassification.from_pretrained(model_id,
num_labels=train_dataset.features["label"].num_classes)
tokenizer = AutoTokenizer.from_pretrained(model_id)
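As an optional sanity check, we can count the parameters to see just how tiny this model is, which is why full fine-tuning finishes quickly even on a modest GPU:

# Optional: total trainable parameter count (bert-tiny is only a few million).
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params:,} parameters")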
Preprocess the dataset with BERT's tokenizer, padding or truncating every sequence to 256 tokens.
MAX_LENGTH = 256
train_dataset = train_dataset.map(lambda e: tokenizer(e['text'], truncation=True, padding='max_length', max_length=MAX_LENGTH), batched=True)
dev_dataset = dev_dataset.map(lambda e: tokenizer(e['text'], truncation=True, padding='max_length', max_length=MAX_LENGTH), batched=True)
test_dataset = test_dataset.map(lambda e: tokenizer(e['text'], truncation=True, padding='max_length', max_length=MAX_LENGTH), batched=True)
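Padding everything to a fixed 256 tokens is simple but wasteful, since most AGNews texts are far shorter. As an alternative (not used in the rest of this article), transformers offers DataCollatorWithPadding for dynamic per-batch padding; the sketch below assumes you tokenize with truncation only and let the collator pad each batch to its longest sequence:

from transformers import DataCollatorWithPadding

# Tokenize without padding; the collator pads each batch on the fly.
dynamic_train = load_dataset("ag_news", split="train[:40000]").map(
    lambda e: tokenizer(e['text'], truncation=True, max_length=MAX_LENGTH),
    batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Then pass data_collator=data_collator to the Trainer defined below.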
To feed the data into a PyTorch model for training, we also declare the tensor format and the columns to use.
train_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
dev_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
test_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
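After set_format, indexing the dataset returns PyTorch tensors, so it plugs straight into a standard DataLoader. An optional shape check:

from torch.utils.data import DataLoader

loader = DataLoader(train_dataset, batch_size=4)
batch = next(iter(loader))
print({k: v.shape for k, v in batch.items()})
# e.g. {'input_ids': torch.Size([4, 256]), 'token_type_ids': torch.Size([4, 256]),
#       'attention_mask': torch.Size([4, 256]), 'labels': torch.Size([4])}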
Our training examples now look like this and can go straight into BERT for training.
train_dataset.features
{'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
'label': ClassLabel(num_classes=4, names=['World', 'Sports', 'Business', 'Sci/Tech'], names_file=None, id=None),
'labels': Value(dtype='int64', id=None),
'text': Value(dtype='string', id=None),
'token_type_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
train_dataset[0]
{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
'input_ids': tensor([ 101, 2813, 2358, 1012, 6468, 15020, 2067, 2046, 1996, 2304,
1006, 26665, 1007, 26665, 1011, 2460, 1011, 19041, 1010, 2813,
2395, 1005, 1055, 1040, 11101, 2989, 1032, 2316, 1997, 11087,
1011, 22330, 8713, 2015, 1010, 2024, 3773, 2665, 2153, 1012,
102, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0]),
'labels': tensor(2),
'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}
We can specify the validation metrics to be reported during training.
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
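The pred object the Trainer passes in is an EvalPrediction holding the model's raw logits (predictions) and the gold labels (label_ids). We can sanity-check the function with dummy values, an optional step:

import numpy as np
from transformers import EvalPrediction

dummy = EvalPrediction(predictions=np.array([[0.1, 0.9, 0.0, 0.0],
                                             [0.8, 0.1, 0.05, 0.05]]),
                       label_ids=np.array([1, 0]))
print(compute_metrics(dummy))  # both predictions correct, so every metric is 1.0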
Specify the training arguments and train directly with the Trainer.
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    learning_rate=3e-4,
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=64,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    logging_dir='./logs',            # directory for storing logs
    logging_steps=100,
    do_train=True,
    do_eval=True,
    no_cuda=False,
    load_best_model_at_end=True,
    # eval_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch"            # must match evaluation_strategy when load_best_model_at_end=True
)
trainer = Trainer(
    model=model,                      # the instantiated 🤗 Transformers model to be trained
    args=training_args,               # training arguments, defined above
    train_dataset=train_dataset,      # training dataset
    eval_dataset=dev_dataset,         # evaluation dataset
    compute_metrics=compute_metrics
)
train_out = trainer.train()
The metrics table can't be displayed in this article, but during training, or in trainer_state.json under results/checkpoint-XXX, you can see that the model reached an F1 of 0.899 at the second epoch.
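Since load_best_model_at_end=True restored the best checkpoint, we can also score it on the held-out test set with a one-liner (not part of the original walkthrough):

test_metrics = trainer.evaluate(eval_dataset=test_dataset)
print(test_metrics)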
A pipeline can load the trained model and tokenizer directly and then run classification on raw text, with no manual preprocessing needed.
First we move the model back to the CPU for prediction.
model = model.cpu()
We use sentiment-analysis to indicate that this is a text classification task (sentiment analysis is a representative text classification task) and point the pipeline at the model we just trained.
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
Let's pick an example the model has never seen from the test set and run a prediction.
test_examples = load_dataset("ag_news", split="test[:10]")
test_examples[0]
{'label': 2,
'text': "Fears for T N pension after talks Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."}
This text's true label is 2; can the model predict it correctly?
result = classifier(test_examples[0]['text'])
result
[{'label': 'LABEL_2', 'score': 0.9601152539253235}]
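The generic LABEL_2 name comes from the model config's default id2label mapping; we can translate it back into the dataset's class name with the ClassLabel metadata (an optional step):

label_names = test_examples.features["label"].names
pred_id = int(result[0]['label'].split('_')[-1])
print(label_names[pred_id])  # 'Business'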
Prediction correct!
That completes our Hugging Face full-suite walkthrough~
The complete code for this article can be found here: https://github.com/blmoistawinde/hello_world/blob/master/huggingface_classification.ipynb