Text Decoding Principles: MindNLP

查拉图斯特拉说

Published 2024-07-29 00:37:14

Included in the column: Backend Architecture

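Preface

An autoregressive language model predicts the next word from the preceding text.

The probability distribution of a text sequence can be decomposed into the product of the conditional probabilities of each word given its preceding context.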
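Written out, the decomposition above is the chain rule of probability. The notation below is an assumption chosen to match the greedy search formula used in this article: $T$ is the sequence length and $w_{1:0}$ denotes the empty context.

```latex
P(w_{1:T}) = \prod_{t=1}^{T} P\left(w_t \mid w_{1:t-1}\right)
```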

Greedy search

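At each time step $t$, greedy search simply selects the word with the highest probability as the current output word:

$w_t = \operatorname{argmax}_{w} P(w \mid w_{1:t-1})$

With greedy search, the output sequence ("The", "nice", "woman") has an overall probability of 0.5 × 0.4 = 0.2.

Drawback: greedy search misses high-probability words hidden behind lower-probability words. For example, "has" (0.9) follows "dog" (0.4), a word greedy search never selects because "nice" (0.5) scores higher at the first step.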
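The drawback can be reproduced on a toy next-word table. This is only a minimal sketch: the table `NEXT` and the helpers `greedy` and `exhaustive` are hypothetical, and the probabilities are illustrative assumptions. Only 0.5 for "nice", 0.4 for "woman" and 0.9 for "has" come from the example, and "dog" is set to 0.4 so that "nice" is the unambiguous greedy choice.

```python
# Toy next-word table: context tuple -> {word: conditional probability}.
# All values are illustrative assumptions, not real model outputs.
NEXT = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def greedy(start, steps):
    """Pick the single most probable next word at every step."""
    seq, prob = tuple(start), 1.0
    for _ in range(steps):
        dist = NEXT[seq]
        word = max(dist, key=dist.get)
        seq, prob = seq + (word,), prob * dist[word]
    return seq, prob


def exhaustive(start, steps):
    """Score every possible continuation and return the most probable one."""
    candidates = [(tuple(start), 1.0)]
    for _ in range(steps):
        candidates = [
            (seq + (word, ), p * q)
            for seq, p in candidates
            for word, q in NEXT[seq].items()
        ]
    return max(candidates, key=lambda c: c[1])


print("greedy:    ", greedy(["The"], 2))
print("exhaustive:", exhaustive(["The"], 2))
```

Under these assumed numbers, greedy search commits to "nice" at the first step and ends at ("The", "nice", "woman"), while scoring every path finds ("The", "dog", "has"), whose probability 0.4 × 0.9 is higher. Exhaustive scoring is only feasible on a toy table like this one; it is used here purely to expose what greedy search misses.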

```python
# greedy_search
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
```

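Word selection rules

An example

Restricting the sampling pool to a fixed size K, as Top-K sampling does, has two drawbacks:

When the distribution is sharp, it can produce gibberish.
When the distribution is flat, it limits the model's creativity.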
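Both drawbacks can be seen in a small pure-Python sketch of a fixed-size sampling pool. The helper names `top_k_filter` and `sample` and both toy distributions are hypothetical and chosen only for illustration. This is not MindNLP's implementation.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize them to sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}


def sample(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


# Sharp distribution: one word dominates, yet K=3 still admits two
# unlikely words that can surface as gibberish.
sharp = {"dog": 0.94, "cat": 0.03, "xylophone": 0.02, "the": 0.01}

# Flat distribution: many words are plausible, yet K=3 discards
# half of them and narrows what the model can say.
flat = {"park": 0.18, "beach": 0.17, "woods": 0.17,
        "city": 0.16, "hills": 0.16, "garden": 0.16}

sharp_pool = top_k_filter(sharp, k=3)
flat_pool = top_k_filter(flat, k=3)

rng = random.Random(0)  # fixed seed so repeated runs draw the same words
print("sharp pool:", sorted(sharp_pool), "->", sample(sharp_pool, rng))
print("flat pool: ", sorted(flat_pool), "->", sample(flat_pool, rng))
```

With the same K, the sharp case keeps "xylophone" in the pool even though it is very unlikely, while the flat case drops "city", "hills" and "garden" even though they are almost as likely as the words that were kept.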

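Summary

This article introduced the principle of autoregressive language models and several text generation methods, including greedy search, beam search, and sampling. Greedy search picks the highest-probability word at each time step and can easily miss sequences with a higher overall probability. Beam search keeps several candidate word sequences, which improves results but still suffers from repetition. Sampling methods generate more diverse text by choosing words at random, but the text may be less coherent. Overall, each method has its own strengths and weaknesses, and the choice should be made and tuned for the application at hand.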

Originally published: 2024-07-27
