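I'm trying to understand how to prepare paragraphs for ELMo vectorization.
The docs only show how to embed multiple sentences/words at a time.
For example: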
sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog", ""]]
elmo(
    inputs={
        "tokens": sentences,
        "sequence_len": [6, 5]
    },
    signature="tokens",
    as_dict=True
)["elmo"]
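As I understand it, this returns two vectors, each representing one of the given sentences. How would I prepare the input data to vectorize a whole paragraph that contains multiple sentences? Note that I would like to use my own preprocessing.
Can it be done like this?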
sentences = [["<s>", "the", "cat", "is", "on", "the", "mat", ".", "</s>",
              "<s>", "dogs", "are", "in", "the", "fog", ".", "</s>"]]
Or maybe like this?
sentences = [["the", "cat", "is", "on", "the", "mat", ".",
"dogs", "are", "in", "the", "fog", "."]]
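Posted on 2018-12-01 11:42:31
ELMo produces contextual word vectors. The vector for a word is therefore a function of both the word and the context it appears in, e.g. the sentence.
As in your example from the docs, you want your paragraph to be a list of sentences, each of which is a list of tokens. So, your second example. To get this format, you could use the spacy tokenizer: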
import spacy
# you need to install the language model first. See spacy docs.
nlp = spacy.load('en_core_web_sm')
text = "The cat is on the mat. Dogs are in the fog."
toks = nlp(text)
sentences = [[w.text for w in s] for s in toks.sents]
I don't think you need the extra padding "" in the second sentence, as sequence_len takes care of this.
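If the batch does turn out to need equal-length rows (the docs example in the question pads the shorter sentence with ""), a small helper can do the padding and compute sequence_len from the unpadded lengths. This is only a sketch: pad_batch is a hypothetical name, not part of ELMo or TF Hub, and it assumes the empty string is an acceptable pad token, as in the docs example.

```python
def pad_batch(sentences, pad_token=""):
    """Pad token lists to a common length.

    Hypothetical helper (not a library function). Returns the padded
    token lists and the original, unpadded lengths.
    """
    lengths = [len(s) for s in sentences]
    max_len = max(lengths, default=0)
    padded = [list(s) + [pad_token] * (max_len - len(s)) for s in sentences]
    return padded, lengths


sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog"]]

# tokens is rectangular; sequence_len holds the true lengths
tokens, sequence_len = pad_batch(sentences)
```

The two results would then go into the "tokens" and "sequence_len" inputs shown in the question.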
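Update
"As I understand it, this returns two vectors, each representing one of the given sentences."
No, this returns a vector for each word in each sentence. If you want the whole paragraph to be the context (for each word), just change it to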
sentences = [["the", "cat", "is", "on", "the", "mat", "dogs", "are", "in", "the", "fog"]]
and
...
"sequence_len": [11]
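If the paragraph has already been split into sentences (for instance by the spacy snippet above), the single-sequence form and its length can be derived rather than typed by hand. A minimal sketch, where paragraph_as_one_sequence is a hypothetical helper name and not part of any library:

```python
from itertools import chain


def paragraph_as_one_sequence(sentences):
    """Flatten a list of token lists into a batch holding one sequence.

    Hypothetical helper. Returns the one-row batch and a matching
    one-element list of lengths.
    """
    tokens = list(chain.from_iterable(sentences))
    return [tokens], [len(tokens)]


sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog"]]

# batch has one row with all 11 tokens; sequence_len is [11]
batch, sequence_len = paragraph_as_one_sequence(sentences)
```

Here batch and sequence_len correspond to the "tokens" and "sequence_len" inputs, so every word is embedded with the whole paragraph as its context.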
https://stackoverflow.com/questions/53570918