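I have a Moby Dick corpus and I need to calculate the probability of the bigram "ivory leg". I know this command gives me a list of all the bigrams:
bigrams = [w1+" "+w2 for w1,w2 in zip(words[:-1], words[1:])]
But how can I get the probability of just these two words?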
Posted on 2020-07-13 01:00:51
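You can count all the bigrams, then count the specific bigram you are looking for. The probability of the bigram occurring, P(bigram), is the quotient of those two counts. The conditional probability of the second word given the first, P(w2 | w1), is the number of occurrences of the bigram divided by the number of occurrences of w1.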
s = 'this is some text about some text but not some other stuff'.split()
bigrams = [(s1, s2) for s1, s2 in zip(s, s[1:])]
# [('this', 'is'),
# ('is', 'some'),
# ('some', 'text'),
# ('text', 'about'),
# ...
number_of_bigrams = len(bigrams)
# 11
# how many times 'some' occurs
some_count = s.count('some')
# 3
# how many times bigram occurs
bg_count = bigrams.count(('some', 'text'))
# 2
# probability of 'text' given 'some', P(text | some)
# i.e. you found `some`, what's the probability that it is followed by 'text':
bg_count/some_count
# 0.666
# probability of the bigram in the text, P(some text)
# i.e. pick a bigram at random, what's the probability it's your bigram:
bg_count/number_of_bigrams
# 0.181818
https://stackoverflow.com/questions/62867820
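To apply the same counting to the "ivory leg" case from the question, the two quotients can be wrapped in a small helper. This is only a sketch: the function name `bigram_probabilities` and the stand-in sentence are made up for illustration, and it assumes `words` is already tokenized and case-normalized the way you want. Unlike `s.count(...)` above, it counts the first word only where it can start a bigram (it ignores the final token), which differs only when the first word happens to be the last token of the text.

```python
from collections import Counter

def bigram_probabilities(words, w1, w2):
    """Return (P(w1 w2), P(w2 | w1)) estimated from raw counts in `words`."""
    # Count every adjacent pair of tokens
    bigram_counts = Counter(zip(words, words[1:]))
    # Count w1 only in positions where it starts a bigram (skip the final token)
    first_word_counts = Counter(words[:-1])

    bg_count = bigram_counts[(w1, w2)]
    total_bigrams = sum(bigram_counts.values())

    # Guard against empty input or an unseen first word
    joint = bg_count / total_bigrams if total_bigrams else 0.0
    conditional = bg_count / first_word_counts[w1] if first_word_counts[w1] else 0.0
    return joint, conditional

# Stand-in sentence (hypothetical); replace with the tokenized Moby Dick corpus
words = "his ivory leg struck the deck and the ivory leg held".split()
joint, conditional = bigram_probabilities(words, "ivory", "leg")
print(joint, conditional)
```

Using `Counter` means the corpus is scanned once for bigrams and once for single words, which matters on a full novel where repeated `list.count` calls would rescan the whole list each time.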