Python NLTK Sentiment Analysis Gives Incorrect Results

用户11021319
Published 2024-07-29 11:02:00

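1. Problem Background

A Reddit user trained a naive Bayes classifier with Python's NLTK library to analyze the sentiment of new sentences, but the classifier predicted "positive" no matter which sentence was entered.

2. Solution

On close inspection, the problem in the original code was that wordList was empty. With no words to test for, the feature extractor returns the same empty feature set for every sentence, so the classifier has nothing to go on except the label priors and always returns the same label. The fix is to assign wordList the word features extracted from the tweets. The modified code is as follows: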

wordList = getwordfeatures(getwords(tweets))
wordList = [i for i in wordList if not i in stopwords.words('english')]
wordList = [i for i in wordList if not i in customstopwords]
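
The effect of an empty wordList can be seen without training anything. The sketch below is only an illustration: it uses a variant of feature_extractor that takes the word list as a parameter (an assumption made here so the two cases can sit side by side), and the two sample sentences and the three-word list are made up.

```python
def feature_extractor(doc, word_list):
    """Variant of the article's feature_extractor; word_list is passed in
    explicitly (an assumption for this demo) instead of read from a global."""
    docwords = set(doc)
    return {'contains(%s)' % w: (w in docwords) for w in word_list}


# Made-up sample sentences, already lowercased and split into words.
happy = "i love this song".split()
sad = "i hate this song".split()

# Buggy case: an empty word list gives every sentence the same empty features.
empty_happy = feature_extractor(happy, [])
empty_sad = feature_extractor(sad, [])
print(empty_happy == empty_sad)   # True: the sentences are indistinguishable

# Fixed case: a populated word list gives the two sentences different features.
word_list = ['love', 'hate', 'song']  # hypothetical word features
fixed_happy = feature_extractor(happy, word_list)
fixed_sad = feature_extractor(sad, word_list)
print(fixed_happy != fixed_sad)   # True: the classifier now has evidence
```

When every input maps to the same empty dictionary, a naive Bayes classifier can only return whichever label its priors favor, which matches the "always positive" symptom described above.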

The complete fixed code is as follows:

# Written for Python 3 and NLTK 3. The stopwords corpus must be installed,
# for example by running nltk.download('stopwords') once.
import os

import nltk
from nltk.corpus import stopwords

__location__ = os.path.realpath(
    os.path.join(os.getcwd(), os.path.dirname(__file__)))

postweet = os.path.join(__location__, "postweet.txt")
negtweet = os.path.join(__location__, "negtweet.txt")

customstopwords = ['band', 'they', 'them']

# Load positive tweets into a list, one tweet per line.
with open(postweet, 'r', encoding='utf-8') as p:
    postxt = p.readlines()

# Load negative tweets into a list, one tweet per line.
with open(negtweet, 'r', encoding='utf-8') as n:
    negtxt = n.readlines()

# Create lists of (tweet, sentiment) tuples, one per tweet.
postagged = [(tweet, 'positive') for tweet in postxt]
negtagged = [(tweet, 'negative') for tweet in negtxt]

# Combine all of the tagged tweets into one large list.
taggedtweets = postagged + negtagged

tweets = []

# Create a list of lowercased words for each tweet, within a tuple.
for (text, sentiment) in taggedtweets:
    word_filter = [i.lower() for i in text.split()]
    tweets.append((word_filter, sentiment))

# Pull out all of the words in a list of tagged tweets, formatted in tuples.
def getwords(tweets):
    allwords = []
    for (words, sentiment) in tweets:
        allwords.extend(words)
    return allwords

# Return the distinct words in a list of words, most frequent first.
def getwordfeatures(listofwords):
    # Print out wordfreq if you want to have a look at the individual counts of words.
    wordfreq = nltk.FreqDist(listofwords)
    # most_common() sorts by count; keys() alone is not frequency-ordered in NLTK 3.
    return [word for (word, count) in wordfreq.most_common()]

# Calls above functions - gives us list of the words in the tweets, ordered by freq.
print(getwordfeatures(getwords(tweets)))

# The fix: wordList must hold the word features extracted from the tweets.
# The stop words are loaded once into a set so each lookup is cheap.
english_stopwords = set(stopwords.words('english'))
wordList = getwordfeatures(getwords(tweets))
wordList = [i for i in wordList if i not in english_stopwords]
wordList = [i for i in wordList if i not in customstopwords]

# Map a tokenized document to {'contains(word)': True/False} for each word in wordList.
def feature_extractor(doc):
    docwords = set(doc)
    features = {}
    for i in wordList:
        features['contains(%s)' % i] = (i in docwords)
    return features

# Creates a training set - classifier learns distribution of true/falses in the input.
training_set = nltk.classify.apply_features(feature_extractor, tweets)
classifier = nltk.NaiveBayesClassifier.train(training_set)

# show_most_informative_features() prints its table itself and returns None.
classifier.show_most_informative_features(n=30)

while True:
    sentence = input('Enter a sentence ("exit" to quit): ')
    if sentence == 'exit':
        break
    elif sentence == 'informfeatures':
        classifier.show_most_informative_features(n=30)
        continue
    else:
        words = sentence.lower().split()
        print('\nWe think that the sentiment was ' + classifier.classify(feature_extractor(words)) + ' in that sentence.\n')

You can adjust the customstopwords list as needed to filter out words that are irrelevant to sentiment.
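
As a rough illustration of that adjustment, the sketch below filters a candidate feature list against a custom stop-word list. The helper name filter_features and the sample words are hypothetical, and the helper is a standalone stand-in for the list comprehensions in the code above, not part of NLTK.

```python
def filter_features(words, stop_words):
    """Return words in their original order, minus anything in stop_words.

    Hypothetical helper for illustration; comparison is case-insensitive.
    """
    stop_set = {w.lower() for w in stop_words}
    return [w for w in words if w.lower() not in stop_set]


customstopwords = ['band', 'they', 'them']

# Made-up candidate features, as they might come out of getwordfeatures().
candidate_words = ['great', 'band', 'awful', 'they', 'love']

filtered = filter_features(candidate_words, customstopwords)
print(filtered)  # ['great', 'awful', 'love']
```

Words that name the topic rather than express an opinion, such as "band" here, are the usual candidates for the custom list.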

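Original content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

For infringement concerns, contact cloudcommunity@tencent.com to request removal.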