A Different Look at the Contestants of 《青春有你2》 with Python: Visualizing Comment Content

极简小课

Published 2022-06-27 14:09:08

Included in the column: 极简小课

In this post we analyze the data scraped in the previous article: we count word frequencies, visualize them, and draw a word cloud.

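The modules we use:

The jieba module

Full documentation and usage are on GitHub: https://github.com/fxsjy/jieba. We will not repeat it all here and only cover the parts used in this post.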

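1. Word segmentation

The jieba.cut method accepts four input parameters: the string to segment; cut_all, which controls whether full mode is used; HMM, which controls whether the HMM model is used; and use_paddle, which controls whether paddle-mode segmentation is used. Paddle mode is loaded lazily: the enable_paddle interface installs paddlepaddle-tiny and imports the related code.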

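2. Loading a dictionary

Developers can supply their own custom dictionary to cover words that are missing from the jieba dictionary. jieba can recognize new words on its own, but adding them yourself gives higher accuracy.

Usage: jieba.load_userdict(file_name) # file_name is a file-like object or the path to a custom dictionary

The dictionary format is the same as dict.txt: one word per line, and each line has three parts, namely the word, the frequency (optional), and the part-of-speech tag (optional), separated by spaces and always in that order. If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.

When the frequency is omitted, jieba uses an automatically computed frequency that guarantees the word can be segmented out.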
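To make the file format concrete, here is a minimal sketch that writes a small user dictionary in that layout. The helper name write_userdict, the file name userdict_demo.txt, and the sample entries with their frequencies and tags are all made up for illustration; only the line format comes from the description above.

```python
def write_userdict(path, entries):
    """Write (word, freq, tag) entries in jieba's user-dictionary format.

    freq and tag may be None, in which case they are left out of the line.
    """
    with open(path, "w", encoding="utf-8") as f:  # the file must be UTF-8
        for word, freq, tag in entries:
            parts = [word]
            if freq is not None:
                parts.append(str(freq))
            if tag is not None:
                parts.append(tag)
            f.write(" ".join(parts) + "\n")


# Hypothetical entries: (word, frequency, part-of-speech tag)
entries = [
    ("青春有你", 10, "nz"),  # all three parts
    ("小姐姐", 5, None),     # tag omitted
    ("评论", None, None),    # frequency and tag omitted
]
write_userdict("userdict_demo.txt", entries)
```

The resulting file could then be loaded with jieba.load_userdict("userdict_demo.txt") before segmenting.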

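We implement this in the following steps:

1. Count word frequencies and visualize them

Chinese word segmentation: add new words, remove stop words

Count the top 10 most frequent words

Visualize the most frequent words

2. Draw the word cloud

Draw the word cloud from the word frequencies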

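Word frequency counting and visualization

Here we use jieba to split the comment text into words and build a stop-word list to filter out words that carry no meaning, such as 呵呵 ("hehe") and 啊 ("ah"). Fairly complete stop-word lists can be found with a web search (for example on Baidu); add a few entries of your own as needed and the result is good enough.

The jieba segmentation code is as follows: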

import jieba  # Chinese word segmentation
from collections import Counter
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator  # word cloud module
import paddlehub as hub
import random

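def words_split(text):
    """
    Segment text with jieba.
    text: the sentence or text to segment
    return: list of words with stop words removed
    """
    stop_file_path = "cn_stopwords.txt"
    sentence_seged = jieba.cut(text.strip())
    # load the stop words from this path (the file is re-read on every call)
    stopwords = stop_words_list(stop_file_path)
    out_lists = []
    for word in sentence_seged:
        # keep the word unless it is a stop word or a tab character
        if word not in stopwords and word != '\t':
            out_lists.append(word)
    return out_lists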


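def stop_words_list(filepath):
    """
    Build the stop-word list.
    filepath: path to the stop-word text file
    return: list of stop words
    """
    # the with block closes the file once the list has been read
    with open(filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords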


def move_stop_words():
    """
    Remove stop words and count word frequencies.
    Reads content.txt, writes the remaining words to output.txt
    and the counts to cipin.txt.
    return: counts, a Counter of word frequencies
    """
    jieba.load_userdict("userdict.txt")  # load the custom dictionary

    des_file_path = "content.txt"
    inputs = open(des_file_path, 'r')  # path of the file to process
    # outputs = open('output.txt', 'w')  # path of the processed file
    all_words = []
    for line in inputs:
        all_words += words_split(line)  # words_split returns a list
        # outputs.write(line_seg)
    # outputs.close()
    inputs.close()
    # WordCount
    # write the words left after stop-word removal; 'w' so reruns do not accumulate
    with open('output.txt', 'w') as fw:
        for word in all_words:
            fw.write(word + " ")

    counts = Counter(all_words)
    # print(counts)
    data = dict(counts)

    with open('cipin.txt', 'w') as fw:  # store the counts, one "word,count" per line
        for k, v in data.items():
            fw.write('%s,%d\n' % (k, v))

    return counts
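The counting step relies on collections.Counter from the standard library. A small self-contained sketch, with a made-up token list and stop-word set rather than the real comment data, shows the filtering and counting in isolation:

```python
from collections import Counter

# Made-up tokens standing in for jieba output
tokens = ["加油", "啊", "好看", "加油", "呵呵", "喜欢", "加油", "好看"]
stopwords = {"啊", "呵呵"}  # a set gives constant-time membership tests

kept = [w for w in tokens if w not in stopwords]
counts = Counter(kept)

print(counts.most_common(2))  # [('加油', 3), ('好看', 2)]
```

Holding the stop words in a set instead of a list is a design choice worth considering for large comment files, since every token is checked against it.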

Drawing the top-10 word-frequency pie chart

The following function draws a pie chart of the 10 most frequent words:

def draw_counts(counts, num):
    """
    Draw a pie chart of the most frequent words.
    counts: word-frequency Counter  num: how many top words to draw
    return: None
    """
    top = counts.most_common(num)
    # print(top)

    word_list = [word for word, _ in top]
    count_list = [count for _, count in top]

    # print(word_list)
    # print(count_list)

    # set the plotting style
    # plt.style.use('ggplot')
    # enable Chinese text
    plt.rcParams['font.sans-serif'] = ['SimHei']  # default font

    # 0.1 pulls a wedge away from the center; here the top two words stand out
    explode = [0.1 if i < 2 else 0 for i in range(len(count_list))]
    plt.figure(figsize=(5, 5))

    # equal axis scaling keeps the pie a circle instead of an ellipse
    plt.axes(aspect='equal')

    plt.pie(
        count_list, labels=word_list, colors=None, autopct='%.0f%%',
        explode=explode, pctdistance=0.6, shadow=True, labeldistance=1.1, startangle=30, radius=1,
        counterclock=True, wedgeprops=None,
        textprops={'fontsize': 15, 'color': 'k'}, center=(0, 0), frame=False
    )

    # chart title: "Top 10 hot words in 《青春有你2》 comments"
    plt.title('''《青春有你2》评论热词TOP10''', fontsize=14)
    plt.savefig('pie_hotwords.jpg')
    plt.show()
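One detail to keep in mind when reading the chart: with autopct, matplotlib labels each wedge with its share of the values passed to plt.pie, so the percentages are relative to the top-N total, not to all words in the comments. The sketch below reproduces that arithmetic with made-up counts; top_shares is a hypothetical helper and not part of the original code.

```python
from collections import Counter


def top_shares(counts, num):
    """Return (word, percent) pairs; percent is the share of the top-num total."""
    top = counts.most_common(num)
    total = sum(count for _, count in top)
    return [(word, 100.0 * count / total) for word, count in top]


# Made-up counts for illustration
counts = Counter({"加油": 60, "好看": 30, "喜欢": 10, "可爱": 100})
for word, pct in top_shares(counts, 3):
    print("%s %.0f%%" % (word, pct))
```

With these numbers the three shares add up to 100% even though the fourth word, 喜欢, is left out of the chart entirely.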

Drawing the word cloud from word frequencies

Building on the word-frequency step, we now create the word cloud image.

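def draw_cloud(word_f):
    """
    Draw a word cloud.
    word_f: space-separated text of the segmented words;
            WordCloud.generate derives the frequencies from it
    return: None
    """
    # 1. load the custom background (mask) image
    image = Image.open(r'chinaheart.png')
    graph = np.array(image)

    # 2. generate the word cloud
    # with a mask image, the output size is set by the mask's pixel size
    wc = WordCloud(font_path=r"simhei.ttf", background_color='white', max_font_size=50, mask=graph)
    wc.generate(word_f)

    # 3. color the words using the background image as reference
    image_color = ImageColorGenerator(graph)  # derive colors from the background image
    wc.recolor(color_func=image_color)
    wc.to_file(r"word_cloud.png")  # saved at the mask's size; sharper than the on-screen display below

    # Alternative without a mask image. WordCloud's default font does not cover Chinese,
    # so a downloaded Chinese font file is required. Without a mask, set the output pixel
    # size explicitly; the default background is black. For a uniform text color use
    # mode='RGBA' and colormap='pink'.
    # wc = WordCloud(font_path=r"simhei.ttf",background_color='white',width=800,height=600,max_font_size=50,
    #             max_words=1000) #,min_font_size=10)#,mode='RGBA',colormap='pink')
    # wc.generate(word_f)
    # wc.to_file(r"wordcloud.png")  # saved at the configured width and height; sharper than the display below

    # 4. show the image
    plt.figure("词云图")  # name of the figure window ("word cloud")
    plt.imshow(wc)  # display the word cloud as an image
    plt.axis("off")  # hide the axes
    plt.show()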
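draw_cloud passes raw text to WordCloud.generate, which counts the words again by itself. Since the counts were already saved to cipin.txt, one alternative is to read them back and hand them to WordCloud.generate_from_frequencies. The following is only a sketch under assumptions: load_counts and cloud_from_counts are hypothetical helper names, the output file name is made up, and the font path is the simhei.ttf used above.

```python
def load_counts(path, encoding=None):
    """Read 'word,count' lines (the cipin.txt layout) into a dict.

    encoding=None uses the platform default, matching how cipin.txt was written.
    """
    freqs = {}
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            line = line.rstrip("\n")
            # split on the last comma so a word that is itself "," still parses
            word, _, count = line.rpartition(",")
            if word.strip():  # skip blank lines and whitespace-only tokens
                freqs[word] = int(count)
    return freqs


def cloud_from_counts(freqs, font_path="simhei.ttf", out_path="word_cloud_freq.png"):
    """Render a word cloud directly from a {word: count} mapping."""
    from wordcloud import WordCloud  # imported here so load_counts works without it

    wc = WordCloud(font_path=font_path, background_color="white",
                   width=800, height=600, max_font_size=50)
    wc.generate_from_frequencies(freqs)
    wc.to_file(out_path)
    return wc
```

A call such as cloud_from_counts(load_counts("cipin.txt")) would then draw the cloud from the stored counts, so the word sizes follow exactly the numbers computed by move_stop_words.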

Running the main function

if __name__ == "__main__":
    # visualize the top 10 hot words
    word_counts = move_stop_words()
    draw_counts(word_counts, 10)

    # draw the word cloud
    with open(r'output.txt', 'r') as f:
        draw_cloud(f.read())

    # use PaddleHub to analyze 10 randomly sampled comments
    # (text_detection is not defined in the code shown above)
    analyze_results = text_detection(10)
    for result in analyze_results:
        print(result)

Results