Data Classification: Automatic Classification of News

马拉松程序员

Published 2023-09-21 19:02:57


Included in the column: 马拉松程序员的专栏


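1. Download the data and count the news items

After downloading and unpacking the data, the file is named news_sohusite_xml.smarty.dat (the mini version) and is encoded in GBK. Each record has the following format:

<doc>
<url>page URL</url>
<docno>page ID</docno>
<contenttitle>page title</contenttitle>
<content>page content</content>
</doc>

Looking at the page URLs shows that Sohu organizes its news by subdomain. For example, the first news item in the file is: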

<doc>
<url>http://gongyi.sohu.com/20120706/n347457739.shtml</url>
<docno>98590b972ad2f0ea-34913306c0bb3300</docno>
<contenttitle>深圳地铁将设立ＶＩＰ头等车厢买双倍票可享坐票</contenttitle>
<content>news content</content>
</doc>

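This item's subdomain is "gongyi" (public welfare), so that is its category. In general, we can take the subdomain keyword as the category of each news item. Because the data is in XML, we parse it with the minidom module. URLs are parsed with the urlparse() function from the urllib module; the hostname attribute of the result is the URL's domain name.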
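
As a minimal sketch of this idea (the function name category_from_url and the in-memory sample are illustrative assumptions, not part of the original scripts), the snippet below wraps one record in a root node, parses it with minidom, and takes the third-from-last label of the hostname as the category:

```python
from urllib.parse import urlparse
from xml.dom import minidom

# Illustrative sample: one <doc> record wrapped in a made-up <word> root node,
# because the raw file has no root element. Title and body are placeholders.
SAMPLE = """<word>
<doc>
<url>http://gongyi.sohu.com/20120706/n347457739.shtml</url>
<docno>98590b972ad2f0ea-34913306c0bb3300</docno>
<contenttitle>title</contenttitle>
<content>body</content>
</doc>
</word>"""


def category_from_url(url):
    """Return the subdomain label (e.g. 'gongyi'), or None if there is none."""
    hostname = urlparse(url).hostname
    if hostname is None:
        return None
    labels = hostname.split('.')
    # 'gongyi.sohu.com' -> ['gongyi', 'sohu', 'com'] -> 'gongyi'
    return labels[-3] if len(labels) >= 3 else None


dom = minidom.parseString(SAMPLE)
for doc in dom.documentElement.getElementsByTagName("doc"):
    url = doc.getElementsByTagName("url")[0].firstChild.data
    print(category_from_url(url))  # prints: gongyi
```

Returning None for hostnames with fewer than three labels avoids an IndexError on URLs that have no subdomain.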

Create the first Python file, t1_check.py, with the following content:

# -*- coding: utf-8 -*-
from xml.dom import minidom
from urllib.parse import urlparse


# Convert the file encoding and turn the content into well-formed XML
def file_format(from_file, to_file):
    '''
    :param from_file: file to clean
    :param to_file: cleaned output file
    :return:
    '''
    try:
        # The source text is GBK-encoded; gb18030 is a superset of GBK and also decodes it
        with open(from_file, 'r', encoding='gb18030') as rf:
            lines = rf.readlines()
        # The raw data has no root node, so add a custom one
        lines.insert(0, '<word>\n')
        lines.append('</word>')
        with open(to_file, 'w', encoding='utf-8') as wf:
            for line in lines:
                # A bare '&' is not valid XML, so drop it
                wf.write(line.replace('&', ''))
    except UnicodeDecodeError:
        print("Encoding conversion failed:", from_file)


def check_handler(file):
    # Maps category name -> number of news items
    hostnameType = dict()
    # Parse the XML text with xml.dom and build a DOM object
    dom = minidom.parse(file)
    root = dom.documentElement
    # Get all doc nodes in the XML
    docs = root.getElementsByTagName("doc")
    # Walk through every doc node and read its URL and content
    for doc in docs:
        url = doc.getElementsByTagName("url")[0]
        content = doc.getElementsByTagName("content")[0]
        # Only count items whose body has at least 50 characters
        if content.firstChild is not None and len(content.firstChild.data) >= 50:
            # Parse the URL hostname; the subdomain decides the category
            hostname = urlparse(url.firstChild.data).hostname
            labels = hostname.split('.') if hostname else []
            if len(labels) < 3:
                continue
            category = labels[-3]
            hostnameType[category] = hostnameType.get(category, 0) + 1
    return hostnameType


if __name__ == '__main__':
    file = 'news_sohusite_xml.dat'
    # Data cleaning: run this once to convert the raw file in place
    # (GBK -> UTF-8 plus a root node), then comment it out again
    #  file_format(file, file)
    print(check_handler(file))

Output: {'gongyi': 192, '2008': 786, 'baobao': 2306, 'women': 5073, 's': 4985, 'roll': 658640, 'auto': 92528, 'learning': 10193, 'book': 3571, 'travel': 1857, 'yule': 47627, 'sports': 40203, 'chihe': 489, 'gd': 1298, 'business': 21083, 'money': 9442, 'news': 76792, 'stock': 44083, 'sh': 1220, 'it': 152691, 'club': 238, 'tv': 808, 'games': 9, 'cul': 1496, 'tuan': 1, '2012': 28, 'health': 21703, 'astro': 323, '2010': 514, 'dm': 26, 'green': 420, 'goabroad': 910, 'men': 944, 'korea': 105, 'v': 6, 'fund': 4284, 'expo2010': 1, 'media': 532, 'bschool': 129}

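The file news_sohusite_xml.dat contains 1,207,536 news items in 39 categories (counting only items whose content is at least 50 characters long). Some categories have too few items, so we drop them. To keep the classes well separated, we choose categories that have many items and clearly distinct topics as training data.

After filtering, we use the following eight categories as the training and test data: {'women': 5073, 'learning': 10193, 'book': 3571, 'yule': 47627, 'sports': 40203, 'business': 21083, 'it': 152691, 'health': 21703}. These categories have enough items and fairly distinct topics. Although 'roll' has as many as 658,640 items, the name suggests rolling news, probably the headlines of the month in which the data was collected; headlines have no clear topic of their own, which makes them a poor fit for classification training. 'stock' also has 44,083 items, but securities (stocks, funds, and so on) are to some extent part of business, so 'stock' overlaps with 'business', and training on both would not work well. Weighing all this, we settle on the eight categories above.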
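
The size part of this selection can be sketched in a few lines. The threshold MIN_ITEMS below is an assumption chosen to match the 3,000 items per category used in step 2; the final choice among the candidates remains a manual judgment about how distinct the topics are.

```python
# Counts copied from the output of t1_check.py above.
counts = {'gongyi': 192, '2008': 786, 'baobao': 2306, 'women': 5073, 's': 4985,
          'roll': 658640, 'auto': 92528, 'learning': 10193, 'book': 3571,
          'travel': 1857, 'yule': 47627, 'sports': 40203, 'chihe': 489,
          'gd': 1298, 'business': 21083, 'money': 9442, 'news': 76792,
          'stock': 44083, 'sh': 1220, 'it': 152691, 'club': 238, 'tv': 808,
          'games': 9, 'cul': 1496, 'tuan': 1, '2012': 28, 'health': 21703,
          'astro': 323, '2010': 514, 'dm': 26, 'green': 420, 'goabroad': 910,
          'men': 944, 'korea': 105, 'v': 6, 'fund': 4284, 'expo2010': 1,
          'media': 532, 'bschool': 129}

MIN_ITEMS = 3000  # assumed threshold: enough items to sample 3,000 per category

# Categories large enough to be considered at all
candidates = {c: n for c, n in counts.items() if n >= MIN_ITEMS}

# The eight categories chosen by hand in the text
selected = ['women', 'learning', 'book', 'yule', 'sports', 'business', 'it', 'health']

# List candidates from largest to smallest; '*' marks the selected ones
for category in sorted(candidates, key=candidates.get, reverse=True):
    mark = '*' if category in selected else ' '
    print(mark, category, candidates[category])
```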

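2. Build the training and test data

We have chosen eight news categories; now we need to extract part of the raw file as the dataset. Because category sizes differ widely, we balance them by taking 3,000 items per category, and we save the body of each item (the text under the content tag) to its own txt file.

For step 2, create a Python file named t2_parse.py with the following content: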

# -*- coding: utf-8 -*-
import os
from xml.dom import minidom
from urllib.parse import urlparse


def parser_handler(file):
    # Number of items saved so far for each selected category
    hostnameType = {'women': 0, 'learning': 0, 'yule': 0, 'sports': 0, 'business': 0, 'book': 0, 'it': 0, 'health': 0}
    # Parse the XML text with xml.dom
    dom = minidom.parse(file)
    root = dom.documentElement
    # Get all doc nodes in the XML
    docs = root.getElementsByTagName("doc")
    # Walk through every doc node and read its URL and content
    i = 0
    for doc in docs:
        url = doc.getElementsByTagName("url")[0]
        content = doc.getElementsByTagName("content")[0]
        if content.firstChild is not None and len(content.firstChild.data) >= 50:
            # Parse the URL hostname; the subdomain decides the category
            url = urlparse(url.firstChild.data)
            i = i + 1
            category = url.hostname.split('.')[-3]
            if category in hostnameType:
                # Save each item under the folder of its category, creating the folder if needed
                folder = os.path.join("news", category)
                os.makedirs(folder, exist_ok=True)
                # Keep at most 3000 items per category
                if hostnameType[category] < 3000:
                    hostnameType[category] = hostnameType[category] + 1
                    # File name: category name + running number
                    path = os.path.join(folder, "%s_%d.txt" % (category, i))
                    with open(path, "wb") as news_in:
                        news_in.write(content.firstChild.data.encode('utf8'))


if __name__ == '__main__':
    parser_handler('news_sohusite_xml.dat')

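After the script runs, the news has been split into eight categories, as the figure shows: each folder holds 3,000 txt files containing the news text of that category. The news folder is created in the same directory as t2_parse.py.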
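
A quick way to confirm the result without opening the folders is to count the files per category. This is a small sketch; the helper name count_files is hypothetical, and it assumes the news folder layout produced by t2_parse.py.

```python
import os


def count_files(root):
    """Return {category: number of .txt files} for each sub-folder of root."""
    result = {}
    if not os.path.isdir(root):
        return result
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if os.path.isdir(folder):
            result[name] = sum(1 for f in os.listdir(folder) if f.endswith('.txt'))
    return result


# After t2_parse.py has run, every selected category should report 3000 files.
print(count_files('news'))
```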

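3. Split into training and test sets

The 3,000 × 8 = 24,000 news items obtained above form the corpus for training and testing. Usually part of the data is used for training and the rest for testing; here the script randomly picks 20% of each category as the test set and uses the remaining 80% for training. The ratio is not fixed (70% training and 30% testing is also common), but the training set should be the larger part so the model has enough data to learn from.

That leaves fewer than 20,000 training items (19,200), which is small compared with typical model training. It is enough for an example, but we should expect the model's accuracy not to be especially high.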
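
The split itself comes down to drawing a random set of test indices. The sketch below shows that step in isolation; the function name split_indices and the fixed seed are assumptions added here so that the split can be reproduced, they are not part of the original script.

```python
import random


def split_indices(n, test_prop=0.2, seed=42):
    """Randomly split range(n) into train and test index lists."""
    rng = random.Random(seed)  # a private generator makes the split repeatable
    test_idx = set(rng.sample(range(n), int(n * test_prop)))
    train_idx = [i for i in range(n) if i not in test_idx]
    return train_idx, sorted(test_idx)


train_idx, test_idx = split_indices(3000)
print(len(train_idx), len(test_idx))  # prints: 2400 600
```

Keeping the test indices in a set makes the membership check constant time, which matters when the same check runs once per file.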

For step 3, create a Python file named t3_manual.py with the following content:


# -*- coding: utf-8 -*-
import glob
import os
import random
import shutil
from threading import Thread
from queue import Queue, Empty


# Create the directory (and any missing parents) if it does not exist
def check_dir_exist(path):
    os.makedirs(path, exist_ok=True)


# Copy files into the new directories
def copyfile(q):
    while True:
        try:
            # get_nowait avoids blocking forever when another thread took the last task
            full_folder, train, test, divodd = q.get_nowait()
        except Empty:
            break
        files = glob.glob(full_folder)
        filenum = len(files)
        testnum = int(filenum * divodd)
        # Draw the test indices at random (600 out of 3000 with the default ratio)
        testls = set(random.sample(range(filenum), testnum))
        for i in range(filenum):
            # Indices in testls go to the test set, all others to the training set
            target = test if i in testls else train
            shutil.copy(files[i], os.path.join(target, os.path.basename(files[i])))


def data_divi(from_dir, to_dir, testProp=0.2):
    '''
    :param from_dir: source folder
    :param to_dir: folder for the split result
    :param testProp: proportion used for testing, 0.2 by default
    :return:
    '''
    train_folder = os.path.join(to_dir, "train")
    test_folder = os.path.join(to_dir, "test")
    check_dir_exist(train_folder)
    check_dir_exist(test_folder)
    q = Queue()
    for basefolder in os.listdir(from_dir):
        full_folder = os.path.join(from_dir, basefolder)
        train = os.path.join(train_folder, basefolder)
        check_dir_exist(train)
        test = os.path.join(test_folder, basefolder)
        check_dir_exist(test)
        full_folder = os.path.join(full_folder, "*.txt")
        q.put((full_folder, train, test, testProp))
    # Start 8 worker threads and wait until all of them have finished
    threads = [Thread(target=copyfile, args=(q,)) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    corpus_dir = 'news'
    exp_path = 'news2'
    testProp = 0.2
    data_divi(corpus_dir, exp_path, testProp)
    print("Split complete")

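After the script finishes, the split result is as shown in the figure.

Once the split is complete, a new "news2" folder exists in the current directory. PyCharm needs to index the files in the current directory, which takes a while; if your machine is short on resources, you can delete the "news" folder by hand, since training and testing from here on read only from news2.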

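4. Feature extraction

With the dataset split, the next step is feature extraction. Section 9.2 introduced two common methods; this example uses the TF-IDF model to extract feature vectors.

You may wonder whether word segmentation and stop-word removal are still needed. They are, but both are handled while building the corpus, before feature extraction.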
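
To make the TF-IDF weighting concrete, here is a simplified pure-Python sketch on three pre-segmented toy documents. It uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is the default weighting of scikit-learn's TfidfVectorizer; tokenization, stop words, and the sparse matrix format are not reproduced. The function name tfidf and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        vec = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


# Toy documents, already segmented (in the real pipeline jieba produces the tokens)
docs = [
    '球队 赢得 比赛'.split(),
    '球队 发布 手机'.split(),
    '手机 价格 下降'.split(),
]
for vec in tfidf(docs):
    print({t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in fewer documents receive a larger weight, so in the first document the words 赢得 and 比赛 outweigh 球队, which also appears in the second document.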

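Before starting, we need two modules: pickle and bunch. Both are simple and easy to understand. pickle is part of the Python standard library and needs no installation; bunch is installed with pip:

pip install bunch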

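pickle provides simple persistence: it stores objects as files on disk. Here it saves the extracted features so that the classifiers can reuse them later without extracting features again for every training run, which saves time and effort. Almost all Python data types (lists, dicts, sets, class instances, and so on) can be serialized with pickle. Its two main functions are pickle.dump(obj, file), which serializes an object and writes the resulting byte stream to a file object, and pickle.load(file), which deserializes the data in a file back into a Python object.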
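
A minimal round trip looks like this. The dictionary and the file name demo_features.data are made up for illustration; the script later in this step stores Bunch objects in the same way.

```python
import os
import pickle
import tempfile

# Any picklable object will do; this dictionary stands in for extracted features.
features = {'label': ['it', 'sports'], 'vocabulary': {'篮球': 0, '手机': 1}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_features.data')  # hypothetical file name
    # Serialize: the file must be opened in binary write mode
    with open(path, 'wb') as f:
        pickle.dump(features, f)
    # Deserialize: the file must be opened in binary read mode
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == features)  # prints: True
```

Because unpickling can execute arbitrary code, only load pickle files that you created yourself or that come from a trusted source.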

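Bunch is a dictionary that supports attribute-style access. It inherits from dict and has all of dict's functionality.

from bunch import Bunch
# Create a Bunch object
a = Bunch()
# Set attributes
a.hello = 'world'
a.foo = Bunch(bun=True)
# Read attributes
print(a.hello)
print(a.foo.bun)
# Output:
# world
# True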
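
To show what "a dictionary with attribute-style access" means, here is a minimal stand-in written from scratch. MiniBunch is a hypothetical class for illustration only; it is not the implementation used by the bunch package.

```python
class MiniBunch(dict):
    """A dict whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        # Store attributes as ordinary dictionary items
        self[name] = value


corpus = MiniBunch(filenames=[], label=[], contents=[])
corpus.label.append('sports')      # attribute access
corpus['source'] = 'news2/train'   # normal dict access still works
print(corpus.source, corpus['label'], sorted(corpus))
```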

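With these two modules in hand, we can start feature extraction. The method is the same as in the previous sections, but for convenience we first gather the whole corpus by category, producing a text collection for each category.

For step 4, create a Python file named t4_tfidf_space.py with the following content: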

# -*- coding: utf-8 -*-

import jieba
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from bunch import Bunch
from concurrent import futures
import sys
import pickle


def readfile(filepath, encoding='utf-8'):
    # Read a text file
    with open(filepath, "rt", encoding=encoding) as fp:
        content = fp.read()
    return content


def writeObject(path, obj):
    # Persist a Python object
    with open(path, "wb") as file_obj:
        pickle.dump(obj, file_obj)


def check_dir_exist(path):
    # Create the directory if it does not exist
    os.makedirs(path, exist_ok=True)


def folder_handler(args):
    """Read every text file in one folder"""
    folder, encoding, seg = args
    print('Scanning:', folder)
    if not os.path.isdir(folder):
        # Not a folder: return empty lists so the caller can skip it
        return [], []
    files = os.listdir(folder)
    content = []
    filenames = []
    for name in files:
        filepath = os.path.join(folder, name)
        text = readfile(filepath, encoding)
        # Segment the text here if requested
        if seg:
            text = ' '.join(jieba.cut(text))
        content.append(text)
        filenames.append(filepath)
    return filenames, content


def corpus_bunch(data_dir, encoding='utf-8', seg=True):
    """
    Build the text collection and return it as a Bunch object
    :param data_dir:    corpus directory, one sub-folder per category: data_dir/category/1.txt
    :param encoding:    corpus encoding
    :param seg:         whether to segment the text
    :return:
    A Bunch object with the following attributes (19200 entries each for the training set):
     filenames (list): file names (strings, paths starting with news2)
     label (list): labels (strings, the name of the folder that holds the file)
     contents (list): text contents (strings)
    """
    if not os.path.isdir(data_dir):
        print('{} is not a folder!'.format(data_dir))
        sys.exit(1)

    corpus = Bunch(filenames=[], label=[], contents=[])
    # Path of each category folder
    folders = [os.path.join(data_dir, d) for d in os.listdir(data_dir)]
    # Use a thread pool to read the category folders
    with futures.ThreadPoolExecutor(max_workers=len(folders)) as executor:
        folders_executor = {executor.submit(folder_handler, (folder, encoding, seg)): folder for folder in folders}
        for fol_exe in futures.as_completed(folders_executor):
            folder = folders_executor[fol_exe]
            filenames, content = fol_exe.result()
            if content:
                # basename works with both '/' and '\\' path separators
                cat_name = os.path.basename(folder)
                content_num = len(content)
                print(cat_name, content_num, sep=': ')
                label = [cat_name] * content_num
                corpus.filenames.extend(filenames)
                corpus.label.extend(label)
                corpus.contents.extend(content)
    return corpus


def vector_space(corpus_dir, stop_words=None, vocabulary=None, encoding='utf-8', seg=True):
    """
    Vectorize a corpus
    """
    vectorizer = TfidfVectorizer(stop_words=stop_words, vocabulary=vocabulary)
    # Build the text collection
    corpus = corpus_bunch(corpus_dir, encoding=encoding, seg=seg)
    tfidf_bunch = Bunch(filenames=corpus.filenames, label=corpus.label, tdm=[], vocabulary={})
    # Compute the TF-IDF vectors.
    # Note: when a vocabulary is passed in, the columns line up with the training matrix,
    # but fit_transform still computes the IDF weights from the corpus given here
    tfidf_bunch.tdm = vectorizer.fit_transform(corpus.contents)
    # Vocabulary of the corpus
    tfidf_bunch.vocabulary = vectorizer.vocabulary_
    return tfidf_bunch


def tfidf_space(data_dir, save_path, stopword_path=None, encoding='utf-8', seg=True):
    '''
    Extract the feature vectors of the corpus and persist them
    :param data_dir: folder that holds the data
    :param save_path: folder for the persisted objects
    :param stopword_path: path of the stop-word file
    :param encoding: character encoding, UTF-8 by default
    :param seg: whether to segment the text, True by default
    :return:
    '''
    stopWord = None
    # Load the stop words from baidu_stopwords.txt
    # To add custom stop words, append them to that file
    if stopword_path:
        stopWord = [wd.strip() for wd in readfile(stopword_path).splitlines()]
    check_dir_exist(save_path)
    train = data_dir + '/train'
    # Vectorize the training texts
    train_tfidf = vector_space(train, stop_words=stopWord, encoding=encoding, seg=seg)
    test = data_dir + '/test'
    # Vectorize the test texts, reusing the training vocabulary
    test_tfidf = vector_space(test, stop_words=stopWord, vocabulary=train_tfidf.vocabulary, encoding=encoding, seg=seg)
    # Persist the training feature vectors
    writeObject(os.path.join(save_path, 'train_tfidf.data'), train_tfidf)
    # Persist the test feature vectors
    writeObject(os.path.join(save_path, 'test_tfidf.data'), test_tfidf)
    # Persist the training vocabulary (a mapping from term to column index)
    writeObject(os.path.join(save_path, 'vocabulary.data'), train_tfidf.vocabulary)


if __name__ == '__main__':
    # Dataset directory
    data_dir = 'news2'
    # Directory for the persisted feature vectors
    feature_space = 'feature_space'
    # Build the TF-IDF feature vectors
    tfidf_space(data_dir, feature_space, stopword_path='baidu_stopwords.txt', seg=True)
    print("done")

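This step is not complicated, but to keep the code well structured each operation is wrapped in its own function, so the listing is fairly long. The result itself is simple. Because a persisted object cannot be inspected directly, set a breakpoint in PyCharm to look at the contents of the train_tfidf object.

The script takes roughly 2-3 minutes, depending on the machine. When it finishes, the train_tfidf and test_tfidf objects and the training-set vocabulary are persisted in the feature_space folder under the current directory. These three files are the basis for model training; when building the classifier we only need to load them into memory.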

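5. Build a generic classifier

The groundwork for text classification is now done; next we train models and build the classifier. To make it easy to compare different classification algorithms, we build a generic classifier that takes a classification algorithm, the training data, and the test data. If no model has been trained for the algorithm yet, it trains one and persists it for later use. If a trained model already exists, it loads the persisted object directly for testing or prediction.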
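
The "train once, then reuse" logic can be sketched independently of any particular algorithm. In the sketch below, load_or_train and fake_training are hypothetical names, and pickle stands in for whatever persistence library the real classifier uses.

```python
import os
import pickle
import tempfile


def load_or_train(model_path, train_fn):
    """Load the model at model_path if it exists; otherwise train and save it."""
    if os.path.exists(model_path):
        with open(model_path, 'rb') as f:
            return pickle.load(f), 'loaded'
    model = train_fn()
    # Create the target directory first so that saving cannot fail on a missing folder
    os.makedirs(os.path.dirname(model_path) or '.', exist_ok=True)
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)
    return model, 'trained'


def fake_training():
    # Stand-in for an expensive training run
    return {'weights': [0.1, 0.2]}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'models', 'demo.pkl')
    print(load_or_train(path, fake_training)[1])  # prints: trained
    print(load_or_train(path, fake_training)[1])  # prints: loaded
```

One consequence of this design is that a stale model file is reused silently: after changing the features or the algorithm's parameters, delete the saved model so that it is trained again.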

Object persistence and loading are used in several places, so these helpers are collected in a tools module. Create tools.py with the following content:

import os
import pickle


def readfile(filepath, encoding='utf-8'):
    # Read a text file
    with open(filepath, "rt", encoding=encoding) as fp:
        content = fp.read()
    return content


def savefile(savepath, content, encoding='utf-8'):
    # Save a text file
    with open(savepath, "wt", encoding=encoding) as fp:
        fp.write(content)


def writeObject(path, obj):
    # Persist a Python object
    with open(path, "wb") as file_obj:
        pickle.dump(obj, file_obj)


def readObject(path):
    # Load a Python object
    with open(path, "rb") as file_obj:
        obj = pickle.load(file_obj)
    return obj


def check_dir_exist(path):
    # Create the directory if it does not exist
    os.makedirs(path, exist_ok=True)


def stopWords(stopword_path="baidu_stopwords.txt"):
    # Return the stop-word list; the Baidu stop-word list is the default
    return [wd.strip() for wd in readfile(stopword_path).splitlines()]

For step 5, create a Python file named t5_classifier.py with the following content:

"""
Text classification.
Trains a model on the persisted TF-IDF features (or loads an already trained one),
validates it on the test set, and predicts the category of new text.
"""
import os
import jieba
import joblib
from sklearn import metrics
# tools.py is expected to live in a package named text
import text.tools as tools
from sklearn.feature_extraction.text import TfidfVectorizer


# Text classifier
class TextClassifier:
    def __init__(self, clf_model, data_dir, model_path):
        """
        Classifier
        :param clf_model:   classification algorithm
        :param data_dir:    directory of the feature data
        :param model_path:  path where the model is saved
        """
        self.data_dir = data_dir
        self.model_path = model_path
        self.train_data = os.path.join(data_dir, 'train_tfidf.data')
        self.test_data = os.path.join(data_dir, 'test_tfidf.data')
        self.vocabulary_data = os.path.join(data_dir, 'vocabulary.data')
        self.clf = self._load_clf_model(clf_model)

    def _load_clf_model(self, clf_model):
        '''
        Load the model; if none exists yet, train one and save it to model_path
        :param clf_model:
        :return:
        '''
        if os.path.exists(self.model_path):
            print('loading exists models...')
            return joblib.load(self.model_path)
        else:
            print('training models...')
            train_set = tools.readObject(self.train_data)
            clf = clf_model.fit(train_set.tdm, train_set.label)
            # Make sure the target directory exists before saving the model
            model_dir = os.path.dirname(self.model_path)
            if model_dir:
                os.makedirs(model_dir, exist_ok=True)
            joblib.dump(clf, self.model_path)
            return clf

    def _predict(self, tdm):
        """
        :param tdm: feature matrix
        :return:
        """
        return self.clf.predict(tdm)

    def validation(self):
        """Validate the model on the test set"""
        print('starting validation...')
        test_set = tools.readObject(self.test_data)
        predicted = self._predict(test_set.tdm)
        actual = test_set.label
        for flabel, file_name, expct_cate in zip(actual, test_set.filenames, predicted):
            if flabel != expct_cate:
                pass
                # Uncomment to list the misclassified files
                #print(file_name, ": actual:", flabel, " --> predicted:", expct_cate)
        print('Precision: {0:.3f}'.format(metrics.precision_score(actual, predicted, average='weighted')))
        print('Recall: {0:0.3f}'.format(metrics.recall_score(actual, predicted, average='weighted')))
        print('f1-score: {0:.3f}'.format(metrics.f1_score(actual, predicted, average='weighted')))

    def predict(self, text_string=None):
        '''
        Predict the category of a new text
        :param text_string: news text
        :return: predicted category
        '''
        vocabulary = tools.readObject(self.vocabulary_data)
        if text_string:
            corpus = [' '.join(jieba.cut(text_string))]
            # The fixed vocabulary keeps the columns aligned with the training matrix.
            # Note: fit_transform computes IDF from this single text, not from the training set
            vectorizer = TfidfVectorizer(vocabulary=vocabulary, stop_words=tools.stopWords())
            tdm = vectorizer.fit_transform(corpus)
            return self._predict(tdm)
        else:
            return None

Does finishing the classifier mean we can classify text automatically? Not yet: we still have to train the models.
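
One design point is worth a sketch. The predict method above builds a new TfidfVectorizer from the saved vocabulary and calls fit_transform on the single new text, so the IDF weights come from that one text rather than from the training set. A common alternative, shown here on a made-up toy corpus and not taken from the original scripts, is to keep the vectorizer fitted on the training data and call only transform on new text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, already segmented into space-separated words (made up for illustration)
train_texts = [
    '球队 赢得 比赛 冠军',
    '球员 比赛 进球',
    '手机 发布 新款 芯片',
    '电脑 芯片 性能 提升',
]
train_labels = ['sports', 'sports', 'it', 'it']

# Fit the vectorizer once, on the training data only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB().fit(X_train, train_labels)

# For new text, reuse the fitted vectorizer: transform, not fit_transform
new_text = ['新款 手机 芯片']
X_new = vectorizer.transform(new_text)
print(clf.predict(X_new)[0])
```

In a pipeline like the one in this article, the fitted vectorizer could be persisted next to the model (for example with joblib.dump) and loaded again for prediction; that would be a change to the scripts above, not something they already do.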

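6. Evaluate and validate the models

Everything is in place: with the classifier written, all that remains is to train the models, and then classification can run automatically. To compare several algorithms, we train four of them: naive Bayes, logistic regression, random forest, and support vector machine.

For step 6, create a Python file named t6_assess.py with the following content: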

# Import the algorithms from sklearn
import os
from time import time

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# Import the classifier written in step 5
from text.t5_classifier import TextClassifier

# Multinomial naive Bayes classifier
def Multinomial(data_dir, model_dir):
    clf = MultinomialNB()
    model_path = model_dir + '/NBclassifier.pkl'
    classifier = TextClassifier(clf, data_dir, model_path)
    return classifier

# Logistic regression classifier
def Logistic(data_dir, model_dir):
    clf = LogisticRegression()
    model_path = model_dir + '/LRclassifier.pkl'
    classifier = TextClassifier(clf, data_dir, model_path)
    return classifier

# Random forest classifier
def RandomForest(data_dir, model_dir):
    clf = RandomForestClassifier()
    model_path = model_dir + '/Radfclassifier.pkl'
    classifier = TextClassifier(clf, data_dir, model_path)
    return classifier

# Support vector machine classifier
def svm(data_dir, model_dir):
    clf = SVC()
    model_path = model_dir + '/SVMclassifier.pkl'
    classifier = TextClassifier(clf, data_dir, model_path)
    return classifier

if __name__ == '__main__':
    model_dir = 'models'  # directory for the trained models
    data_dir = 'feature_space'  # directory of the feature data
    os.makedirs(model_dir, exist_ok=True)
    # Record the training time and the metrics of each model
    builders = [
        ('Multinomial naive Bayes classifier', Multinomial),
        ('Logistic regression classifier', Logistic),
        ('Random forest classifier', RandomForest),
        ('Support vector machine classifier', svm),
    ]
    for name, build in builders:
        # Building the classifier trains the model (or loads it if it already exists)
        start = time()
        classifier = build(data_dir, model_dir)
        end = time()
        print('%s: %s Seconds' % (name, '{:.3f}'.format(end - start)))
        classifier.validation()

Some of the four algorithms run quickly and others take much longer. After a few minutes, the output is:

training models...
Multinomial naive Bayes classifier: 0.147 Seconds
starting validation...
Precision: 0.940
Recall: 0.935
f1-score: 0.934
training models...
Logistic regression classifier: 17.944 Seconds
starting validation...
Precision: 0.950
Recall: 0.949
f1-score: 0.949
training models...
Random forest classifier: 29.343 Seconds
starting validation...
Precision: 0.938
Recall: 0.933
f1-score: 0.932
training models...
Support vector machine classifier: 164.987 Seconds
starting validation...
Precision: 0.953
Recall: 0.952
f1-score: 0.952

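The log is awkward to compare, so the results are collected in a table.

Classifier	Training time (s)	Precision	Recall	F1-score
Naive Bayes	0.147	0.940	0.935	0.934
Logistic regression	17.944	0.950	0.949	0.949
Random forest	29.343	0.938	0.933	0.932
Support vector machine	164.987	0.953	0.952	0.952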

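As the table shows, the naive Bayes classifier trains extremely fast, in about 150 milliseconds, so it can easily handle large corpora. The support vector machine scores best but takes far too long to train, and that is with fewer than 20,000 items and every parameter left at its default.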
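
The three scores are weighted averages over the categories: each category's precision, recall, and F1 is weighted by its share of the true labels. The pure-Python sketch below shows the calculation on six made-up labels; weighted_scores is a hypothetical helper written for illustration, while the scripts above call sklearn.metrics with average='weighted'.

```python
from collections import Counter


def weighted_scores(actual, predicted):
    """Return (precision, recall, f1), each averaged with weights = class support."""
    support = Counter(actual)  # number of true items per class
    total = len(actual)
    pairs = list(zip(actual, predicted))
    precision = recall = f1 = 0.0
    for label in sorted(support):
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        p_l = tp / (tp + fp) if tp + fp else 0.0
        r_l = tp / (tp + fn) if tp + fn else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if p_l + r_l else 0.0
        weight = support[label] / total
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return precision, recall, f1


actual = ['it', 'it', 'it', 'sports', 'sports', 'health']
predicted = ['it', 'it', 'sports', 'sports', 'sports', 'it']
print('%.3f %.3f %.3f' % weighted_scores(actual, predicted))  # prints: 0.556 0.667 0.600
```

The support-weighted recall always equals plain accuracy (here 4 correct out of 6), because each class's recall is multiplied by exactly the number of items it covers.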

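Can the current models classify news from another period? To check, we take a few news items from 2021 found online and classify them with the multinomial naive Bayes classifier from t6_assess.py.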

if __name__ == '__main__':    # replace the main block of t6_assess.py with this
    model_dir = 'models'  # directory for the trained models
    data_dir = 'feature_space'  # directory of the feature data
    classifier1 = Multinomial(data_dir, model_dir)

    print("Multinomial naive Bayes classifier predicting new content")

    # News 1, actual category: sports
    text_string1 = '北京时间4月18日，2020-21赛季CBA季后赛继续进行，北京男篮迎战广东男篮。全场比赛，两队共出现85次罚球，也领到了66次犯规，赵睿也在末节犯满离场。最终，广东男篮以104-103险胜北京男篮，成功晋级4强。'
    ret1 = classifier1.predict(text_string=text_string1)
    print("News 1 category: " + ret1[0])

    # News 2, actual category: health
    text_string2 = '在日常生活当中，一定要做好预防高血压的工作。导致患上高血压的原因比较多，但与饮食不当有着非常重要的关系。'
    ret2 = classifier1.predict(text_string=text_string2)
    print("News 2 category: " + ret2[0])

    # News 3, actual category: finance and business
    text_string3 = '央视财经客户端4月17日报道，近日，《经济半小时》栏目接到了多地群众的反映，一些地方上的农田水利设施，派不上用场，要么是半拉子工程、要么就是摆设。农民着急等着水浇地，而这些国家投入资金修建的水利设施，却成了农民心头最窝火的事情。'
    ret3 = classifier1.predict(text_string=text_string3)
    print("News 3 category: " + ret3[0])

    # News 4, actual category: learning
    text_string4 = '在不少海外高校，学生在学期初选课的时候，可以在两种计分方式中选择——一种为“字母分数”，使用A到F的字母打分，另一种为“及格/不及格”，这种成绩计算方式最终只有及格和不及格两个选项。对于不同的课程和学习计划，两种计分方式各有利弊，选择适合自己的方式，对于安排学习和未来发展路线都有影响。'
    ret4 = classifier1.predict(text_string=text_string4)
    print("News 4 category: " + ret4[0])

Output:
loading exists models...
Multinomial naive Bayes classifier predicting new content
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\liudr\AppData\Local\Temp\jieba.cache
Loading model cost 0.575 seconds.
Prefix dict has been built successfully.
News 1 category: sports
News 2 category: health
News 3 category: business
News 4 category: business

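Of the four news passages found online (sports, health, finance, and learning), the multinomial naive Bayes classifier predicted three correctly; the learning item was classified as business.

The training set is news from 2012. Although that is roughly ten years old, much news vocabulary is still in use. Each category has only 2,400 training items (19,200 in total), which is not enough for broad coverage. Moreover, we did no parameter tuning; this is only a hands-on demonstration for getting started. In real production, a final model is never obtained from a single training run.

Each of the four algorithms above has parameters that can be tuned. Interested readers can try tuning them and enlarging the training set; I will not go into that here.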

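This article takes part in the Tencent Cloud self-media synchronized exposure program and is shared from a WeChat official account.

Originally published: 2023-09-21 12:00. In case of infringement, contact cloudcommunity@tencent.com for removal.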
