
In natural language processing (NLP), a fundamental question is how to convert unstructured text into a numerical form that computers can process. The bag-of-words (BoW) model is a simple, direct text representation that captures word-frequency information, but it cannot distinguish how important different words are. TF-IDF (Term Frequency-Inverse Document Frequency) was introduced to address this: by estimating how important a term is to one document within a collection, it provides a more informative numerical representation for text analysis.

TF-IDF is one of the most classic and widely used feature-extraction techniques in information retrieval and text mining. Since it was proposed in the 1970s, TF-IDF has proven its value in search-engine ranking, document classification, text clustering, similarity computation, and many other tasks. Although word embeddings and pre-trained language models have since displaced it at the state of the art, its simplicity, interpretability, and effectiveness in specific scenarios keep it an indispensable part of the NLP toolbox.

This article examines the principles, implementations, optimization strategies, and practical applications of TF-IDF vector representations. Moving from theory to practice with code examples and application cases, it covers the technique in depth and explores how it compares with, and combines with, modern text-representation methods. Whether you are new to NLP or a researcher looking to deepen your understanding of classic text representations, this article aims to offer useful knowledge and practical guidance.
The core idea of TF-IDF: a term's importance to a document is proportional to how often it occurs in that document, and inversely proportional to how often it occurs across the whole collection. This rests on two assumptions:

- The more frequently a term appears in a document, the more representative it is of that document's content (the TF side).
- The fewer documents a term appears in across the collection, the better it discriminates between documents (the IDF side).

By combining these two signals, TF-IDF assigns each term a weight that better captures the important features of a text.
Term frequency (TF) measures how often a term occurs in a document:

TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d)

Common TF variants include the raw count, a boolean (0/1) indicator, the log-scaled form 1 + log(tf), and the length-normalized form shown above.

Inverse document frequency (IDF) measures how rare a term is across the whole collection:

IDF(t, D) = log(total number of documents in D / (number of documents containing t + 1))

The +1 in the denominator avoids division by zero when a term appears in no document at all.

Common IDF variants include the smoothed form used by scikit-learn, ln((1 + N) / (1 + df)) + 1, and the probabilistic form log((N - df) / df).

The TF-IDF weight is the product of TF and IDF and expresses a term's importance to a document:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

With this formula we can assign a weight to every term in a document and represent the document as a vector in which each dimension holds one term's TF-IDF weight.
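As a quick worked example of these formulas (the counts below are made up purely for illustration):

```python
import math

# Hypothetical numbers: the term occurs 3 times in a 100-word document
# and appears in 9 of the 1000 documents in the collection.
tf = 3 / 100                    # TF(t, d) = 0.03
idf = math.log(1000 / (9 + 1))  # IDF(t, D) = ln(100) ≈ 4.61
print(tf * idf)                 # TF-IDF ≈ 0.138
```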
From a mathematical point of view, TF-IDF can be seen as an information-gain-style adjustment to the bag-of-words model. The IDF factor lowers the weight of common words and raises the weight of rare but discriminative ones, so the representation emphasizes the terms that reflect what makes a document distinctive, which in turn improves downstream text-analysis tasks.

In information-theoretic terms, IDF can be read as a measure of information gain: the higher a term's IDF, the more information it provides for telling documents apart, while TF reflects how prominent that information is in the current document. Together they let TF-IDF capture the key features of a text effectively.
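To make the information-theoretic reading concrete: the unsmoothed IDF is exactly the self-information of the event "a randomly drawn document contains term $t$", estimated from document frequencies,

$$\mathrm{IDF}(t,D) = \log\frac{N}{\mathrm{df}(t)} = -\log \hat{P}(t \in d), \qquad \hat{P}(t \in d) = \frac{\mathrm{df}(t)}{N},$$

where $N$ is the number of documents and $\mathrm{df}(t)$ the number containing $t$. Rare terms correspond to low-probability, high-information events.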
The best way to understand TF-IDF is to implement it by hand. The following Python code walks through the computation step by step:
```python
import math
import re
from collections import Counter


def preprocess_text(text):
    """Simple preprocessing: lowercase, strip punctuation, tokenize, drop stop words."""
    text = text.lower()
    # Replace any non-word characters with spaces
    text = re.sub(r'[\W]+', ' ', text)
    words = text.split()
    # Remove stop words (optional)
    stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'of'}
    return [word for word in words if word not in stop_words]


def compute_tf(text, word_set):
    """Compute term frequency (TF) for every word in the vocabulary."""
    words = preprocess_text(text)
    word_counts = Counter(words)
    total_words = len(words)
    tf_dict = {}
    for word in word_set:
        tf_dict[word] = word_counts.get(word, 0) / total_words if total_words > 0 else 0
    return tf_dict


def compute_idf(documents, word_set):
    """Compute inverse document frequency (IDF) over the collection."""
    total_docs = len(documents)
    idf_dict = {}
    # Count how many documents each word appears in
    doc_count = {word: 0 for word in word_set}
    for doc in documents:
        for word in set(preprocess_text(doc)):
            if word in doc_count:
                doc_count[word] += 1
    for word in word_set:
        idf_dict[word] = math.log(total_docs / (doc_count.get(word, 0) + 1))
    return idf_dict


def compute_tfidf(tf, idf):
    """Combine TF and IDF into TF-IDF weights."""
    return {word: tf[word] * idf[word] for word in tf}


def main():
    # Sample document collection
    documents = [
        "TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.",
        "The TF-IDF value increases proportionally to the number of times a word appears in the document.",
        "TF-IDF is often used as a weighting factor in searches of information retrieval, text mining, and user modeling."
    ]
    # Build the vocabulary
    all_words = set()
    for doc in documents:
        all_words.update(preprocess_text(doc))
    word_set = sorted(all_words)
    print("Vocabulary:", word_set)
    print("\nTF-IDF representation of each document:")
    idf_values = compute_idf(documents, word_set)
    for i, doc in enumerate(documents):
        tf_values = compute_tf(doc, word_set)
        tfidf_values = compute_tfidf(tf_values, idf_values)
        print(f"\nDocument {i + 1}:")
        # Print the top 5 nonzero TF-IDF values, sorted by weight
        sorted_tfidf = sorted(tfidf_values.items(), key=lambda x: x[1], reverse=True)
        for word, value in sorted_tfidf[:5]:
            if value > 0:
                print(f"  {word}: {value:.4f}")


if __name__ == "__main__":
    main()
```

This code shows the complete TF-IDF pipeline: preprocessing, TF computation, IDF computation, and the final TF-IDF weighting. Implementing it by hand gives a much deeper understanding of how the method works.
In practice we usually rely on existing libraries rather than reinventing the wheel. scikit-learn, a widely used Python machine-learning library, provides an efficient TF-IDF implementation:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd


def tfidf_with_sklearn():
    documents = [
        "TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.",
        "The TF-IDF value increases proportionally to the number of times a word appears in the document.",
        "TF-IDF is often used as a weighting factor in searches of information retrieval, text mining, and user modeling."
    ]
    # Initialize the TF-IDF vectorizer
    vectorizer = TfidfVectorizer(stop_words='english')
    # Fit on the documents and transform them
    tfidf_matrix = vectorizer.fit_transform(documents)
    # Retrieve the vocabulary
    feature_names = vectorizer.get_feature_names_out()
    # Convert to a DataFrame for inspection
    tfidf_df = pd.DataFrame(
        data=tfidf_matrix.toarray(),
        columns=feature_names,
        index=[f"Document {i + 1}" for i in range(len(documents))]
    )
    print("Vocabulary:", feature_names)
    print("\nTF-IDF matrix:")
    print(tfidf_df)
    # Top 5 terms of each document
    for i, doc in enumerate(documents):
        print(f"\nTop 5 terms of document {i + 1}:")
        doc_tfidf = dict(zip(feature_names, tfidf_matrix.toarray()[i]))
        sorted_words = sorted(doc_tfidf.items(), key=lambda x: x[1], reverse=True)
        for word, score in sorted_words[:5]:
            print(f"  {word}: {score:.4f}")


tfidf_with_sklearn()
```

scikit-learn's `TfidfVectorizer` exposes a rich set of configuration options:
- `stop_words`: a custom stop-word list, or a predefined stop-word set
- `ngram_range`: the range of n-grams to consider, e.g. `(1, 2)` for unigrams plus bigrams
- `max_df` / `min_df`: document-frequency thresholds that filter out overly common or overly rare terms
- `max_features`: a cap on the vocabulary size
- `use_idf`: whether to apply IDF weighting
- `smooth_idf`: whether to use the smoothed IDF computation
- `sublinear_tf`: whether to apply sublinear TF scaling (replacing tf with 1 + log(tf))
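As a quick illustration of how these options combine (the parameter values below are arbitrary choices for demonstration, not recommendations):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical configuration combining the options listed above
vectorizer = TfidfVectorizer(
    stop_words='english',  # drop common English function words
    ngram_range=(1, 2),    # unigrams and bigrams
    max_df=0.9,            # ignore terms in more than 90% of documents
    min_df=2,              # ignore terms in fewer than 2 documents
    max_features=5000,     # keep at most 5000 features
    sublinear_tf=True      # replace tf with 1 + log(tf)
)
```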
Gensim, a Python library focused on topic modeling and document-similarity analysis, also ships a TF-IDF implementation:

```python
from gensim import corpora, models
from gensim.utils import simple_preprocess


def tfidf_with_gensim():
    documents = [
        "TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.",
        "The TF-IDF value increases proportionally to the number of times a word appears in the document.",
        "TF-IDF is often used as a weighting factor in searches of information retrieval, text mining, and user modeling."
    ]
    # Preprocess (tokenize) the documents
    processed_docs = [simple_preprocess(doc) for doc in documents]
    # Build the dictionary
    dictionary = corpora.Dictionary(processed_docs)
    # Build the corpus (bag-of-words representation)
    corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
    # Train the TF-IDF model
    tfidf_model = models.TfidfModel(corpus)
    # Apply the model to get TF-IDF representations
    corpus_tfidf = tfidf_model[corpus]
    print("Vocabulary:", [(i, dictionary[i]) for i in dictionary.keys()])
    # Inspect the TF-IDF representation of each document
    for i, doc_tfidf in enumerate(corpus_tfidf):
        print(f"\nTF-IDF representation of document {i + 1}:")
        # Sort (word_id, weight) pairs by weight
        sorted_items = sorted(doc_tfidf, key=lambda x: x[1], reverse=True)
        for word_id, score in sorted_items[:5]:
            print(f"  {dictionary[word_id]}: {score:.4f}")


tfidf_with_gensim()
```

Gensim's TF-IDF implementation is particularly well suited to large document collections, because it processes data as a stream and never needs to load everything into memory at once. It also integrates seamlessly with Gensim's topic-modeling tools (LDA, LSI) for deeper text analysis.
The three implementations have different strengths:
| Implementation | Strengths | Weaknesses | Best suited for |
|---|---|---|---|
| Manual | Full control; builds understanding of the method | Slow; limited features | Learning and teaching |
| scikit-learn | Well integrated; rich options; fits ML workflows seamlessly | May run out of memory on very large data | General machine-learning tasks |
| Gensim | Memory-efficient; streaming; topic-modeling integration | More complex API; steeper learning curve | Large-scale document analysis, topic modeling |
In practice, choose the implementation based on your needs and data size. For most day-to-day tasks, scikit-learn's TfidfVectorizer offers a good balance: simple to use yet powerful.
TF-IDF quality depends heavily on text preprocessing. Good preprocessing improves feature extraction, reduces noise, and lifts downstream performance. Common steps include:

- lowercasing (for English)
- removing URLs, HTML tags, punctuation, and digits
- tokenization (word segmentation for languages such as Chinese)
- stop-word filtering
- stemming or lemmatization

The following example combines preprocessing with TF-IDF in a scikit-learn Pipeline:
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
import re
import string
from nltk.stem import WordNetLemmatizer
import nltk

# Download the required NLTK resources (once)
# nltk.download('wordnet')
# nltk.download('omw-1.4')


def advanced_preprocessing(text):
    """More thorough text preprocessing."""
    text = text.lower()
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove digits
    text = re.sub(r'\d+', '', text)
    words = text.split()
    # Stop-word filtering
    stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to',
                  'of', 'for', 'with', 'by', 'from', 'as', 'is', 'are'}
    words = [word for word in words if word not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


def create_tfidf_pipeline():
    # Preprocessing transformer
    preprocessor = FunctionTransformer(lambda X: [advanced_preprocessing(text) for text in X])
    # TF-IDF vectorizer
    tfidf_vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),  # unigrams and bigrams
        max_df=0.8,          # ignore terms in more than 80% of documents
        min_df=2,            # ignore terms in fewer than 2 documents
        max_features=1000,   # keep at most 1000 features
        sublinear_tf=True    # sublinear TF scaling
    )
    return Pipeline([
        ('preprocessing', preprocessor),
        ('tfidf', tfidf_vectorizer)
    ])


def demonstrate_pipeline():
    documents = [
        "TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in information retrieval and text mining.",
        "TF-IDF is used to evaluate how important a word is to a document in a collection or corpus.",
        "The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.",
        "TF-IDF was invented for document search and information retrieval systems and is still very useful today!",
        "Check out https://scikit-learn.org for more information on TF-IDF implementation in Python."
    ]
    # Build and fit the pipeline
    pipeline = create_tfidf_pipeline()
    tfidf_matrix = pipeline.fit_transform(documents)
    feature_names = pipeline.named_steps['tfidf'].get_feature_names_out()
    print("Vocabulary size after preprocessing:", len(feature_names))
    print("Sample features:", feature_names[:10])
    # Top terms of each document
    for i in range(len(documents)):
        doc_vector = tfidf_matrix[i].toarray()[0]
        top_indices = doc_vector.argsort()[-5:][::-1]
        print(f"\nTop 5 terms of document {i + 1}:")
        for idx in top_indices:
            print(f"  {feature_names[idx]}: {doc_vector[idx]:.4f}")


demonstrate_pipeline()
```

TF-IDF exposes several tunable parameters. The sections below discuss tuning strategies for the key ones.
The `ngram_range` parameter controls whether word combinations are considered. With `ngram_range=(1, 2)`, for example, both single words (unigrams) and two-word combinations (bigrams) are used.
```python
from sklearn.feature_extraction.text import TfidfVectorizer


def compare_ngram_ranges():
    documents = [
        "TF-IDF is a numerical statistic.",
        "It reflects the importance of a word in a document.",
        "TF-IDF is used in information retrieval and text mining."
    ]
    # Try several ngram_range settings
    for ngram_range in [(1, 1), (1, 2), (2, 2)]:
        print(f"\nngram_range={ngram_range}:")
        vectorizer = TfidfVectorizer(ngram_range=ngram_range, stop_words='english')
        vectorizer.fit_transform(documents)
        feature_names = vectorizer.get_feature_names_out()
        print(f"Vocabulary size: {len(feature_names)}")
        print(f"Sample features: {feature_names[:10]}")


compare_ngram_ranges()
```

N-grams capture word order and represent phrase meaning better; the cost is a much larger feature space, which risks the curse of dimensionality and overfitting. In practice, (1, 1) or (1, 2) is the usual default, depending on the task.
The `max_df` and `min_df` parameters filter terms whose document frequency is too high or too low:

- `max_df`: ignore terms that appear in more than this fraction (or number) of documents; useful for filtering stop words
- `min_df`: ignore terms that appear in fewer than this number (or fraction) of documents; useful for filtering rare words

```python
def compare_df_thresholds():
    documents = [
        "TF-IDF is a statistical measure.",
        "TF-IDF evaluates word importance.",
        "TF-IDF is used in search engines.",
        "Document frequency measures how common a word is.",
        "Term frequency counts occurrences in a document."
    ]
    # Try several max_df / min_df combinations
    thresholds = [
        (1.0, 1),  # no filtering
        (0.8, 1),  # drop terms present in more than 80% of documents
        (1.0, 2)   # drop terms present in fewer than 2 documents
    ]
    for max_df, min_df in thresholds:
        print(f"\nmax_df={max_df}, min_df={min_df}:")
        vectorizer = TfidfVectorizer(max_df=max_df, min_df=min_df, stop_words='english')
        vectorizer.fit_transform(documents)
        feature_names = vectorizer.get_feature_names_out()
        print(f"Vocabulary size: {len(feature_names)}")
        print(f"Features: {feature_names}")


compare_df_thresholds()
```

Sensible thresholds shrink the feature space and make models more efficient while keeping the important information. Typical choices are max_df between 0.7 and 0.9 and min_df between 2 and 5, depending on the size and nature of the dataset.
The `sublinear_tf` parameter enables sublinear TF scaling, replacing the raw TF value with 1 + log(tf):
```python
def compare_sublinear_tf():
    documents = [
        "TF-IDF TF-IDF TF-IDF is important important important for text analysis.",
        "This document contains some important terms for TF-IDF analysis."
    ]
    # Compare with and without sublinear scaling
    for sublinear_tf in [False, True]:
        print(f"\nsublinear_tf={sublinear_tf}:")
        vectorizer = TfidfVectorizer(sublinear_tf=sublinear_tf, stop_words='english')
        tfidf_matrix = vectorizer.fit_transform(documents)
        feature_names = vectorizer.get_feature_names_out()
        # Print every TF-IDF value of each document
        for i, doc in enumerate(documents):
            print(f"Document {i + 1}:")
            doc_vector = tfidf_matrix[i].toarray()[0]
            for j, word in enumerate(feature_names):
                print(f"  {word}: {doc_vector[j]:.4f}")


compare_sublinear_tf()
```

Sublinear scaling dampens the weight of very frequent terms and smooths the TF-IDF values, preventing a few high-frequency words from dominating the vector: with tf = 100, the scaled value is 1 + ln(100) ≈ 5.6 rather than 100. This is especially useful for documents with heavy word repetition.
Texts from different domains have different characteristics, so the best TF-IDF settings differ as well. Some rough, experience-based suggestions (treat them as starting points, not rules; see the sketch after this list):

- News articles: bigrams help capture named entities and phrases; a moderate max_df filters boilerplate vocabulary
- Social media and other short, noisy text: a low min_df keeps rare terms, and character n-grams tolerate misspellings
- Academic or technical text: domain-specific stop-word lists and a higher min_df suppress formulaic language
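A minimal sketch of what such per-domain starting configurations might look like (the values are illustrative assumptions, not tuned results):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative starting points per domain; tune on your own data
news_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.8, min_df=3,
                                  max_features=20000, sublinear_tf=True)
social_vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4),
                                    min_df=1)  # character n-grams tolerate typos
technical_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5,
                                       stop_words='english')
```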
Below is a TF-IDF optimization example for news text:
```python
def news_text_tfidf():
    # Sample news items (Chinese)
    news_articles = [
        "苹果公司周四宣布,将在下一季度推出新款iPhone和iPad产品,预计将提振公司股价。",
        "美联储主席表示,美国经济增长势头良好,但仍需关注通货膨胀风险。",
        "中国科学家在量子计算领域取得重大突破,成功研发出100量子比特的量子计算机。",
        "特斯拉Model Y在中国市场销量持续增长,今年前三个月同比增长超过40%。",
        "联合国气候变化大会在纽约召开,各国领导人承诺采取更积极的减排措施。"
    ]

    # Chinese-specific preprocessing (a real application should use a
    # dedicated Chinese tokenizer such as jieba, as done here)
    def chinese_preprocessing(text):
        import re
        import jieba
        # Remove punctuation and digits. Note: Python's re module does not
        # support \p{P}, so we keep only CJK and Latin characters instead.
        text = re.sub(r'[^\u4e00-\u9fa5a-zA-Z]+', ' ', text)
        # Word segmentation
        words = jieba.lcut(text)
        # Filter stop words and single characters
        stop_words = {'的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都',
                      '一', '一个', '上', '也', '很', '到', '说', '要', '去', '你'}
        words = [word for word in words if word not in stop_words and len(word) > 1]
        return ' '.join(words)

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import FunctionTransformer

    try:
        import jieba
        # Preprocessing transformer
        preprocessor = FunctionTransformer(lambda X: [chinese_preprocessing(text) for text in X])
        tfidf_vectorizer = TfidfVectorizer(
            ngram_range=(1, 2),
            max_df=0.8,
            min_df=1,
            max_features=200
        )
        pipeline = Pipeline([
            ('preprocessing', preprocessor),
            ('tfidf', tfidf_vectorizer)
        ])
        tfidf_matrix = pipeline.fit_transform(news_articles)
        feature_names = pipeline.named_steps['tfidf'].get_feature_names_out()
        print("News text TF-IDF analysis:")
        print(f"Vocabulary size: {len(feature_names)}")
        # Top terms of each article
        for i, article in enumerate(news_articles):
            doc_vector = tfidf_matrix[i].toarray()[0]
            top_indices = doc_vector.argsort()[-5:][::-1]
            print(f"\nTop 5 terms/phrases of article {i + 1}:")
            for idx in top_indices:
                print(f"  {feature_names[idx]}: {doc_vector[idx]:.4f}")
    except ImportError:
        print("Please install the jieba tokenizer first: pip install jieba")


news_text_tfidf()
```

Once documents are represented as TF-IDF vectors, various similarity measures can quantify how alike two documents are. Common choices include:
**Cosine similarity** measures the cosine of the angle between two vectors, ignoring their lengths:

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

where A · B is the dot product and ||A||, ||B|| are the Euclidean norms of the vectors.

**Euclidean distance** measures the straight-line distance between two vectors in the feature space:

euclidean_distance(A, B) = sqrt(Σ(A_i - B_i)²)

**Manhattan distance** sums the absolute differences along each coordinate axis:

manhattan_distance(A, B) = Σ|A_i - B_i|

**Jaccard similarity** measures the ratio of the intersection to the union of two sets:

jaccard_similarity(A, B) = |A ∩ B| / |A ∪ B|

For text similarity, cosine similarity is the most common choice: it is insensitive to document length, which makes documents of different sizes directly comparable.
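Before moving to scikit-learn, here is a minimal NumPy sketch of the four measures, written for readability rather than speed (in practice, prefer `sklearn.metrics.pairwise` or `scipy.spatial.distance`):

```python
import numpy as np

def cosine_similarity_np(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance_np(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan_distance_np(a, b):
    return np.sum(np.abs(a - b))

def jaccard_similarity_np(a, b):
    # Interpret nonzero entries as set membership
    a_set, b_set = a > 0, b > 0
    return (a_set & b_set).sum() / (a_set | b_set).sum()

a = np.array([0.2, 0.0, 0.7, 0.1])
b = np.array([0.1, 0.3, 0.6, 0.0])
print(cosine_similarity_np(a, b), euclidean_distance_np(a, b),
      manhattan_distance_np(a, b), jaccard_similarity_np(a, b))
```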
Next, we compute TF-IDF-based cosine similarity with scikit-learn:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd


def compute_document_similarity():
    # Sample documents (Chinese)
    documents = [
        "TF-IDF是一种用于信息检索与数据挖掘的常用加权技术。",
        "TF-IDF的主要思想是:如果某个词或短语在一篇文章中出现的频率高,并且在其他文章中很少出现,则认为此词或短语具有很好的类别区分能力。",
        "文本相似度计算是自然语言处理中的重要任务,可以用于文本聚类、信息检索等领域。",
        "余弦相似度是计算文档向量相似度的常用方法,它度量的是两个向量之间的夹角,而不是它们的长度。",
        "自然语言处理是计算机科学、人工智能和语言学领域的交叉学科,研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。"
    ]

    # Fallback tokenizer: split into single characters (a real application
    # should use a proper Chinese tokenizer such as jieba)
    def simple_chinese_tokenizer(text):
        return list(text)

    try:
        # Prefer jieba word segmentation if available
        import jieba

        def chinese_tokenizer(text):
            return jieba.lcut(text)

        vectorizer = TfidfVectorizer(tokenizer=chinese_tokenizer, stop_words=None)
    except ImportError:
        print("Falling back to character-level tokens; install jieba for better segmentation")
        vectorizer = TfidfVectorizer(tokenizer=simple_chinese_tokenizer, stop_words=None)

    # Compute the TF-IDF matrix and the cosine-similarity matrix
    tfidf_matrix = vectorizer.fit_transform(documents)
    similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

    # Convert to a DataFrame for inspection
    similarity_df = pd.DataFrame(
        similarity_matrix,
        index=[f"Doc {i + 1}" for i in range(len(documents))],
        columns=[f"Doc {i + 1}" for i in range(len(documents))]
    )
    print("Document similarity matrix (cosine):")
    print(similarity_df.round(4))

    # Find the most similar document for each document
    print("\nMost similar document for each document:")
    for i in range(len(documents)):
        similarities = similarity_matrix[i].copy()
        similarities[i] = -1  # exclude self
        most_similar_idx = similarities.argmax()
        max_similarity = similarities[most_similar_idx]
        print(f"Doc {i + 1} is most similar to doc {most_similar_idx + 1} "
              f"(similarity {max_similarity:.4f})")

    # Similarity between a query and the documents
    def compute_query_similarity(query, documents, vectorizer, tfidf_matrix):
        query_vector = vectorizer.transform([query])
        query_similarities = cosine_similarity(query_vector, tfidf_matrix)[0]
        results = [(i + 1, query_similarities[i], documents[i])
                   for i in range(len(documents))]
        results.sort(key=lambda x: x[1], reverse=True)
        return results

    query = "TF-IDF向量和文本相似度计算"
    print(f"\nQuery: {query}")
    print("Documents ranked by similarity:")
    for doc_id, similarity, doc_text in compute_query_similarity(query, documents, vectorizer, tfidf_matrix):
        print(f"Doc {doc_id}: similarity={similarity:.4f}, text: {doc_text[:30]}...")


compute_document_similarity()
```

When the collection is large, computing all pairwise similarities directly becomes very expensive. Several techniques make large-scale similarity computation efficient:
An inverted index is a standard data structure in information-retrieval systems for quickly finding the documents that contain a given term. With it, we only compute similarities between documents that share at least one term, which cuts the work substantially.
```python
from sklearn.metrics.pairwise import cosine_similarity


def build_inverted_index(documents, vectorizer):
    """Build a simple inverted index over the TF-IDF matrix."""
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()
    # Inverted index: {word: [doc_ids]}
    inverted_index = {}
    for i, word in enumerate(feature_names):
        # Documents containing the current word
        doc_indices = tfidf_matrix[:, i].nonzero()[0]
        inverted_index[word] = doc_indices.tolist()
    return inverted_index, tfidf_matrix, feature_names


def find_similar_documents(query, documents, vectorizer, inverted_index, tfidf_matrix, top_k=5):
    """Find documents similar to a query using the inverted index."""
    query_vector = vectorizer.transform([query])
    # Terms that occur in the query
    query_words = vectorizer.get_feature_names_out()[query_vector.nonzero()[1]]
    # Collect candidate documents that share at least one term
    candidate_docs = set()
    for word in query_words:
        if word in inverted_index:
            candidate_docs.update(inverted_index[word])
    if not candidate_docs:
        return []
    # Score only the candidates
    candidate_docs_list = list(candidate_docs)
    candidate_tfidf = tfidf_matrix[candidate_docs_list]
    similarities = cosine_similarity(query_vector, candidate_tfidf)[0]
    results = [(candidate_docs_list[i], similarities[i], documents[candidate_docs_list[i]])
               for i in range(len(candidate_docs_list))]
    results.sort(key=lambda x: x[1], reverse=True)
    return results[:top_k]
```

Dimensionality-reduction techniques (e.g., PCA, SVD) project high-dimensional TF-IDF vectors into a low-dimensional space, reducing the computational cost:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity


def build_dimension_reduction_pipeline():
    """TF-IDF pipeline with dimensionality reduction."""
    return Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english')),
        ('svd', TruncatedSVD(n_components=100))  # project to 100 dimensions
    ])


def compute_similarity_with_dimension_reduction(documents, queries):
    """Compute document similarity in the reduced space."""
    pipeline = build_dimension_reduction_pipeline()
    reduced_matrix = pipeline.fit_transform(documents)
    query_vectors = pipeline.transform(queries)
    return cosine_similarity(query_vectors, reduced_matrix)
```

Locality-sensitive hashing (LSH) is a data structure for fast approximate nearest-neighbor lookup. It maps similar vectors into the same hash buckets, drastically reducing how many vectors need to be compared:
```python
# Requires the lshash package: pip install lshash
def find_similar_documents_with_lsh(documents, query, n_neighbors=5):
    """Find similar documents with locality-sensitive hashing."""
    try:
        from lshash import LSHash

        # Compute TF-IDF vectors
        vectorizer = TfidfVectorizer(stop_words='english')
        tfidf_matrix = vectorizer.fit_transform(documents)
        feature_names = vectorizer.get_feature_names_out()
        dense_vectors = tfidf_matrix.toarray()

        # Create the LSH index; LSHash takes (hash_size, input_dim, ...),
        # where hash_size=10 here is an arbitrary choice
        dimensions = len(feature_names)
        lsh = LSHash(hash_size=10, input_dim=dimensions, num_hashtables=5)

        # Index every document vector
        for i, vector in enumerate(dense_vectors):
            lsh.index(vector, extra_data={'doc_id': i, 'text': documents[i]})

        # Transform the query and search
        query_vector = vectorizer.transform([query]).toarray()[0]
        results = lsh.query(query_vector, num_results=n_neighbors)

        # Each result is ((vector, extra_data), distance)
        similar_docs = []
        for res in results:
            similar_docs.append({
                'doc_id': res[0][1]['doc_id'],
                'text': res[0][1]['text'],
                'distance': res[1]
            })
        return similar_docs
    except ImportError:
        print("Please install lshash: pip install lshash")
        return []
```

The following example builds a simple question-answering system from TF-IDF and cosine similarity:
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re


def build_qa_system():
    # Small sample QA knowledge base
    qa_pairs = [
        {"question": "What is TF-IDF?",
         "answer": "TF-IDF is a statistic that estimates how important a term is to one document within a collection or corpus."},
        {"question": "What is the TF-IDF formula?",
         "answer": "TF-IDF = term frequency (TF) x inverse document frequency (IDF), where TF is the term's frequency in the document and IDF is log(total documents / documents containing the term)."},
        {"question": "What is TF-IDF used for?",
         "answer": "TF-IDF is widely used in information retrieval, text classification, document clustering, keyword extraction, and other NLP tasks."},
        {"question": "How do you compute text similarity?",
         "answer": "A common approach converts texts to TF-IDF vectors and compares them with cosine similarity."},
        {"question": "How does TF-IDF differ from word embeddings?",
         "answer": "TF-IDF is a frequency-based method with no notion of semantics, while word embeddings capture semantic relations in dense low-dimensional vectors."}
    ]

    # Split questions and answers
    questions = [pair["question"] for pair in qa_pairs]
    answers = [pair["answer"] for pair in qa_pairs]

    def preprocess(text):
        text = text.lower()
        text = re.sub(r'[\W]+', ' ', text)  # strip punctuation
        return text

    # Vectorize the knowledge-base questions
    processed_questions = [preprocess(q) for q in questions]
    vectorizer = TfidfVectorizer()
    question_vectors = vectorizer.fit_transform(processed_questions)

    def answer_question(query, threshold=0.3):
        processed_query = preprocess(query)
        query_vector = vectorizer.transform([processed_query])
        # Similarity to every stored question
        similarities = cosine_similarity(query_vector, question_vectors)[0]
        max_idx = np.argmax(similarities)
        max_similarity = similarities[max_idx]
        # Below the threshold, admit we don't know
        if max_similarity < threshold:
            return f"Sorry, I cannot answer that. (best similarity: {max_similarity:.4f})"
        return {
            "question": questions[max_idx],
            "answer": answers[max_idx],
            "similarity": max_similarity
        }

    # Try the QA system
    test_queries = [
        "How is TF-IDF defined?",
        "How can I compute similarity between documents?",
        "In which scenarios is TF-IDF applied?",
        "What text representation methods exist in deep learning?"
    ]
    print("QA system test:")
    for query in test_queries:
        result = answer_question(query)
        print(f"\nQuery: {query}")
        if isinstance(result, dict):
            print(f"Closest question: {result['question']}")
            print(f"Answer: {result['answer']}")
            print(f"Similarity: {result['similarity']:.4f}")
        else:
            print(result)


build_qa_system()
```

TF-IDF is a common feature representation for text classification, especially with classic machine-learning algorithms. The following example uses TF-IDF features for classification:
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np


def text_classification_with_tfidf():
    # Load a subset of categories to keep the run small
    categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
    newsgroups = fetch_20newsgroups(subset='all', categories=categories,
                                    remove=('headers', 'footers', 'quotes'))
    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        newsgroups.data, newsgroups.target, test_size=0.3, random_state=42
    )
    print(f"Dataset size: {len(newsgroups.data)}")
    print(f"Categories: {list(newsgroups.target_names)}")

    # Classification pipeline: TF-IDF features + naive Bayes
    text_clf = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english', max_df=0.7)),
        ('clf', MultinomialNB())
    ])
    text_clf.fit(X_train, y_train)
    y_pred = text_clf.predict(X_test)

    # Evaluation
    accuracy = accuracy_score(y_test, y_pred)
    print(f"\nClassification accuracy: {accuracy:.4f}")
    print("\nClassification report:")
    print(classification_report(y_test, y_pred, target_names=newsgroups.target_names))

    # Most informative features per class (MultinomialNB's coef_ was removed
    # in recent scikit-learn; feature_log_prob_ gives the same ordering)
    feature_names = text_clf.named_steps['tfidf'].get_feature_names_out()
    for i, category in enumerate(newsgroups.target_names):
        top_features = np.argsort(text_clf.named_steps['clf'].feature_log_prob_[i])[-10:][::-1]
        print(f"\nTop 10 features for class '{category}':")
        for idx in top_features:
            print(f"  {feature_names[idx]}")


text_classification_with_tfidf()
```

TF-IDF is also a common basis for text clustering, grouping similar documents automatically:
```python
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import numpy as np


def text_clustering_with_tfidf():
    categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
    newsgroups = fetch_20newsgroups(subset='all', categories=categories,
                                    remove=('headers', 'footers', 'quotes'))
    print(f"Dataset size: {len(newsgroups.data)}")
    print(f"True categories: {list(newsgroups.target_names)}")

    # Clustering pipeline (with 2-D reduction for visualization)
    clustering_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english', max_df=0.7)),
        ('svd', TruncatedSVD(n_components=2)),
        ('kmeans', KMeans(n_clusters=4, random_state=42))
    ])
    labels = clustering_pipeline.fit_predict(newsgroups.data)

    # Recover the 2-D coordinates for plotting
    svd_transformer = clustering_pipeline.named_steps['svd']
    tfidf_vectorizer = clustering_pipeline.named_steps['tfidf']
    tfidf_matrix = tfidf_vectorizer.transform(newsgroups.data)
    reduced_matrix = svd_transformer.transform(tfidf_matrix)

    # Visualize the clustering result
    plt.figure(figsize=(10, 6))
    scatter = plt.scatter(reduced_matrix[:, 0], reduced_matrix[:, 1], c=labels, cmap='viridis')
    plt.colorbar(scatter, label='Cluster label')
    plt.title('TF-IDF text clustering (reduced to 2 dimensions)')
    plt.xlabel('SVD Component 1')
    plt.ylabel('SVD Component 2')
    plt.savefig('text_clustering.png')
    print("Clustering visualization saved to text_clustering.png")

    # Top terms per cluster. Note: the cluster centers live in the 2-D SVD
    # space, so we map them back to term space before ranking terms.
    feature_names = tfidf_vectorizer.get_feature_names_out()
    kmeans = clustering_pipeline.named_steps['kmeans']
    term_centers = svd_transformer.inverse_transform(kmeans.cluster_centers_)
    print("\nTop 10 terms of each cluster:")
    for i in range(4):
        top_features = np.argsort(term_centers[i])[-10:][::-1]
        print(f"\nCluster {i}:")
        for idx in top_features:
            print(f"  {feature_names[idx]}")


text_clustering_with_tfidf()
```

TF-IDF can also extract keywords from documents automatically:
```python
def extract_keywords_with_tfidf(documents, top_n=5):
    """Extract keywords from each document using TF-IDF."""
    vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8, min_df=1)
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()
    keywords_list = []
    for i in range(len(documents)):
        doc_vector = tfidf_matrix[i].toarray()[0]
        # Top-N terms by weight
        top_indices = doc_vector.argsort()[-top_n:][::-1]
        keywords = [(feature_names[idx], doc_vector[idx]) for idx in top_indices]
        keywords_list.append(keywords)
    return keywords_list


def demonstrate_keyword_extraction():
    # Sample documents (Chinese)
    documents = [
        "自然语言处理是人工智能领域的重要分支,研究如何使计算机能够理解和生成人类语言。",
        "TF-IDF是一种常用的文本特征提取方法,广泛应用于信息检索和数据挖掘领域。",
        "机器学习算法包括监督学习、无监督学习和强化学习等多种类型。",
        "深度学习在图像识别、语音识别和自然语言处理等领域取得了突破性进展。"
    ]
    try:
        import jieba

        def chinese_tokenizer(text):
            return jieba.lcut(text)

        from sklearn.feature_extraction.text import TfidfVectorizer
        vectorizer = TfidfVectorizer(tokenizer=chinese_tokenizer)
        tfidf_matrix = vectorizer.fit_transform(documents)
        feature_names = vectorizer.get_feature_names_out()

        print("Keywords per document:")
        for i, doc in enumerate(documents):
            doc_vector = tfidf_matrix[i].toarray()[0]
            top_indices = doc_vector.argsort()[-5:][::-1]
            keywords = [(feature_names[idx], doc_vector[idx]) for idx in top_indices]
            print(f"\nDocument {i + 1}: {doc}")
            print("Keywords:")
            for word, score in keywords:
                print(f"  {word}: {score:.4f}")
    except ImportError:
        print("Install jieba for Chinese keyword extraction: pip install jieba")
        # Fall back to an English example
        english_docs = [
            "Natural language processing is an important branch of artificial intelligence that studies how computers can understand and generate human language.",
            "TF-IDF is a commonly used text feature extraction method widely applied in information retrieval and data mining.",
            "Machine learning algorithms include supervised learning, unsupervised learning, and reinforcement learning.",
            "Deep learning has made breakthrough progress in image recognition, speech recognition, and natural language processing."
        ]
        keywords_list = extract_keywords_with_tfidf(english_docs)
        print("Keywords of the English documents:")
        for i, (doc, keywords) in enumerate(zip(english_docs, keywords_list)):
            print(f"\nDocument {i + 1}: {doc[:50]}...")
            print("Keywords:")
            for word, score in keywords:
                print(f"  {word}: {score:.4f}")


demonstrate_keyword_extraction()
```

TF-IDF also powers content-based recommendation systems, matching content to a user's interests by similarity:
```python
def content_based_recommendation(user_preferences, documents, top_n=3):
    """Simple content-based recommender."""
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Fit user preferences and documents in one shared vocabulary space
    all_texts = [user_preferences] + documents
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(all_texts)

    # First row is the user vector, the rest are document vectors
    user_vector = tfidf_matrix[0]
    doc_vectors = tfidf_matrix[1:]
    similarities = cosine_similarity(user_vector, doc_vectors)[0]

    # Rank documents by similarity to the user profile
    top_indices = similarities.argsort()[-top_n:][::-1]
    return [(i + 1, similarities[i], documents[i]) for i in top_indices]


def demonstrate_recommendation():
    # Sample user profile and article collection (Chinese)
    user_preference = "我对人工智能、机器学习和自然语言处理技术特别感兴趣,尤其是深度学习在NLP领域的应用。"
    articles = [
        "深度学习框架对比:TensorFlow、PyTorch和JAX的优缺点分析。本文详细比较了主流深度学习框架的性能和适用场景。",
        "自然语言处理最新进展:预训练语言模型如BERT、GPT的工作原理和应用案例。",
        "计算机视觉技术在自动驾驶中的应用,包括目标检测、语义分割和路径规划等核心技术。",
        "推荐系统算法详解:协同过滤、矩阵分解和深度学习方法的实现与优化。",
        "大语言模型时代的NLP应用开发:从提示工程到模型微调的全流程指南。"
    ]
    try:
        import jieba
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def chinese_tokenizer(text):
            return jieba.lcut(text)

        vectorizer = TfidfVectorizer(tokenizer=chinese_tokenizer)
        all_texts = [user_preference] + articles
        tfidf_matrix = vectorizer.fit_transform(all_texts)

        # User vector vs. article vectors
        user_vector = tfidf_matrix[0]
        doc_vectors = tfidf_matrix[1:]
        similarities = cosine_similarity(user_vector, doc_vectors)[0]

        top_n = 3
        top_indices = similarities.argsort()[-top_n:][::-1]
        print("Recommended articles (highest similarity first):")
        for i, idx in enumerate(top_indices):
            print(f"\nRecommendation {i + 1}: (similarity: {similarities[idx]:.4f})")
            print(f"  {articles[idx]}")
    except ImportError:
        print("Install jieba for the Chinese demo: pip install jieba")
        # English fallback
        en_preference = "I am interested in artificial intelligence, machine learning, and natural language processing, especially deep learning applications in NLP."
        en_articles = [
            "Comparison of deep learning frameworks: TensorFlow, PyTorch, and JAX pros and cons.",
            "Recent advances in NLP: Working principles and applications of pre-trained language models like BERT and GPT.",
            "Computer vision techniques in autonomous driving, including object detection and semantic segmentation.",
            "Recommendation system algorithms: Collaborative filtering, matrix factorization, and deep learning methods.",
            "NLP application development in the era of large language models: From prompt engineering to fine-tuning."
        ]
        recommendations = content_based_recommendation(en_preference, en_articles)
        print("\nRecommended articles based on user preference:")
        for i, (_, similarity, article) in enumerate(recommendations):
            print(f"\nRecommendation {i + 1}: (Similarity: {similarity:.4f})")
            print(f"  {article}")


demonstrate_recommendation()
```

Word embeddings (e.g., Word2Vec, GloVe) and TF-IDF are two different text representations, each with strengths and weaknesses:
| Property | TF-IDF | Word embeddings |
|---|---|---|
| Representation | Sparse vectors based on term counts | Dense vectors based on context co-occurrence |
| Semantics | Captures almost no semantic relations | Captures semantic and syntactic relations well |
| Word order | Ignored (bag-of-words) | Partially captured (e.g., Word2Vec SGNS) |
| Computational cost | Low, linear | Higher; requires pre-training |
| Interpretability | Strong; weights have a clear meaning | Weak; dimensions have no direct interpretation |
| Memory footprint | High (sparse matrix) | Moderate (dense vectors) |
| Out-of-vocabulary words | Poor; unseen words are dropped | Also poor; needs extra OOV handling |
| Domain adaptation | Works out of the box | Usually needs in-domain fine-tuning |
Pre-trained language models (e.g., BERT, GPT) represent the current state of the art in text representation and offer clear advantages over TF-IDF: contextualized word meanings (the same word gets different vectors in different sentences), modeling of word order and long-range dependencies, and strong transfer learning from large corpora.

Of course, pre-trained models have their own limitations: heavy compute requirements, slow inference, and poor interpretability. In resource-constrained settings, or when interpretability matters, TF-IDF remains valuable.

Even with more advanced representations available, TF-IDF can be combined with them to complementary effect: concatenating TF-IDF features with embedding features, using TF-IDF weights to average word vectors into document vectors, or pairing TF-IDF's exact matching with a neural model's semantic matching in hybrid retrieval.

Here is a simple feature-fusion example:
```python
def feature_fusion_example():
    """Fuse TF-IDF features with Word2Vec document vectors."""
    try:
        from sklearn.feature_extraction.text import TfidfVectorizer
        from gensim.models import Word2Vec
        from sklearn.preprocessing import StandardScaler
        import numpy as np

        documents = [
            "TF-IDF is a traditional text representation method.",
            "Word embeddings capture semantic relationships between words.",
            "Modern NLP models like BERT have revolutionized text understanding.",
            "Text classification can benefit from combining multiple feature types."
        ]
        # Preprocessing and tokenization
        tokenized_docs = [doc.lower().split() for doc in documents]

        # TF-IDF features
        tfidf_vectorizer = TfidfVectorizer()
        tfidf_features = tfidf_vectorizer.fit_transform(documents)
        print(f"TF-IDF feature dimension: {tfidf_features.shape[1]}")

        # Word2Vec features: document vector = mean of its word vectors
        word2vec_model = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1)

        def doc_to_vec(tokens, model):
            word_vectors = [model.wv[word] for word in tokens if word in model.wv]
            if not word_vectors:
                return np.zeros(model.vector_size)
            return np.mean(word_vectors, axis=0)

        word2vec_features = np.array([doc_to_vec(tokens, word2vec_model) for tokens in tokenized_docs])
        print(f"Word2Vec feature dimension: {word2vec_features.shape[1]}")

        # Standardize the Word2Vec features
        scaler = StandardScaler()
        word2vec_features_scaled = scaler.fit_transform(word2vec_features)

        # Fuse: densify the sparse TF-IDF matrix and concatenate
        tfidf_dense = tfidf_features.toarray()
        fused_features = np.hstack([tfidf_dense, word2vec_features_scaled])
        print(f"Fused feature dimension: {fused_features.shape[1]}")

        # The fused features can feed downstream classification or clustering
        print("\nFirst 5 features of the first samples:")
        print(fused_features[:5, :5])
    except ImportError:
        print("Please install the required libraries: pip install scikit-learn gensim")


# feature_fusion_example()  # uncomment to run
```

Applying TF-IDF in multilingual settings raises specific challenges, chiefly: different tokenization needs per language (Chinese and Japanese require word segmentation), vocabulary explosion when languages are mixed, and the fact that TF-IDF vectors from different languages are not directly comparable.

Solutions include per-language preprocessing and tokenization, separate vectorizers per language, and, for cross-lingual use, machine translation into a pivot language or multilingual pre-trained representations:
```python
def multilingual_tfidf_example():
    """Multilingual TF-IDF handling."""
    try:
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import FunctionTransformer

        # Sample documents (English and Chinese)
        documents = {
            'en': [
                "TF-IDF is a numerical statistic used in information retrieval.",
                "Natural language processing techniques are widely used in modern applications."
            ],
            'zh': [
                "TF-IDF是一种用于信息检索的数值统计方法。",
                "自然语言处理技术在现代应用中被广泛使用。"
            ]
        }

        # Chinese tokenization
        def chinese_tokenizer(text):
            try:
                import jieba
                return jieba.lcut(text)
            except ImportError:
                print("Warning: jieba not installed, falling back to character tokens")
                return list(text)

        # Language-specific TF-IDF factory
        def create_multilingual_tfidf(lang='en'):
            if lang == 'zh':
                # Chinese: segment first, then vectorize
                return Pipeline([
                    ('tokenizer', FunctionTransformer(
                        lambda x: [' '.join(chinese_tokenizer(doc)) for doc in x]
                    )),
                    ('tfidf', TfidfVectorizer())
                ])
            # English: built-in stop-word handling suffices
            return TfidfVectorizer(stop_words='english')

        # Process the English documents
        en_tfidf = create_multilingual_tfidf('en')
        en_features = en_tfidf.fit_transform(documents['en'])

        # Process the Chinese documents
        zh_tfidf = create_multilingual_tfidf('zh')
        zh_features = zh_tfidf.fit_transform(documents['zh'])

        print("English TF-IDF features:")
        print(en_tfidf.get_feature_names_out())
        print(en_features.toarray())
        print("\nChinese TF-IDF features:")
        print(zh_tfidf.named_steps['tfidf'].get_feature_names_out())
        print(zh_features.toarray())

        # For cross-lingual applications, consider machine-translating all
        # documents into one language, or using a multilingual pre-trained
        # model's vector representations instead.
    except Exception as e:
        print(f"Error during processing: {e}")


# multilingual_tfidf_example()  # uncomment to run
```

Large document collections pose their own challenges for TF-IDF: the vocabulary and the sparse matrix can exceed available memory, fitting becomes slow, and the vocabulary is hard to grow after the fact.

Solutions, sketched below, include incremental processing with hashing vectorizers, dimensionality reduction and compression, and distributed computation:
```python
def large_scale_tfidf():
    """Strategies for TF-IDF on large document collections."""

    # 1. Incremental processing with a hashing vectorizer
    def incremental_tfidf_example():
        from sklearn.feature_extraction.text import HashingVectorizer

        # HashingVectorizer needs no stored vocabulary, so it scales well
        vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)

        # Simulated document stream (in practice: files or a database)
        def document_stream(batch_size=1000):
            for i in range(10):  # 10 batches
                yield [f"This is document {j + i * batch_size} about TF-IDF"
                       for j in range(batch_size)]

        print("Processing a large document stream...")
        for i, batch in enumerate(document_stream()):
            batch_features = vectorizer.transform(batch)
            # In a real application, process or persist each batch here
            print(f"Batch {i + 1} processed, feature shape: {batch_features.shape}")

    # 2. Dimensionality reduction and compression
    def dimensionality_reduction_example():
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import TruncatedSVD
        from sklearn.pipeline import Pipeline

        pipeline = Pipeline([
            ('tfidf', TfidfVectorizer(stop_words='english')),
            ('svd', TruncatedSVD(n_components=100))  # reduce to 100 dimensions
        ])
        print("TF-IDF reduction pipeline created")
        # Use pipeline.fit_transform on the large collection in practice

    # 3. Distributed computation
    def distributed_computing_notes():
        print("\nDistributed TF-IDF strategies:")
        print("1. Spark MLlib's HashingTF and IDF for distributed computation")
        print("2. Dask for parallel processing")
        print("3. Chunked processing of very large vocabularies")

    print("Large-scale TF-IDF examples:")
    print("\n1. Incremental processing:")
    incremental_tfidf_example()
    print("\n2. Dimensionality reduction:")
    dimensionality_reduction_example()
    distributed_computing_notes()


# large_scale_tfidf()  # uncomment to run
```

In some applications the document collection keeps growing, so the TF-IDF vocabulary must be adjusted dynamically.

Possible strategies, sketched below, are retraining from scratch, hashing-based incremental updates, and periodic vocabulary rebuilds:
```python
def dynamic_vocabulary_update():
    """Strategies for updating the vocabulary as documents arrive."""
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Initial document collection
    initial_docs = [
        "TF-IDF is a traditional text representation method.",
        "It's widely used in information retrieval and data mining."
    ]
    # Newly arrived documents
    new_docs = [
        "Modern NLP models like BERT have introduced new approaches to text representation.",
        "Deep learning techniques have improved text analysis performance significantly."
    ]

    # Strategy 1: retrain the whole model
    def retrain_approach():
        print("\nStrategy 1: retrain the whole model")
        all_docs = initial_docs + new_docs
        vectorizer = TfidfVectorizer(stop_words='english')
        vectorizer.fit_transform(all_docs)
        print(f"Merged vocabulary size: {len(vectorizer.get_feature_names_out())}")
        print("Sample features:", vectorizer.get_feature_names_out()[:10])

    # Strategy 2: incremental updates (HashingVectorizer has no fixed vocabulary)
    def incremental_approach():
        print("\nStrategy 2: incremental updates with HashingVectorizer")
        from sklearn.feature_extraction.text import HashingVectorizer
        vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
        initial_features = vectorizer.transform(initial_docs)
        print(f"Initial feature shape: {initial_features.shape}")
        # New documents can be transformed incrementally, no refit needed
        new_features = vectorizer.transform(new_docs)
        print(f"New feature shape: {new_features.shape}")

    # Strategy 3: periodic vocabulary rebuilds
    def periodic_rebuild_strategy():
        print("\nStrategy 3: periodic vocabulary rebuilds")
        print("1. Set a rebuild trigger (e.g., 10% more documents, or a time interval)")
        print("2. Version the vocabulary and keep an update log")
        print("3. Transition smoothly to avoid abrupt model changes")

    print("Dynamic vocabulary update strategies:")
    retrain_approach()
    incremental_approach()
    periodic_rebuild_strategy()


# dynamic_vocabulary_update()  # uncomment to run
```

Short texts (tweets, comments, search queries) pose particular challenges for TF-IDF: very few terms per document, unreliable frequency statistics, and little context to work with.

The example below shows several mitigations:
```python
def short_text_tfidf():
    """TF-IDF adjustments for short texts."""
    from sklearn.feature_extraction.text import TfidfVectorizer

    short_texts = [
        "TF-IDF good!",
        "NLP techniques awesome",
        "Text mining useful tool",
        "Machine learning rocks",
        "Deep learning better?"
    ]
    print("Short-text TF-IDF examples:")

    # 1. Standard TF-IDF
    standard_vectorizer = TfidfVectorizer(min_df=1)
    standard_tfidf = standard_vectorizer.fit_transform(short_texts)
    print("\nStandard TF-IDF:")
    print("Vocabulary:", standard_vectorizer.get_feature_names_out())
    print("TF-IDF matrix:")
    print(standard_tfidf.toarray())

    # 2. Character n-grams
    char_vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 3))
    char_vectorizer.fit_transform(short_texts)
    print("\nCharacter n-gram TF-IDF:")
    print(f"Character n-gram vocabulary size: {len(char_vectorizer.get_feature_names_out())}")
    print("Sample features:", char_vectorizer.get_feature_names_out()[:10])

    # 3. Short-text-specific parameters
    short_vectorizer = TfidfVectorizer(
        min_df=1,       # keep rare words
        use_idf=False,  # for very short texts, IDF weighting may not help
        norm=None       # skip normalization
    )
    short_tfidf = short_vectorizer.fit_transform(short_texts)
    print("\nShort-text-specific TF-IDF:")
    print("Vocabulary:", short_vectorizer.get_feature_names_out())
    print("TF-IDF matrix:")
    print(short_tfidf.toarray())

    # 4. General recommendations
    print("\nShort-text TF-IDF tips:")
    print("1. Lower the min_df threshold to keep more rare words")
    print("2. Consider character n-grams to capture spelling and morphology")
    print("3. For very short texts, TF alone may work better than TF-IDF")
    print("4. Add external knowledge, e.g. pre-trained embeddings or sentiment lexicons")
    print("5. Consider dedicated short-text representations, such as short-text embeddings")


# short_text_tfidf()  # uncomment to run
```

Although TF-IDF is a classic method, researchers keep proposing improved variants that adapt it to modern NLP needs, notably BM25 and field-aware weighting, both illustrated below:
```python
def modern_tfidf_variants():
    """Modern TF-IDF variants: BM25 and field-aware weighting."""

    # BM25: a ranking function that saturates TF and normalizes by length
    def bm25_example():
        print("\nSimplified BM25 implementation:")

        def calculate_bm25(query, documents, k1=1.5, b=0.75):
            import math
            from collections import Counter

            # Average document length
            avg_doc_len = sum(len(doc.split()) for doc in documents) / len(documents)

            # Build an inverted index: term -> [(doc_id, term_count)]
            inverted_index = {}
            for i, doc in enumerate(documents):
                for term, count in Counter(doc.split()).items():
                    inverted_index.setdefault(term, []).append((i, count))

            # Accumulate BM25 scores per document
            scores = [0.0] * len(documents)
            for term in query.split():
                if term not in inverted_index:
                    continue
                df = len(inverted_index[term])  # document frequency
                idf = math.log((len(documents) - df + 0.5) / (df + 0.5) + 1)
                for doc_id, tf in inverted_index[term]:
                    doc_len = len(documents[doc_id].split())
                    # BM25 term contribution
                    score = idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
                    scores[doc_id] += score
            return scores

        documents = [
            "TF-IDF is a traditional method for text representation",
            "BM25 is an improved version of TF-IDF used in information retrieval",
            "Modern NLP models have surpassed traditional methods in many tasks"
        ]
        query = "TF-IDF information retrieval"
        scores = calculate_bm25(query, documents)
        print(f"Query: {query}")
        for i, score in enumerate(scores):
            print(f"Document {i + 1} score: {score:.4f}, text: {documents[i]}")

    # Field-aware TF-IDF: weight fields (title, body, tags) differently
    def field_aware_tfidf_example():
        print("\nField-aware TF-IDF example:")
        # Simulated multi-field documents
        documents = [
            {"title": "TF-IDF introduction", "body": "TF-IDF is a numerical statistic...", "tags": "nlp,text mining"},
            {"title": "Text representation methods", "body": "Various methods exist for text representation...", "tags": "machine learning,nlp"},
            {"title": "Information retrieval basics", "body": "Information retrieval systems help find relevant information...", "tags": "ir,search"}
        ]
        from sklearn.feature_extraction.text import TfidfVectorizer

        # Per-field weights
        field_weights = {"title": 3.0, "body": 1.0, "tags": 2.0}

        # One vectorizer per field; weights are applied afterwards
        vectorizers = {}
        field_vectors = {}
        for field, weight in field_weights.items():
            field_texts = [doc[field] for doc in documents]
            vectorizers[field] = TfidfVectorizer(stop_words='english')
            field_vectors[field] = vectorizers[field].fit_transform(field_texts)
            field_vectors[field] = field_vectors[field].multiply(weight)

        # Note: each field has its own vocabulary space; a real system needs
        # a more elaborate strategy to merge the per-field feature spaces
        print("Per-field TF-IDF weights:")
        for field, weight in field_weights.items():
            print(f"{field}: {weight}")
        print("A real application must merge the different field feature spaces")

    print("Modern TF-IDF variants:")
    bm25_example()
    field_aware_tfidf_example()


# modern_tfidf_variants()  # uncomment to run
```

Even in an NLP era dominated by deep learning, TF-IDF can be combined with neural methods to distinctive effect:
```python
def tfidf_deep_learning_integration():
    """Combining TF-IDF with deep learning."""

    # Feature enhancement: TF-IDF as auxiliary input to a neural network
    def feature_enhancement_example():
        print("\nUsing TF-IDF features to enhance deep models:")
        print("1. Feed TF-IDF vectors as extra features into a neural network")
        print("2. Use TF-IDF weights to initialize an embedding layer")
        print("3. Combine TF-IDF and embedding features for classification")

        # Conceptual skeleton (not executed here):
        # import tensorflow as tf
        # from sklearn.feature_extraction.text import TfidfVectorizer
        #
        # tfidf_vectorizer = TfidfVectorizer()
        # word_embedding_layer = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        #
        # # Inputs
        # tfidf_input = tf.keras.Input(shape=(tfidf_dim,))
        # text_input = tf.keras.Input(shape=(max_seq_length,))
        #
        # # Embedding path
        # embedded = word_embedding_layer(text_input)
        # lstm_output = tf.keras.layers.LSTM(64)(embedded)
        #
        # # Merge paths
        # concatenated = tf.keras.layers.concatenate([lstm_output, tfidf_input])
        # dense_output = tf.keras.layers.Dense(1)(concatenated)
        #
        # model = tf.keras.Model(inputs=[text_input, tfidf_input], outputs=dense_output)

    # Hybrid retrieval: TF-IDF recall plus neural reranking
    def hybrid_retrieval_example():
        print("\nHybrid retrieval system example:")
        print("Combine TF-IDF exact matching with BERT semantic matching")

        # Conceptual skeleton (not executed here):
        # def hybrid_search(query, top_k=10):
        #     # 1. Retrieve initial candidates with TF-IDF
        #     tfidf_results = tfidf_retriever.search(query, top_k=100)
        #
        #     # 2. Rerank the candidates with BERT
        #     candidate_docs = [doc['text'] for doc in tfidf_results]
        #     semantic_scores = bert_reranker.score(query, candidate_docs)
        #
        #     # 3. Combine the scores linearly
        #     hybrid_results = []
        #     for tfidf_doc, semantic_score in zip(tfidf_results, semantic_scores):
        #         hybrid_score = 0.3 * tfidf_doc['score'] + 0.7 * semantic_score
        #         hybrid_results.append({
        #             'doc_id': tfidf_doc['doc_id'],
        #             'text': tfidf_doc['text'],
        #             'score': hybrid_score
        #         })
        #
        #     # 4. Sort and return
        #     hybrid_results.sort(key=lambda x: x['score'], reverse=True)
        #     return hybrid_results[:top_k]

    print("TF-IDF and deep learning integration:")
    feature_enhancement_example()
    hybrid_retrieval_example()


# tfidf_deep_learning_integration()  # uncomment to run
```

Despite continuing advances in deep-learning methods, TF-IDF is likely to remain important in the following areas through 2025:
```python
def tfidf_future_trends():
    """Predicted TF-IDF application trends for 2025."""
    trends = [
        {
            "area": "Edge computing devices",
            "application": "Lightweight text analysis on resource-constrained IoT devices",
            "advantage": "Low computational cost, suitable for real-time processing",
            "prediction": "More TF-IDF implementations optimized for specific hardware"
        },
        {
            "area": "Privacy-preserving computation",
            "application": "On-device text processing that avoids uploading sensitive data",
            "advantage": "No cloud dependency; protects user privacy",
            "prediction": "Combination with federated learning for distributed TF-IDF updates"
        },
        {
            "area": "Multimodal fusion",
            "application": "Text-side base features fused with image, audio, and other modalities",
            "advantage": "Interpretable text feature representation",
            "prediction": "More cross-modal TF-IDF variants"
        },
        {
            "area": "Knowledge-graph enhancement",
            "application": "Improving TF-IDF weighting with knowledge-graph information",
            "advantage": "External knowledge raises feature quality",
            "prediction": "Knowledge-enhanced TF-IDF will do better in specialist domains"
        },
        {
            "area": "Real-time stream processing",
            "application": "Real-time analysis of social-media and news streams",
            "advantage": "Supports incremental updates with low latency",
            "prediction": "Further optimization of incremental TF-IDF algorithms"
        }
    ]
    print("Predicted TF-IDF trends for 2025:")
    for trend in trends:
        print(f"\n{trend['area']}:")
        print(f"  Application: {trend['application']}")
        print(f"  Advantage: {trend['advantage']}")
        print(f"  Prediction: {trend['prediction']}")


# tfidf_future_trends()  # uncomment to run
```

Good text preprocessing is key to TF-IDF quality. Some best practices:
```python
def text_preprocessing_best_practices():
    """Comprehensive preprocessing example."""

    def comprehensive_preprocessing(text, language='en'):
        import re

        # 1. Strip HTML tags
        text = re.sub(r'<[^>]+>', '', text)
        # 2. Strip URLs
        text = re.sub(r'https?://\S+|www\.\S+', '', text)
        # 3. Lowercase (English)
        if language == 'en':
            text = text.lower()
        # 4. Strip e-mail addresses
        text = re.sub(r'[\w\.-]+@[\w\.-]+\.\w+', '', text)
        # 5. Replace digits with a placeholder token (lowercase so the later
        #    stop-word filter can match it after punctuation stripping)
        text = re.sub(r'\d+', ' num ', text)
        # 6. Special characters: keep word characters and whitespace
        if language == 'en':
            text = re.sub(r'[^\w\s]', ' ', text)
        # (other languages may need more careful handling)

        # 7. Tokenize
        if language == 'en':
            words = text.split()
        elif language == 'zh':
            try:
                import jieba
                words = jieba.lcut(text)
            except ImportError:
                print("Warning: jieba not installed, falling back to character tokens")
                words = list(text)
        else:
            words = text.split()

        # 8. Stop-word filtering
        if language == 'en':
            stop_words = {
                'the', 'a', 'an', 'and', 'or', 'but', 'if', 'because', 'as', 'what',
                'when', 'where', 'how', 'who', 'which', 'this', 'that', 'these',
                'those', 'then', 'just', 'so', 'than', 'such', 'both', 'through',
                'about', 'for', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
                'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
                'will', 'would', 'could', 'should', 'can', 'may', 'might', 'must',
                'shall', 'to', 'from', 'by', 'on', 'in', 'at', 'with',
                'against', 'between', 'into', 'during', 'before',
                'after', 'above', 'below', 'of', 'off', 'over', 'under', 'again',
                'further', 'once', 'here', 'there', 'all', 'any',
                'each', 'few', 'more', 'most', 'other', 'some', 'no',
                'nor', 'not', 'only', 'own', 'same', 'too', 'very',
                'num'  # drop the digit placeholder we inserted above
            }
        elif language == 'zh':
            stop_words = {
                '的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都',
                '一', '一个', '上', '也', '很', '到', '说', '要', '去', '你', '会',
                '着', '没有', '看', '好', '自己', '这', 'num'
            }
        else:
            stop_words = set()
        words = [word for word in words if word not in stop_words and len(word) > 1]

        # 9. Lemmatization (English)
        if language == 'en':
            try:
                from nltk.stem import WordNetLemmatizer
                lemmatizer = WordNetLemmatizer()
                words = [lemmatizer.lemmatize(word) for word in words]
            except ImportError:
                print("Warning: nltk not installed, skipping lemmatization")
        return ' '.join(words)

    # Try the preprocessing on English and Chinese samples
    example_texts = {
        'en': "TF-IDF is a <strong>popular</strong> method for text analysis. Visit https://example.com for more info! Contact us at info@example.com or call 123-456-7890.",
        'zh': "TF-IDF是一种常用于文本分析的方法。请访问https://example.com获取更多信息!您可以通过info@example.com联系我们,或拨打123-456-7890。"
    }
    print("Preprocessing best-practice examples:")
    for lang, text in example_texts.items():
        print(f"\nOriginal {'English' if lang == 'en' else 'Chinese'} text:")
        print(text)
        print(f"Preprocessed {'English' if lang == 'en' else 'Chinese'} text:")
        print(comprehensive_preprocessing(text, lang))


# text_preprocessing_best_practices()  # uncomment to run
```

TF-IDF has several tunable parameters; here is a systematic tuning guide:
```python
def tfidf_parameter_tuning_guide():
    """Systematic TF-IDF parameter tuning."""
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.metrics import accuracy_score

    # Load the dataset
    categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
    newsgroups = fetch_20newsgroups(subset='all', categories=categories,
                                    remove=('headers', 'footers', 'quotes'))
    X_train, X_test, y_train, y_test = train_test_split(
        newsgroups.data, newsgroups.target, test_size=0.3, random_state=42
    )

    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english')),
        ('clf', MultinomialNB())
    ])

    # Parameter grid for exhaustive search
    param_grid = {
        'tfidf__max_df': [0.7, 0.8, 0.9],
        'tfidf__min_df': [1, 2, 3],
        'tfidf__ngram_range': [(1, 1), (1, 2)],
        'tfidf__sublinear_tf': [True, False],
        'tfidf__max_features': [None, 10000, 20000]
    }

    print("TF-IDF parameter tuning example:")
    print("Note: a full grid search can take a while; the key ideas are shown below")

    # Full grid search (uncomment to run):
    # grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)
    # grid_search.fit(X_train, y_train)
    # print("Best parameters:", grid_search.best_params_)
    # print("Best cross-validation score:", grid_search.best_score_)
    #
    # # Evaluate the best model on the test set
    # best_model = grid_search.best_estimator_
    # y_pred = best_model.predict(X_test)
    # print("Test accuracy:", accuracy_score(y_test, y_pred))

    # Compare a few hand-picked parameter combinations
    sample_params = [
        {'max_df': 0.8, 'min_df': 2, 'ngram_range': (1, 1), 'sublinear_tf': False},
        {'max_df': 0.8, 'min_df': 2, 'ngram_range': (1, 2), 'sublinear_tf': True},
        {'max_df': 0.9, 'min_df': 1, 'ngram_range': (1, 2), 'sublinear_tf': True}
    ]
    print("\nPerformance of different parameter combinations:")
    for i, params in enumerate(sample_params):
        vectorizer = TfidfVectorizer(
            stop_words='english',
            max_df=params['max_df'],
            min_df=params['min_df'],
            ngram_range=params['ngram_range'],
            sublinear_tf=params['sublinear_tf']
        )
        X_train_tfidf = vectorizer.fit_transform(X_train)
        X_test_tfidf = vectorizer.transform(X_test)

        clf = MultinomialNB()
        clf.fit(X_train_tfidf, y_train)

        train_score = accuracy_score(y_train, clf.predict(X_train_tfidf))
        test_score = accuracy_score(y_test, clf.predict(X_test_tfidf))
        print(f"\nCombination {i + 1}: {params}")
        print(f"Vocabulary size: {len(vectorizer.get_feature_names_out())}")
        print(f"Train accuracy: {train_score:.4f}")
        print(f"Test accuracy: {test_score:.4f}")


# tfidf_parameter_tuning_guide()  # uncomment to run
```

Some problems come up repeatedly when using TF-IDF; here are solutions to two common ones:
```python
def tfidf_common_issues():
    """Common TF-IDF problems and fixes."""

    # 1. Feature dimensionality too high
    def high_dimensionality_solution():
        print("\n1. Handling an overly large feature space:")
        print("  - Option 1: tune max_df and min_df")
        print("  - Option 2: cap the vocabulary with max_features")
        print("  - Option 3: apply dimensionality reduction")

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import TruncatedSVD

        # Sample documents (repetition simulates longer documents)
        documents = [
            "This is a sample document with many words. " * 10,
            "Another document with different content. " * 10,
            "Yet another document with unique terms. " * 10,
            "Document that discusses various topics and concepts. " * 10
        ]

        # Baseline TF-IDF
        vectorizer = TfidfVectorizer()
        original_tfidf = vectorizer.fit_transform(documents)
        print(f"\nOriginal feature dimension: {original_tfidf.shape[1]}")

        # Shrink the space by tuning parameters
        vectorizer_tuned = TfidfVectorizer(max_df=0.9, min_df=2, max_features=20)
        tuned_tfidf = vectorizer_tuned.fit_transform(documents)
        print(f"Feature dimension after tuning: {tuned_tfidf.shape[1]}")
        print(f"Retained features: {vectorizer_tuned.get_feature_names_out()}")

        # Reduce with truncated SVD (n_components must stay below the number
        # of documents here, since we only have 4 samples)
        svd = TruncatedSVD(n_components=3)
        reduced_tfidf = svd.fit_transform(original_tfidf)
        print(f"Feature dimension after SVD: {reduced_tfidf.shape[1]}")

    # 2. Mixed Chinese-English text
    def mixed_language_solution():
        print("\n2. Handling mixed Chinese-English text:")
        print("  - Option 1: process the Chinese and English parts separately")
        print("  - Option 2: use a multilingual tokenizer")

        # Sample mixed-language texts
        mixed_texts = [
            "TF-IDF是一种popular的文本表示方法。",
            "深度学习在NLP领域取得了significant进展。",
            "This is a 示例文档 with mixed English and Chinese."
        ]
        try:
            import jieba
            import re
            from sklearn.feature_extraction.text import TfidfVectorizer

            # Custom tokenizer for mixed-language text
            def mixed_language_tokenizer(text):
                # Runs of Chinese characters
                chinese_chars = re.findall(r'[\u4e00-\u9fa5]+', text)
                # English words and digits
                english_parts = re.findall(r'[a-zA-Z0-9]+', text)
                # Segment the Chinese runs with jieba
                chinese_tokens = []
                for char_group in chinese_chars:
                    if len(char_group) > 1:  # only segment multi-character runs
                        chinese_tokens.extend(jieba.lcut(char_group))
                    else:
                        chinese_tokens.append(char_group)
                return chinese_tokens + english_parts

            vectorizer = TfidfVectorizer(tokenizer=mixed_language_tokenizer)
            tfidf_matrix = vectorizer.fit_transform(mixed_texts)

            print(f"Mixed-text vocabulary size: {len(vectorizer.get_feature_names_out())}")
            print(f"Vocabulary: {vectorizer.get_feature_names_out()}")
            print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
        except ImportError:
            print("Please install jieba: pip install jieba")

    print("TF-IDF common problems and solutions:")
    high_dimensionality_solution()
    mixed_language_solution()


# tfidf_common_issues()  # uncomment to run
```

As a classic text-representation method, TF-IDF has been around for decades, yet its simplicity, efficiency, and interpretability keep it firmly in the modern NLP toolbox. This article covered the technique from several angles: the underlying principles and math, manual and library implementations (scikit-learn, Gensim), preprocessing and parameter tuning, similarity computation and large-scale techniques, typical applications (retrieval, QA, classification, clustering, keyword extraction, recommendation), and comparison and integration with modern representations.
Even as deep learning and pre-trained language models dominate NLP, TF-IDF will keep playing a role in the future ecosystem: as a strong, cheap baseline; in resource-constrained and latency-sensitive systems; where interpretability is required; and as the exact-match component of hybrid retrieval.

For readers who want to learn and apply TF-IDF, some suggestions: implement it by hand once to internalize the math; use mature libraries in production; tune preprocessing and parameters for your domain; and combine it with embeddings or neural rerankers when the task warrants.

TF-IDF is a relatively simple technique, but its underlying idea and practical value run deep. We hope this article helps you master it and apply it flexibly in real projects. As NLP continues to evolve, there is good reason to believe TF-IDF will keep adapting to new challenges and playing an important role in text analysis.
```python
# Summary of the complete TF-IDF application workflow
def tfidf_complete_workflow():
    """End-to-end TF-IDF workflow."""
    print("Complete TF-IDF workflow")
    print("1. Text preprocessing")
    print("2. TF-IDF feature extraction")
    print("3. Parameter tuning")
    print("4. Applying the features (classification, clustering, ...)")
    print("5. Evaluation and visualization")
    # A full implementation can combine the examples from the sections above


# tfidf_complete_workflow()  # uncomment to run the full workflow
```