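A Spark DataFrame is a distributed collection of data organized into named columns, with a rich API for processing structured and semi-structured data. TF-IDF (Term Frequency-Inverse Document Frequency) is a common text feature-extraction method that measures how important a term is to a document relative to the rest of the corpus.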
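For reference, the weighting can be written out explicitly. The form below is the smoothed variant described in the Spark MLlib feature-extraction guide, where |D| is the number of documents, DF(t, D) is the number of documents containing term t, and TF(t, d) is the count of t in document d. The symbol names are notation chosen here, not taken from the original answer.

```latex
\mathrm{IDF}(t, D) = \log\frac{|D| + 1}{\mathrm{DF}(t, D) + 1},
\qquad
\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d)\cdot \mathrm{IDF}(t, D)
```

A term that occurs in every document gets an IDF of log(1) = 0, so it contributes nothing to the resulting vector.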
To compute TF-IDF on a Spark DataFrame and output pairwise cosine similarities, follow the steps below:
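// Imports (SparkSession must be imported to build the session below)
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Row, SparkSession}
// Create (or reuse) the SparkSession
val spark = SparkSession.builder().appName("TF-IDF Example").getOrCreate()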
// Sample corpus: one short sentence per document
val sentenceData = spark.createDataFrame(Seq(
  (0, "I love Spark"),
  (1, "I love Scala"),
  (2, "I love Spark and Scala")
)).toDF("id", "sentence")
// Split each sentence into lowercase words
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
// Hash the words into fixed-size term-frequency vectors (only 20 buckets here, so collisions are possible)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
// Fit IDF on the corpus, then rescale the term frequencies into TF-IDF vectors
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
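// Pair each document id with its TF-IDF vector
val vecPairRdd = rescaledData.select("id", "features").rdd.map { case Row(id: Int, features: Vector) => (id, features) }
// Form every ordered pair of documents (n^2 pairs, including each document paired with itself)
val vecPairRddCartesian = vecPairRdd.cartesian(vecPairRdd)
// Vector has no norm method; Vectors.norm(v, p) computes the p-norm instead
import org.apache.spark.ml.linalg.Vectors
val cosineSimilarityRdd = vecPairRddCartesian.map { case ((id1, vec1), (id2, vec2)) =>
  // Vector.dot is available in Spark 3.0 and later; the result is NaN if either vector is all zeros
  val cosineSimilarity = vec1.dot(vec2) / (Vectors.norm(vec1, 2) * Vectors.norm(vec2, 2))
  (id1, id2, cosineSimilarity)
}
// collect() brings every pair to the driver, which is fine for a small example like this one
cosineSimilarityRdd.collect().foreach { case (id1, id2, cosineSimilarity) =>
  println(s"($id1, $id2) -> similarity: $cosineSimilarity")
}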
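The code above shows how to compute TF-IDF on a Spark DataFrame and output the cosine similarity between documents. In practice, adjust and extend it to fit your requirements, for example by raising the number of hash features for a larger vocabulary.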
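One detail worth illustrating is the zero-vector case: a document made up only of terms that occur in every document ends up with an all-zero TF-IDF vector, and dividing by its norm yields NaN. The sketch below shows the cosine formula on plain arrays with a guard for that case. The object name `CosineSketch` and the choice to return 0.0 for zero vectors are assumptions made for illustration, not part of the original answer.

```scala
// Illustrative helper (hypothetical, not from the original answer):
// cosine similarity on plain arrays, returning 0.0 when either vector is all zeros.
object CosineSketch {
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    // Dot product and Euclidean (L2) norms
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    // Guard against division by zero instead of producing NaN
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    val u = Array(1.0, 0.0, 2.0)
    val v = Array(2.0, 0.0, 4.0) // same direction as u, so similarity is close to 1
    val w = Array(0.0, 3.0, 0.0) // orthogonal to u, so similarity is 0
    val z = Array(0.0, 0.0, 0.0) // all-zero vector, handled by the guard
    println(cosine(u, v))
    println(cosine(u, w))
    println(cosine(u, z))
  }
}
```

In the Spark pipeline, the same function could be applied to `vec1.toArray` and `vec2.toArray` inside the `map`, at the cost of materializing sparse vectors as dense arrays.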