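I have two datasets, details and referenceDetails.

details
code  date    location  temperature
1     1-1-19  blr       30
2     1-2-18  up        33
3     1-2-18  dlh       25

referenceDetails
code  date    location
1     1-1-19  blr
2     1-2-18  up

If a code exists in the referenceDetails dataset, I want the matching record from the details dataset filtered out as valid details; otherwise it should be filtered out as invalid details.
I tried an inner join and a left_anti join, but that means joining twice. Is there a way to avoid the two joins?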
val invalidRecords = detailsDS.join(referencedetailsDS, Seq("code"), "left_anti")
val validRecords = detailsDS.join(referencedetailsDS, Seq("code"), "inner")
Valid Details
code  date    location  temperature
1     1-1-19  blr       30
2     1-2-18  up        33

Invalid Details
code  date    location  temperature
3     1-2-18  dlh       25

Posted on 2019-03-31 06:22:57
Use a single left join, then split the result on whether the reference side matched:
// toDF and the $"..." syntax need spark.implicits._ (already imported in spark-shell)
import spark.implicits._

// sample data
val details = Seq(
  (1, "1-1-19", "blr", 30),
  (2, "1-2-18", "up", 33),
  (3, "1-2-18", "dlh", 25)
).toDF("code", "date", "location", "temperature")
val refrenceDetails = Seq(
  (1, "1-1-19", "blr"),
  (2, "1-2-18", "up")
).toDF("code", "date", "location")

// one left join; rows without a match get nulls on the reference side
val joined = details.alias("d").join(refrenceDetails.alias("r"), Seq("code"), "left")
val validDetails = joined.where($"r.code".isNotNull)
val invalidDetails = joined.where($"r.code".isNull)

// display
validDetails.show(false)
invalidDetails.show(false)

Output:
+----+------+--------+-----------+------+--------+
|code|date  |location|temperature|date  |location|
+----+------+--------+-----------+------+--------+
|1   |1-1-19|blr     |30         |1-1-19|blr     |
|2   |1-2-18|up      |33         |1-2-18|up      |
+----+------+--------+-----------+------+--------+

+----+------+--------+-----------+----+--------+
|code|date  |location|temperature|date|location|
+----+------+--------+-----------+----+--------+
|3   |1-2-18|dlh     |25         |null|null    |
+----+------+--------+-----------+----+--------+

https://stackoverflow.com/questions/55434579
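
One thing to note about the output above: the matched rows also carry the reference's date and location as duplicate columns, and if the reference dataset ever held the same code twice, the left join would repeat the corresponding details rows. A possible variation, sketched here under the assumption that only code matters for the lookup, is to join against the distinct reference codes plus a marker column. The names refCodes, flagged and in_reference are made up for this illustration and are not part of the original answer. The cache() call is optional; it is there so that both filters can reuse one computed join result.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// Local session for the sketch; in spark-shell getOrCreate() returns the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("split-by-reference")
  .getOrCreate()
import spark.implicits._

val details = Seq(
  (1, "1-1-19", "blr", 30),
  (2, "1-2-18", "up", 33),
  (3, "1-2-18", "dlh", 25)
).toDF("code", "date", "location", "temperature")

val refrenceDetails = Seq(
  (1, "1-1-19", "blr"),
  (2, "1-2-18", "up")
).toDF("code", "date", "location")

// Assumption: only `code` decides validity, so keep the distinct codes
// and tag them with a hypothetical marker column.
val refCodes = refrenceDetails
  .select("code")
  .distinct()
  .withColumn("in_reference", lit(true))

// Single left join; unmatched details rows get a null marker.
val flagged = details.join(refCodes, Seq("code"), "left").cache()

// Split on the marker, then drop it so only the details columns remain.
val validDetails = flagged.where(col("in_reference").isNotNull).drop("in_reference")
val invalidDetails = flagged.where(col("in_reference").isNull).drop("in_reference")

validDetails.show(false)
invalidDetails.show(false)
```

With the sample data, validDetails should hold codes 1 and 2 and invalidDetails code 3, each with just the four details columns.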