Q: Spark DataFrame: selecting values from multiple columns based on a condition

Stack Overflow user

Asked on 2019-11-25 01:09:40

1 answer, 1.7K views, 0 followers, 0 votes

Data schema:

root
|-- id: string (nullable = true)
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)


|id|col1         |col2               |
|1 |["x","y","z"]|[123,"null","null"]|

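From the data above, I want to find the position at which x exists in col1 and extract the corresponding value from col2. (col1 and col2 are ordered: if x is at index 2 in col1, its value in col2 is also at index 2.)

Result (col1 and col2 need to be of array type):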

|id |col1 |col2 |
|1  |["x"]|[123]|

If x does not exist in col1, the result should look like this:

|id| col1    |col2 |
|1 |["null"] |["null"]|

I tried:

val df1 = df.withColumn("result",when($"col1".contains("x"),"X").otherwise("null"))

apache-spark

apache-spark-sql

1 Answer

Stack Overflow user

Answered on 2019-11-25 05:44:26

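The trick is to convert your data from awkward string columns into a more useful data structure. Once col1 and col2 are rebuilt as arrays (or as a map, as your desired output suggests), you can use Spark's built-in functions rather than a messy UDF as suggested by @baitmbarek.

First, use trim and split to convert col1 and col2 into arrays: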

scala> val df = Seq(
     |       ("1", """["x","y","z"]""","""[123,"null","null"]"""),
     |         ("2", """["a","y","z"]""","""[123,"null","null"]""")
     |     ).toDF("id","col1","col2")
df: org.apache.spark.sql.DataFrame = [id: string, col1: string ... 1 more field]

scala> val df_array = df.withColumn("col1", split(trim($"col1", "[\"]"), "\"?,\"?"))
                        .withColumn("col2", split(trim($"col2", "[\"]"), "\"?,\"?"))
df_array: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]

scala> df_array.show(false)
+---+---------+-----------------+
|id |col1     |col2             |
+---+---------+-----------------+
|1  |[x, y, z]|[123, null, null]|
|2  |[a, y, z]|[123, null, null]|
+---+---------+-----------------+


scala> df_array.printSchema
root
 |-- id: string (nullable = true)
 |-- col1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col2: array (nullable = true)
 |    |-- element: string (containsNull = true)
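
To see what the trim and split arguments do, the same two steps can be reproduced on a plain Scala string, without Spark. This is only a sketch: `stripChars` and `toParts` are hypothetical helpers written for illustration, and the snippet assumes that Spark's `split` interprets its pattern as a Java regular expression in the same way `String.split` does for these inputs.

```scala
// Hypothetical helper: remove any of `chars` from both ends of `s`,
// mirroring what trim($"col1", "[\"]") does to each value.
def stripChars(s: String, chars: String): String =
  s.dropWhile(c => chars.indexOf(c) >= 0)
    .reverse
    .dropWhile(c => chars.indexOf(c) >= 0)
    .reverse

// Hypothetical helper: strip the outer brackets and quotes, then split on a
// comma with an optional double quote on either side.
def toParts(raw: String): Seq[String] =
  stripChars(raw, "[\"]").split("\"?,\"?").toSeq

val parts1 = toParts("""["x","y","z"]""")       // Seq("x", "y", "z")
val parts2 = toParts("""[123,"null","null"]""") // Seq("123", "null", "null")
```

Note that the "null" entries are ordinary strings, not SQL nulls, and that this approach would split incorrectly if a value itself contained a comma.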

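Now you should be able to use array_position to find the index of 'x' in col1 (if it is present) and retrieve the matching entry from col2, which gives you the result you want. However, converting the two arrays into a map first should make it clearer what your code is doing: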

scala> val df_map = df_array.select(
                        $"id", 
                        map_from_entries(arrays_zip($"col1", $"col2")).as("col_map")
                        )
df_map: org.apache.spark.sql.DataFrame = [id: string, col_map: map<string,string>]

scala> df_map.show(false)
+---+--------------------------------+
|id |col_map                         |
+---+--------------------------------+
|1  |[x -> 123, y -> null, z -> null]|
|2  |[a -> 123, y -> null, z -> null]|
+---+--------------------------------+

scala> val df_final = df_map.select(
                                $"id",
                                when(isnull(element_at($"col_map", "x")), 
                                    array(lit("null")))
                                .otherwise(
                                    array(lit("x")))
                                .as("col1"),  
                                when(isnull(element_at($"col_map", "x")), 
                                    array(lit("null")))
                                .otherwise(
                                    array(element_at($"col_map", "x")))
                                .as("col2")
                                )
df_final: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]
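
The logic of the map step can be mirrored with plain Scala collections, which may make the when/otherwise branches easier to follow. This is an analogy rather than Spark code: `pick` is a hypothetical helper, `zip` plus `toMap` stands in for arrays_zip plus map_from_entries, and `Map.get` returning `None` stands in for element_at yielding null for a missing key (the behavior the answer's output relies on).

```scala
// One row's worth of data, as plain Scala sequences.
val keys   = Seq("x", "y", "z")
val values = Seq("123", "null", "null")

// Pair elements by position, then key the pairs by the first element.
val colMap: Map[String, String] = keys.zip(values).toMap

// Hypothetical helper: return the (col1, col2) pair of single-element
// sequences for `key`, or the "null" placeholders when the key is absent.
def pick(m: Map[String, String], key: String): (Seq[String], Seq[String]) =
  m.get(key) match {
    case Some(v) => (Seq(key), Seq(v))
    case None    => (Seq("null"), Seq("null"))
  }

val found   = pick(colMap, "x")
val missing = pick(Seq("a", "y", "z").zip(values).toMap, "x")
```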

scala> df_final.show
+---+------+------+
| id|  col1|  col2|
+---+------+------+
|  1|   [x]| [123]|
|  2|[null]|[null]|
+---+------+------+

scala> df_final.printSchema
root
 |-- id: string (nullable = true)
 |-- col1: array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- col2: array (nullable = false)
 |    |-- element: string (containsNull = true)
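
For comparison, the array_position route mentioned earlier might look roughly like the following. This is a minimal sketch, not the answer's own code: it assumes Spark 2.4 or later (where `array_position` and `element_at` are available), builds the array columns directly instead of parsing strings, and the names `dfArray`, `pos`, and `result` as well as the local SparkSession setup are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("array-position-sketch")
  .getOrCreate()
import spark.implicits._

// Same shape as df_array above, with the "null" placeholders kept as strings.
val dfArray = Seq(
  ("1", Seq("x", "y", "z"), Seq("123", "null", "null")),
  ("2", Seq("a", "y", "z"), Seq("123", "null", "null"))
).toDF("id", "col1", "col2")

// array_position is 1-based and returns 0 when the value is not found.
val pos = array_position($"col1", "x")

val result = dfArray.select(
  $"id",
  when(pos > 0, array(lit("x")))
    .otherwise(array(lit("null")))
    .as("col1"),
  // element_at is only reached when pos > 0; the position is cast from
  // bigint to int because array indices are integers.
  when(pos > 0, array(element_at($"col2", pos.cast("int"))))
    .otherwise(array(lit("null")))
    .as("col2")
)

result.show(false)
```

Compared with the map version, this avoids building an intermediate map but repeats the position lookup in each branch, so which one reads better is largely a matter of taste.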

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/59020264
