Counting Null and Non-Null Values in a PySpark DataFrame - Tencent Cloud Developer Community

In PySpark, the null and non-null values in a DataFrame can be counted in several ways. The sections below walk through some common approaches with examples.

Example DataFrame

First, create an example DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan, when, count

# Create the SparkSession
spark = SparkSession.builder.appName("NullValueCount").getOrCreate()

# Create the example DataFrame
data = [
    (1, "Alice", None),
    (2, None, 30),
    (3, "Bob", 25),
    (4, "Cathy", None),
    (5, None, None)
]

columns = ["id", "name", "age"]

df = spark.createDataFrame(data, columns)
df.show()

Output:

+---+-----+----+
| id| name| age|
+---+-----+----+
|  1|Alice|null|
|  2| null|  30|
|  3|  Bob|  25|
|  4|Cathy|null|
|  5| null|null|
+---+-----+----+

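Counting nulls and non-nulls per column

1. Using `isNull` and `isnan`

# NaN only occurs in float/double columns; isnan on a string column such as name
# depends on an implicit cast that may fail when ANSI mode is enabled
float_cols = {f.name for f in df.schema.fields if f.dataType.typeName() in ("float", "double")}

def is_missing(c):
    return (col(c).isNull() | isnan(col(c))) if c in float_cols else col(c).isNull()

# Count the null (or NaN) values in each column
null_counts = df.select([count(when(is_missing(c), c)).alias(c) for c in df.columns])
null_counts.show()

# Count the values in each column that are neither null nor NaN
non_null_counts = df.select([count(when(~is_missing(c), c)).alias(c) for c in df.columns])
non_null_counts.show()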

Output:

+---+----+---+
| id|name|age|
+---+----+---+
|  0|   2|  3|
+---+----+---+

+---+----+---+
| id|name|age|
+---+----+---+
|  5|   3|  2|
+---+----+---+
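
The per-column counts above can be cross-checked against the sample rows without a Spark session. The following is a minimal sketch in plain Python, reusing the same `data` and `columns` values as the example; it is only a sanity check on small data, not a substitute for the Spark computation.

```python
# Same rows as the example DataFrame, as plain Python tuples
data = [
    (1, "Alice", None),
    (2, None, 30),
    (3, "Bob", 25),
    (4, "Cathy", None),
    (5, None, None),
]
columns = ["id", "name", "age"]

# zip(*data) transposes the rows into one tuple of values per column
null_counts = {
    name: len([v for v in values if v is None])
    for name, values in zip(columns, zip(*data))
}

# Every cell is either null or non-null, so the rest of each column is non-null
non_null_counts = {name: len(data) - n for name, n in null_counts.items()}

print(null_counts)      # {'id': 0, 'name': 2, 'age': 3}
print(non_null_counts)  # {'id': 5, 'name': 3, 'age': 2}
```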

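2. Using `agg` with `sum`

# Alias the import so it does not shadow Python's built-in sum
from pyspark.sql.functions import sum as spark_sum

# Count the null values in each column: cast the Boolean to 0/1 and add it up
null_counts = df.agg(*[spark_sum(col(c).isNull().cast("int")).alias(c) for c in df.columns])
null_counts.show()

# Count the non-null values in each column
non_null_counts = df.agg(*[spark_sum(col(c).isNotNull().cast("int")).alias(c) for c in df.columns])
non_null_counts.show()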

Output:

+---+----+---+
| id|name|age|
+---+----+---+
|  0|   2|  3|
+---+----+---+

+---+----+---+
| id|name|age|
+---+----+---+
|  5|   3|  2|
+---+----+---+

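Counting nulls and non-nulls across the whole DataFrame

1. Using `rdd` and `map`

# Total number of null values in the DataFrame.
# len(...) is used per row rather than sum(...), so the code still works
# if sum has been imported from pyspark.sql.functions
total_nulls = df.rdd.map(lambda row: len([v for v in row if v is None])).sum()
print(f"Total null values: {total_nulls}")

# Total number of non-null values in the DataFrame
total_non_nulls = df.rdd.map(lambda row: len([v for v in row if v is not None])).sum()
print(f"Total non-null values: {total_non_nulls}")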

Output:

Total null values: 5
Total non-null values: 10

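Conclusion

The methods above count the null and non-null values in each column of a PySpark DataFrame, as well as across the whole DataFrame. The column-expression approaches stay inside the DataFrame API, while the `rdd` approach passes every row through Python, so choose whichever fits your needs and data size.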
