前面对开窗函数做了基础知识介绍,今天我们对其应用做详细介绍。通过开窗函数实现累积求和(累加),累积计数,累乘(累积相乘)。
有一定难度,建议先思考然后再阅读
+-----+-----------+---------+-------------+
| id | group_id | amount | c_date |
+-----+-----------+---------+-------------+
| 1 | a | 1.00 | 2025-01-01 |
| 2 | a | 1.00 | 2025-01-02 |
| 3 | a | 2.00 | 2025-01-03 |
| 4 | a | 3.00 | 2025-01-03 |
| 5 | b | -1.00 | 2025-01-01 |
| 6 | b | 1.00 | 2025-01-02 |
| 7 | b | 0.50 | 2025-01-03 |
| 8 | b | 2.00 | 2025-01-03 |
| 9 | b | 1.00 | 2025-01-03 |
+-----+-----------+---------+-------------+
建表语句及数据插入语句
CREATE TABLE t_accure
(
id BIGINT,
group_id STRING,
amount decimal(16, 2),
c_date STRING
);
INSERT INTO t_accure
VALUES (1,'a',1,'2025-01-01'),
(2,'a',1,'2025-01-02'),
(3,'a',2,'2025-01-03'),
(4,'a',3,'2025-01-03'),
(5,'b',-1,'2025-01-01'),
(6,'b',1,'2025-01-02'),
(7,'b',0.5,'2025-01-03'),
(8,'b',2,'2025-01-03'),
(9,'b',1,'2025-01-03');
题目
按照group_id进行分组,根据c_date顺序从早到晚对amount进行累积求和。
累加有两种方式,在对于排序相同的数据进行累加的时候,由于排序相同(且随机),所以出现两种方式:1.计算到当前行;2.计算到与当前排序值相同的最后一行。
select id,
group_id,
amount,
c_date,
sum(amount)
over (partition by group_id order by c_date rows between unbounded preceding and current row) as accure_add1,
sum(amount)
over (partition by group_id order by c_date range between unbounded preceding and current row) as accure_add2,
sum(amount)
over (partition by group_id order by c_date) as accure_add3
from t_accure
执行结果
+-----+-----------+---------+-------------+--------------+--------------+--------------+
| id | group_id | amount | c_date | accure_add1 | accure_add2 | accure_add3 |
+-----+-----------+---------+-------------+--------------+--------------+--------------+
| 1 | a | 1.00 | 2025-01-01 | 1.00 | 1.00 | 1.00 |
| 2 | a | 1.00 | 2025-01-02 | 2.00 | 2.00 | 2.00 |
| 3 | a | 2.00 | 2025-01-03 | 4.00 | 7.00 | 7.00 |
| 4 | a | 3.00 | 2025-01-03 | 7.00 | 7.00 | 7.00 |
| 5 | b | -1.00 | 2025-01-01 | -1.00 | -1.00 | -1.00 |
| 6 | b | 1.00 | 2025-01-02 | 0.00 | 0.00 | 0.00 |
| 7 | b | 0.50 | 2025-01-03 | 0.50 | 3.50 | 3.50 |
| 8 | b | 2.00 | 2025-01-03 | 2.50 | 3.50 | 3.50 |
| 9 | b | 1.00 | 2025-01-03 | 3.50 | 3.50 | 3.50 |
+-----+-----------+---------+-------------+--------------+--------------+--------------+
注意:
题目 1.按照group_id进行分组,根据c_date顺序从早到晚对c_date进行累积计数; 2.按照group_id进行分组,根据c_date顺序从早到晚对c_date进行累积计数,要求去重;
select id,
group_id,
amount,
c_date,
count(c_date)
over (partition by group_id order by c_date rows between unbounded preceding and current row ) as accure_count1,
count(c_date)
over (partition by group_id order by c_date range between unbounded preceding and current row ) as accure_count2,
count(c_date)
over (partition by group_id order by c_date ) as accure_count3
from t_accure
执行结果
+-----+-----------+---------+-------------+----------------+----------------+----------------+
| id | group_id | amount | c_date | accure_count1 | accure_count2 | accure_count3 |
+-----+-----------+---------+-------------+----------------+----------------+----------------+
| 1 | a | 1.00 | 2025-01-01 | 1 | 1 | 1 |
| 2 | a | 1.00 | 2025-01-02 | 2 | 2 | 2 |
| 3 | a | 2.00 | 2025-01-03 | 3 | 4 | 4 |
| 4 | a | 3.00 | 2025-01-03 | 4 | 4 | 4 |
| 5 | b | -1.00 | 2025-01-01 | 1 | 1 | 1 |
| 6 | b | 1.00 | 2025-01-02 | 2 | 2 | 2 |
| 7 | b | 0.50 | 2025-01-03 | 3 | 5 | 5 |
| 8 | b | 2.00 | 2025-01-03 | 4 | 5 | 5 |
| 9 | b | 1.00 | 2025-01-03 | 5 | 5 | 5 |
+-----+-----------+---------+-------------+----------------+----------------+----------------+
我们首先想到的是直接使用count(disitnct amount) 的方式来完成,注意:在hive中支持count(distinct amount)over() 这种方式,但是在spark中不支持这种写法.
count(distinct amount)over()方式实现
select id,
group_id,
amount,
c_date,
count(distinct c_date)
over (partition by group_id order by c_date rows between unbounded preceding and current row ) as accure_count
from t_accure
collect_set方式实现
select id,
group_id,
amount,
c_date,
size(c_date_set) as accure_count
from (select id,
group_id,
amount,
c_date,
collect_set(c_date)
over (partition by group_id order by c_date rows between unbounded preceding and current row ) as c_date_set
from t_accure) t
执行结果
+-----+-----------+---------+-------------+---------------+
| id | group_id | amount | c_date | accure_count |
+-----+-----------+---------+-------------+---------------+
| 1 | a | 1.00 | 2025-01-01 | 1 |
| 2 | a | 1.00 | 2025-01-02 | 2 |
| 3 | a | 2.00 | 2025-01-03 | 3 |
| 4 | a | 3.00 | 2025-01-03 | 3 |
| 5 | b | -1.00 | 2025-01-01 | 1 |
| 6 | b | 1.00 | 2025-01-02 | 2 |
| 7 | b | 0.50 | 2025-01-03 | 3 |
| 8 | b | 2.00 | 2025-01-03 | 3 |
| 9 | b | 1.00 | 2025-01-03 | 3 |
+-----+-----------+---------+-------------+---------------+
题目 按照group_id进行分组,根据c_date顺序从早到晚对amount进行累积相乘;
select id,
group_id,
amount,
c_date,
amount_list,
aggregate(amount_list, cast(1 as decimal(16, 2)), (amount_list, x) -> cast(amount_list * x as decimal(16,
2))) as accure_mul
from (select id,
group_id,
amount,
c_date,
collect_list(amount)
over (partition by group_id order by c_date rows between unbounded preceding and current row ) as amount_list
from t_accure) t
+-----+-----------+---------+-------------+------------------------------+-------------+
| id | group_id | amount | c_date | amount_list | accure_mul |
+-----+-----------+---------+-------------+------------------------------+-------------+
| 1 | a | 1.00 | 2025-01-01 | [1.00] | 1.00 |
| 2 | a | 1.00 | 2025-01-02 | [1.00,1.00] | 1.00 |
| 3 | a | 2.00 | 2025-01-03 | [1.00,1.00,2.00] | 2.00 |
| 4 | a | 3.00 | 2025-01-03 | [1.00,1.00,2.00,3.00] | 6.00 |
| 5 | b | -1.00 | 2025-01-01 | [-1.00] | -1.00 |
| 6 | b | 1.00 | 2025-01-02 | [-1.00,1.00] | -1.00 |
| 7 | b | 0.50 | 2025-01-03 | [-1.00,1.00,0.50] | -0.50 |
| 8 | b | 2.00 | 2025-01-03 | [-1.00,1.00,0.50,2.00] | -1.00 |
| 9 | b | 1.00 | 2025-01-03 | [-1.00,1.00,0.50,2.00,1.00] | -1.00 |
+-----+-----------+---------+-------------+------------------------------+-------------+