前往小程序,Get更优阅读体验!
立即前往
发布
社区首页 >专栏 >开窗函数 spark sql实现累加、累积计数、累乘

开窗函数 spark sql实现累加、累积计数、累乘

作者头像
数据仓库晨曦
发布2025-03-04 14:22:15
发布2025-03-04 14:22:15
8700
代码可运行
举报
文章被收录于专栏:数据仓库技术数据仓库技术
运行总次数:0
代码可运行

前面对开窗函数做了基础知识介绍,今天我们对其应用做详细介绍。通过开窗函数实现累积求和(累加),累积计数,累乘(累积相乘)。

有一定难度,建议先思考然后再阅读

1.样例数据

代码语言:javascript
代码运行次数:0
复制
+-----+-----------+---------+-------------+
| id  | group_id  | amount  |   c_date    |
+-----+-----------+---------+-------------+
| 1   | a         | 1.00    | 2025-01-01  |
| 2   | a         | 1.00    | 2025-01-02  |
| 3   | a         | 2.00    | 2025-01-03  |
| 4   | a         | 3.00    | 2025-01-03  |
| 5   | b         | -1.00   | 2025-01-01  |
| 6   | b         | 1.00    | 2025-01-02  |
| 7   | b         | 0.50    | 2025-01-03  |
| 8   | b         | 2.00    | 2025-01-03  |
| 9   | b         | 1.00    | 2025-01-03  |
+-----+-----------+---------+-------------+

建表语句及数据插入语句

代码语言:javascript
代码运行次数:0
复制
CREATE TABLE t_accure
(
    id       BIGINT,
    group_id STRING,
    amount   decimal(16, 2),
    c_date   STRING
);

INSERT INTO t_accure
VALUES (1,'a',1,'2025-01-01'),
       (2,'a',1,'2025-01-02'),
       (3,'a',2,'2025-01-03'),
       (4,'a',3,'2025-01-03'),
       (5,'b',-1,'2025-01-01'),
       (6,'b',1,'2025-01-02'),
       (7,'b',0.5,'2025-01-03'),
       (8,'b',2,'2025-01-03'),
       (9,'b',1,'2025-01-03');

2. 累加

题目

按照group_id进行分组,根据c_date顺序从早到晚对amount进行累积求和。

累加有两种方式,在对于排序相同的数据进行累加的时候,由于排序相同(且随机),所以出现两种方式:1.计算到当前行;2.计算到与当前排序值相同的最后一行。

代码语言:javascript
代码运行次数:0
复制
select id,
       group_id,
       amount,
       c_date,
       sum(amount)
           over (partition by group_id order by c_date rows between unbounded preceding and current row) as accure_add1,
        sum(amount)
           over (partition by group_id order by c_date range between unbounded preceding and current row) as accure_add2,
        sum(amount)
           over (partition by group_id order by c_date) as accure_add3
from t_accure

执行结果

代码语言:javascript
代码运行次数:0
复制
+-----+-----------+---------+-------------+--------------+--------------+--------------+
| id  | group_id  | amount  |   c_date    | accure_add1  | accure_add2  | accure_add3  |
+-----+-----------+---------+-------------+--------------+--------------+--------------+
| 1   | a         | 1.00    | 2025-01-01  | 1.00         | 1.00         | 1.00         |
| 2   | a         | 1.00    | 2025-01-02  | 2.00         | 2.00         | 2.00         |
| 3   | a         | 2.00    | 2025-01-03  | 4.00         | 7.00         | 7.00         |
| 4   | a         | 3.00    | 2025-01-03  | 7.00         | 7.00         | 7.00         |
| 5   | b         | -1.00   | 2025-01-01  | -1.00        | -1.00        | -1.00        |
| 6   | b         | 1.00    | 2025-01-02  | 0.00         | 0.00         | 0.00         |
| 7   | b         | 0.50    | 2025-01-03  | 0.50         | 3.50         | 3.50         |
| 8   | b         | 2.00    | 2025-01-03  | 2.50         | 3.50         | 3.50         |
| 9   | b         | 1.00    | 2025-01-03  | 3.50         | 3.50         | 3.50         |
+-----+-----------+---------+-------------+--------------+--------------+--------------+

注意:

  • 1.注意第3行和4行的结果;
  • 2.注意滴7,8,9行的结果;
  • 3.注意窗口框架方位省略后的结果:accure_add3

3.累积计数

题目 1.按照group_id进行分组,根据c_date顺序从早到晚对c_date进行累积计数; 2.按照group_id进行分组,根据c_date顺序从早到晚对c_date进行累积计数,要求去重;

3.1累积计数

代码语言:javascript
代码运行次数:0
复制
select id,
       group_id,
       amount,
       c_date,
       count(c_date)
             over (partition by group_id order by c_date rows between unbounded preceding and current row )  as accure_count1,
       count(c_date)
             over (partition by group_id order by c_date range between unbounded preceding and current row ) as accure_count2,
       count(c_date)
             over (partition by group_id order by c_date )                                                   as accure_count3
from t_accure

执行结果

代码语言:javascript
代码运行次数:0
复制
+-----+-----------+---------+-------------+----------------+----------------+----------------+
| id  | group_id  | amount  |   c_date    | accure_count1  | accure_count2  | accure_count3  |
+-----+-----------+---------+-------------+----------------+----------------+----------------+
| 1   | a         | 1.00    | 2025-01-01  | 1              | 1              | 1              |
| 2   | a         | 1.00    | 2025-01-02  | 2              | 2              | 2              |
| 3   | a         | 2.00    | 2025-01-03  | 3              | 4              | 4              |
| 4   | a         | 3.00    | 2025-01-03  | 4              | 4              | 4              |
| 5   | b         | -1.00   | 2025-01-01  | 1              | 1              | 1              |
| 6   | b         | 1.00    | 2025-01-02  | 2              | 2              | 2              |
| 7   | b         | 0.50    | 2025-01-03  | 3              | 5              | 5              |
| 8   | b         | 2.00    | 2025-01-03  | 4              | 5              | 5              |
| 9   | b         | 1.00    | 2025-01-03  | 5              | 5              | 5              |
+-----+-----------+---------+-------------+----------------+----------------+----------------+

3.2 累积去重计数

我们首先想到的是直接使用count(disitnct amount) 的方式来完成,注意:在hive中支持count(distinct amount)over() 这种方式,但是在spark中不支持这种写法.

count(distinct amount)over()方式实现

代码语言:javascript
代码运行次数:0
复制
select id,
       group_id,
       amount,
       c_date,
       count(distinct c_date)
           over (partition by group_id order by c_date rows between unbounded preceding and current row ) as accure_count
from t_accure

collect_set方式实现

代码语言:javascript
代码运行次数:0
复制
select id,
       group_id,
       amount,
       c_date,
       size(c_date_set) as accure_count
from (select id,
             group_id,
             amount,
             c_date,
             collect_set(c_date)
                         over (partition by group_id order by c_date rows between unbounded preceding and current row ) as c_date_set
      from t_accure) t

执行结果

代码语言:javascript
代码运行次数:0
复制
+-----+-----------+---------+-------------+---------------+
| id  | group_id  | amount  |   c_date    | accure_count  |
+-----+-----------+---------+-------------+---------------+
| 1   | a         | 1.00    | 2025-01-01  | 1             |
| 2   | a         | 1.00    | 2025-01-02  | 2             |
| 3   | a         | 2.00    | 2025-01-03  | 3             |
| 4   | a         | 3.00    | 2025-01-03  | 3             |
| 5   | b         | -1.00   | 2025-01-01  | 1             |
| 6   | b         | 1.00    | 2025-01-02  | 2             |
| 7   | b         | 0.50    | 2025-01-03  | 3             |
| 8   | b         | 2.00    | 2025-01-03  | 3             |
| 9   | b         | 1.00    | 2025-01-03  | 3             |
+-----+-----------+---------+-------------+---------------+

4.累乘(累积相乘)

题目 按照group_id进行分组,根据c_date顺序从早到晚对amount进行累积相乘;

代码语言:javascript
代码运行次数:0
复制
select id,
       group_id,
       amount,
       c_date,
       amount_list,
       aggregate(amount_list, cast(1 as decimal(16, 2)), (amount_list, x) -> cast(amount_list * x as decimal(16,
                 2))) as accure_mul
from (select id,
             group_id,
             amount,
             c_date,
             collect_list(amount)
                          over (partition by group_id order by c_date rows between unbounded preceding and current row ) as amount_list
      from t_accure) t
代码语言:javascript
代码运行次数:0
复制
+-----+-----------+---------+-------------+------------------------------+-------------+
| id  | group_id  | amount  |   c_date    |         amount_list          | accure_mul  |
+-----+-----------+---------+-------------+------------------------------+-------------+
| 1   | a         | 1.00    | 2025-01-01  | [1.00]                       | 1.00        |
| 2   | a         | 1.00    | 2025-01-02  | [1.00,1.00]                  | 1.00        |
| 3   | a         | 2.00    | 2025-01-03  | [1.00,1.00,2.00]             | 2.00        |
| 4   | a         | 3.00    | 2025-01-03  | [1.00,1.00,2.00,3.00]        | 6.00        |
| 5   | b         | -1.00   | 2025-01-01  | [-1.00]                      | -1.00       |
| 6   | b         | 1.00    | 2025-01-02  | [-1.00,1.00]                 | -1.00       |
| 7   | b         | 0.50    | 2025-01-03  | [-1.00,1.00,0.50]            | -0.50       |
| 8   | b         | 2.00    | 2025-01-03  | [-1.00,1.00,0.50,2.00]       | -1.00       |
| 9   | b         | 1.00    | 2025-01-03  | [-1.00,1.00,0.50,2.00,1.00]  | -1.00       |
+-----+-----------+---------+-------------+------------------------------+-------------+

5 涉及函数

  • sum https://sparkfunctions.com/sum
  • count https://sparkfunctions.com/count
  • collect_set https://sparkfunctions.com/collect_set
  • collect_list https://sparkfunctions.com/collect_list
  • aggregate https://sparkfunctions.com/aggregate
  • size https://sparkfunctions.com/size
本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2025-03-03,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 数据仓库技术 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 1.样例数据
  • 2. 累加
  • 3.累积计数
    • 3.1累积计数
    • 3.2 累积去重计数
  • 4.累乘(累积相乘)
  • 5 涉及函数
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档