昨天文章发出后,发现写错了所以删除文章,今天修复后重新发出来,感谢指出错误的朋友。这里还是按照原有的解题思路进行处理,细节处会给出昨天错误地方的对比。
已知有表记录了每个大厅的活动开始日期和结束日期,每个大厅可以有多个活动。请编写一个SQL查询合并在同一个大厅举行的所有重叠的活动,如果两个活动至少有一天相同,那他们就是重叠的
样例数据
+----------+-------------+-------------+
| hall_id | start_date | end_date |
+----------+-------------+-------------+
| 1 | 2023-01-13 | 2023-01-20 |
| 1 | 2023-01-14 | 2023-01-17 |
| 1 | 2023-01-14 | 2023-01-16 |
| 1 | 2023-01-18 | 2023-01-25 |
| 1 | 2023-01-20 | 2023-01-26 |
| 2 | 2022-12-09 | 2022-12-23 |
| 2 | 2022-12-13 | 2022-12-17 |
| 2 | 2022-12-20 | 2022-12-24 |
| 2 | 2022-12-25 | 2022-12-30 |
| 3 | 2022-12-01 | 2023-01-30 |
+----------+-------------+-------------+
结果
+----------+-------------+-------------+
| hall_id | start_date | end_date |
+----------+-------------+-------------+
| 1 | 2023-01-13 | 2023-01-26 |
| 2 | 2022-12-09 | 2022-12-24 |
| 2 | 2022-12-25 | 2022-12-30 |
| 3 | 2022-12-01 | 2023-01-30 |
+----------+-------------+-------------+
解释:有三个大厅。 大厅 1:
大厅 2:
我们首先按照hall_id分组,根据start_date、end_date 升序排列,按照start_date 进行了升序排列,所以当前行的start_date一定晚于前一行的start_date,对当前行的start_date 和截止到上一行的最大end_date进行比较,如果当前行的start_date 小于等于截止到前一行最大end_date 代表有交叉,可以合并,否则代表不可合并。判断出是否可以合并之后,具体操作合并就转化成类似连续问题了。
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐️ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
这里我们使用max(),按照hall_id进行分组,然后根据start_date和end_date进行排序,利用聚合函数开窗,开窗函数内有排序则聚合到当前行的特性,进行处理。其中我们这边需要聚合到当前行的上一行。
执行SQL
select
hall_id,
start_date,
end_date,
max(end_date) over (partition by hall_id order by start_date asc,end_date asc rows between unbounded preceding and 1 preceding) as max_end_date
from t_hall_event
执行结果
+----------+-------------+-------------+---------------+
| hall_id | start_date | end_date | max_end_date |
+----------+-------------+-------------+---------------+
| 1 | 2023-01-13 | 2023-01-20 | NULL |
| 1 | 2023-01-14 | 2023-01-16 | 2023-01-20 |
| 1 | 2023-01-14 | 2023-01-17 | 2023-01-20 |
| 1 | 2023-01-18 | 2023-01-25 | 2023-01-20 |
| 1 | 2023-01-20 | 2023-01-26 | 2023-01-25 |
| 2 | 2022-12-09 | 2022-12-23 | NULL |
| 2 | 2022-12-13 | 2022-12-17 | 2022-12-23 |
| 2 | 2022-12-20 | 2022-12-24 | 2022-12-23 |
| 2 | 2022-12-25 | 2022-12-30 | 2022-12-24 |
| 3 | 2022-12-01 | 2023-01-30 | NULL |
+----------+-------------+-------------+---------------+
昨天错误写法
这里先看下结果,这一步还不太明显,在下一步判断是否应该被合并时,可以方便看出结果。
执行SQL
select
hall_id,
start_date,
end_date,
lag(end_date) over (partition by hall_id order by start_date asc,end_date asc) as last_end_date
from t_hall_event
执行结果
+----------+-------------+-------------+----------------+
| hall_id | start_date | end_date | last_end_date |
+----------+-------------+-------------+----------------+
| 1 | 2023-01-13 | 2023-01-20 | NULL |
| 1 | 2023-01-14 | 2023-01-16 | 2023-01-20 |
| 1 | 2023-01-14 | 2023-01-17 | 2023-01-16 |
| 1 | 2023-01-18 | 2023-01-25 | 2023-01-17 |
| 1 | 2023-01-20 | 2023-01-26 | 2023-01-25 |
| 2 | 2022-12-09 | 2022-12-23 | NULL |
| 2 | 2022-12-13 | 2022-12-17 | 2022-12-23 |
| 2 | 2022-12-20 | 2022-12-24 | 2022-12-17 |
| 2 | 2022-12-25 | 2022-12-30 | 2022-12-24 |
| 3 | 2022-12-01 | 2023-01-30 | NULL |
+----------+-------------+-------------+----------------+
执行SQL
select hall_id,
start_date,
end_date,
max_end_date,
if(start_date <= max_end_date, 0, 1) as is_merge --0:合并,1:不合并
from (select hall_id,
start_date,
end_date,
max(end_date)
over (partition by hall_id order by start_date asc,end_date asc rows between unbounded preceding and 1 preceding) as max_end_date
from t_hall_event) t
执行结果
+----------+-------------+-------------+---------------+-----------+
| hall_id | start_date | end_date | max_end_date | is_merge |
+----------+-------------+-------------+---------------+-----------+
| 1 | 2023-01-13 | 2023-01-20 | NULL | 1 |
| 1 | 2023-01-14 | 2023-01-16 | 2023-01-20 | 0 |
| 1 | 2023-01-14 | 2023-01-17 | 2023-01-20 | 0 |
| 1 | 2023-01-18 | 2023-01-25 | 2023-01-20 | 0 |
| 1 | 2023-01-20 | 2023-01-26 | 2023-01-25 | 0 |
| 2 | 2022-12-09 | 2022-12-23 | NULL | 1 |
| 2 | 2022-12-13 | 2022-12-17 | 2022-12-23 | 0 |
| 2 | 2022-12-20 | 2022-12-24 | 2022-12-23 | 0 |
| 2 | 2022-12-25 | 2022-12-30 | 2022-12-24 | 1 |
| 3 | 2022-12-01 | 2023-01-30 | NULL | 1 |
+----------+-------------+-------------+---------------+-----------+
以下是昨天错误写法
执行SQL
select hall_id,
start_date,
end_date,
last_end_date,
if(start_date <= last_end_date, 0, 1) as is_merge --0:合并,1:不合并
from (select hall_id,
start_date,
end_date,
lag(end_date) over (partition by hall_id order by start_date asc,end_date asc) as last_end_date
from t_hall_event) t
执行结果
+----------+-------------+-------------+----------------+-----------+
| hall_id | start_date | end_date | last_end_date | is_merge |
+----------+-------------+-------------+----------------+-----------+
| 1 | 2023-01-13 | 2023-01-20 | NULL | 1 |
| 1 | 2023-01-14 | 2023-01-16 | 2023-01-20 | 0 |
| 1 | 2023-01-14 | 2023-01-17 | 2023-01-16 | 0 |
| 1 | 2023-01-18 | 2023-01-25 | 2023-01-17 | 1 |
| 1 | 2023-01-20 | 2023-01-26 | 2023-01-25 | 0 |
| 2 | 2022-12-09 | 2022-12-23 | NULL | 1 |
| 2 | 2022-12-13 | 2022-12-17 | 2022-12-23 | 0 |
| 2 | 2022-12-20 | 2022-12-24 | 2022-12-17 | 1 |
| 2 | 2022-12-25 | 2022-12-30 | 2022-12-24 | 1 |
| 3 | 2022-12-01 | 2023-01-30 | NULL | 1 |
+----------+-------------+-------------+----------------+-----------+
分析:可以看到hall_id = 1 的第4行数据,开始时间为2023-01-18 结束时间为2023-01-25的活动与第一行的活动存在交叉,所以应该被合并,但是由于中间,其前一行的活动截止日期为2023-01-17,早于该行活动的开始日期而被判断为不应该被合并,导致错判。
执行SQL
select hall_id,
start_date,
end_date,
max_end_date,
is_merge,
sum(is_merge) over (partition by hall_id order by start_date asc,end_date asc) as group_id
from (select hall_id,
start_date,
end_date,
max_end_date,
if(start_date <= max_end_date, 0, 1) as is_merge --0:合并,1:不合并
from (select hall_id,
start_date,
end_date,
max(end_date)
over (partition by hall_id order by start_date asc,end_date asc rows between unbounded preceding and 1 preceding) as max_end_date
from t_hall_event) t) tt
执行结果
+----------+-------------+-------------+---------------+-----------+-----------+
| hall_id | start_date | end_date | max_end_date | is_merge | group_id |
+----------+-------------+-------------+---------------+-----------+-----------+
| 1 | 2023-01-13 | 2023-01-20 | NULL | 1 | 1 |
| 1 | 2023-01-14 | 2023-01-16 | 2023-01-20 | 0 | 1 |
| 1 | 2023-01-14 | 2023-01-17 | 2023-01-20 | 0 | 1 |
| 1 | 2023-01-18 | 2023-01-25 | 2023-01-20 | 0 | 1 |
| 1 | 2023-01-20 | 2023-01-26 | 2023-01-25 | 0 | 1 |
| 2 | 2022-12-09 | 2022-12-23 | NULL | 1 | 1 |
| 2 | 2022-12-13 | 2022-12-17 | 2022-12-23 | 0 | 1 |
| 2 | 2022-12-20 | 2022-12-24 | 2022-12-23 | 0 | 1 |
| 2 | 2022-12-25 | 2022-12-30 | 2022-12-24 | 1 | 2 |
| 3 | 2022-12-01 | 2023-01-30 | NULL | 1 | 1 |
+----------+-------------+-------------+---------------+-----------+-----------+
取每个组内的start_day 的最小值作为活动开始日期,end_day的最大值作为活动结束日期,得到最终结果。注意分组条件为hall_id+group_id
执行SQL
select hall_id,
min(start_date) as start_date,
max(end_date) as end_date
from (select hall_id,
start_date,
end_date,
max_end_date,
is_merge,
sum(is_merge) over (partition by hall_id order by start_date asc,end_date asc) as group_id
from (select hall_id,
start_date,
end_date,
max_end_date,
if(start_date <= max_end_date, 0, 1) as is_merge --0:合并,1:不合并
from (select hall_id,
start_date,
end_date,
max(end_date)
over (partition by hall_id order by start_date asc,end_date asc rows between unbounded preceding and 1 preceding) as max_end_date
from t_hall_event) t) tt) ttt
group by hall_id, group_id --注意这里的分组,有group_id
执行结果
+----------+-------------+-------------+
| hall_id | start_date | end_date |
+----------+-------------+-------------+
| 1 | 2023-01-13 | 2023-01-26 |
| 2 | 2022-12-09 | 2022-12-24 |
| 2 | 2022-12-25 | 2022-12-30 |
| 3 | 2022-12-01 | 2023-01-30 |
+----------+-------------+-------------+
--建表语句
CREATE TABLE IF NOT EXISTS t_hall_event (
hall_id STRING, --大厅ID
start_date STRING, -- 营销活动开始日期
end_date STRING -- 营销活动结束日期
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
--数据插入
insert into t_hall_event(hall_id, start_date, end_date) values
('1','2023-01-13','2023-01-20'),
('1','2023-01-14','2023-01-17'),
('1','2023-01-14','2023-01-16'),
('1','2023-01-18','2023-01-25'),
('1','2023-01-20','2023-01-26'),
('2','2022-12-09','2022-12-23'),
('2','2022-12-13','2022-12-17'),
('2','2022-12-20','2022-12-24'),
('2','2022-12-25','2022-12-30'),
('3','2022-12-01','2023-01-30');