Requirement
Write a SQL statement that returns the top three scores in each class, with equal scores sharing the same rank,
and that also computes, for those top three ranks, the score gap between consecutive ranks.
Table DDL
create table student(
sid string,    -- student id
cid string,    -- class id
score string   -- score
)
row format delimited fields terminated by '\t'
;
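One caveat: score is declared as string, so order by score desc inside the windows compares strings lexicographically. With the two-digit sample scores this happens to coincide with numeric order, but a score such as "100" would sort before "23". A defensive sketch (not part of the original solution) casts before ordering:

```sql
-- Casting score to int makes the window order numerically,
-- which stays correct even for scores with a different digit count.
select
    cid,
    score,
    dense_rank() over(partition by cid order by cast(score as int) desc) as rn
from student;
```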
Data
#Note:
Dataset 1 only works out by coincidence and suits the case without problematic duplicate scores; it can be solved with Variant 1 in the Extensions section.
Dataset 2 is the general case and covers duplicate scores.
#Dataset 1
insert overwrite table student values
("1","1901","90"),
("2","1901","90"),
("3","1901","83"),
("4","1901","60"),
("5","1902","66"),
("6","1902","23"),
("7","1902","99"),
("8","1902","67"),
("9","1902","87");
#Dataset 2
insert overwrite table student values
("1","1901","90"),
("2","1901","80"),
("3","1901","80"),
("4","1901","70"),
("5","1901","75"),
("6","1902","60"),
("7","1902","66"),
("8","1902","23"),
("9","1902","99"),
("10","1902","67");
Implementation
select
t1.cid,
t1.score,
t1.rank,
t2.diff_score
from
(select
cid,
score,
dense_rank() over(partition by cid order by score desc) rank
from
student
)t1
left join
(select
cid,
score,
nvl(score-lag(score) over(partition by cid order by score desc),0) diff_score
from
student
group by cid,score
)t2
-- join on cid as well as score: joining on score alone would mismatch
-- rows whenever two different classes happen to share a score
on t1.cid = t2.cid and t1.score = t2.score
where t1.rank <= 3
;
Result
t1.cid t1.score t1.rank t2.diff_score
1901 90 1 0.0
1901 80 2 -10.0
1901 80 2 -10.0
1901 75 3 -5.0
1902 99 1 0.0
1902 67 2 -32.0
1902 66 3 -1.0
Time taken: 88.539 seconds, Fetched: 7 row(s)
Analysis
1. The requirement calls for per-class windows, so over() partitions by class (cid).
2. Rows are ordered by score, descending by convention, and equal scores must share a rank, so dense_rank() is used.
3. The gap between consecutive ranks is computed with lag(), which fetches the value a given number of rows back, and then subtracting.
4. For the first row in each partition lag() has nothing to fetch and returns NULL, which must be converted to 0; nvl() does this more concisely than an if().
#Supplement
By default lag() fetches the previous row under the sort order. With data sorted as [90, 80, 80, ...], the second 80 still fetches 80, so the computed gaps come out as [0, -10, 0], which is clearly not what we want.
Fix [compute ranks and gaps separately, then join]: first do the ranking (table a); then build the gap table b by deduplicating the scores per class, ordering them, and taking the gap between consecutive distinct scores, yielding (rank, score, gap); finally join table a with table b to fill in the score gap for every row.
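The problem above can be reproduced with a minimal query (a sketch, assuming dataset 2 has been loaded; the cid = '1901' filter is only for illustration):

```sql
-- Class 1901 in dataset 2 sorts as 90, 80, 80, 75, 70.
-- The second 80 fetches the first 80 via lag(), so its gap comes out
-- as 0 rather than the -10 carried by the first 80.
select
    cid,
    score,
    nvl(score - lag(score) over(partition by cid order by score desc), 0) as naive_diff
from student
where cid = '1901';
```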
Extensions
#Variant 1
select
t1.cid,
t1.score,
t1.rank,
t1.diff_score
from
(select
cid,
score,
dense_rank() over(partition by cid order by score desc) rank,
nvl(score-lag(score) over(partition by cid order by score desc),0) diff_score
from
student
)t1
where rank <= 3
;
OK
1901 90 1 0.0
1901 90 1 0.0
1901 83 2 -7.0
1901 60 3 -23.0
1902 99 1 0.0
1902 87 2 -12.0
1902 67 3 -20.0
Time taken: 29.836 seconds, Fetched: 7 row(s)
#Variant 2
select
t1.cid,
t1.score,
t1.rank,
nvl(t1.score-lag(t1.score) over(partition by t1.cid order by t1.score desc),0) diff_score
from
(select
cid,
score,
dense_rank() over(partition by cid order by score desc) rank
from
student
)t1
where t1.rank <= 3
;
Total MapReduce CPU Time Spent: 6 seconds 740 msec
OK
1901 90 1 0.0
1901 90 1 0.0
1901 83 2 -7.0
1901 60 3 -23.0
1902 99 1 0.0
1902 87 2 -12.0
1902 67 3 -20.0
Time taken: 49.058 seconds, Fetched: 7 row(s)
#Variant 2 log
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-04-15 15:05:31,734 Stage-1 map = 0%, reduce = 0%
2021-04-15 15:05:39,196 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.4 sec
2021-04-15 15:05:46,505 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.88 sec
MapReduce Total cumulative CPU time: 3 seconds 880 msec
Ended Job = job_1618468831901_0003
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1618468831901_0004, Tracking URL = http://my-hdp2:8088/proxy/application_1618468831901_0004/
Kill Command = /usr/hdp/2.6.5.0-292/hadoop/bin/hadoop job -kill job_1618468831901_0004
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2021-04-15 15:05:55,424 Stage-2 map = 0%, reduce = 0%
2021-04-15 15:06:02,852 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.97 sec
2021-04-15 15:06:11,485 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.86 sec
MapReduce Total cumulative CPU time: 2 seconds 860 msec
Ended Job = job_1618468831901_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.88 sec HDFS Read: 8747 HDFS Write: 278 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.86 sec HDFS Read: 7919 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 740 msec
#Variant 1 log
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1618468831901_0002, Tracking URL = http://my-hdp2:8088/proxy/application_1618468831901_0002/
Kill Command = /usr/hdp/2.6.5.0-292/hadoop/bin/hadoop job -kill job_1618468831901_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-04-15 14:59:54,428 Stage-1 map = 0%, reduce = 0%
2021-04-15 15:00:06,550 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.97 sec
2021-04-15 15:00:14,991 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.91 sec
MapReduce Total cumulative CPU time: 2 seconds 910 msec
Ended Job = job_1618468831901_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.91 sec HDFS Read: 10536 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 910 msec
Comparing against the SQL in the Implementation section, the console job logs show that Variant 1 performs better and runs faster: a single MapReduce job taking 29.8 s, versus two jobs and 49.1 s for Variant 2.
#Default lag behavior
hive> select
> cid,
> score,
> dense_rank() over(partition by cid order by score desc) rank,
> lag(score) over(partition by cid order by score desc)
> from
> student
> ;
OK
1901 90 1 NULL
1901 90 1 90
1901 83 2 90
1901 60 3 83
1902 99 1 NULL
1902 87 2 99
1902 67 3 87
1902 66 4 67
1902 23 5 66
Time taken: 22.886 seconds, Fetched: 9 row(s)
Key points
1. nvl(value1, value2): returns value2 if value1 is NULL, otherwise returns value1.
2. dense_rank(): a ranking function used with the over() window; equal values share the same rank, and the following rank is not skipped, e.g. 1, 2, 2, 3 rather than the 1, 2, 2, 4 produced by rank().
3. lag(col, n): used with the over() window; fetches the value from the n-th row before the current row under the sort order; if n is omitted it defaults to 1 (the previous row).
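A small sketch tying points 1 and 3 together, assuming the student table above is loaded (the alias two_back is made up for illustration):

```sql
-- lag(score, 2) fetches the value two rows back within each class;
-- the first two rows of each partition have nothing to fetch,
-- so nvl() turns the resulting NULLs into 0.
select
    cid,
    score,
    nvl(lag(score, 2) over(partition by cid order by score desc), 0) as two_back
from student;
```

Hive's lag() also accepts a third argument as a default value, e.g. lag(score, 1, 0), which can sometimes replace the nvl() wrapper.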