SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list]
[SORT BY| ORDER BY col_list] ]
[LIMIT number]
order by
会对输入做全局排序,因此只有一个reducer,会导致当输入规模较大时,需要较长的计算时间。sort by
不是全局排序,其在数据进入reducer前完成排序。因此,如果用sort by进行排序,并且设置 mapred.reduce.tasks
>1,则sort by只保证每个reducer的输出有序,不保证全局有序。distribute by
(字段)根据指定的字段将数据分到不同的reducer,且分发算法是hash散列。Cluster by
(字段) 除了具有Distribute by
的功能外,还会对该字段进行排序。 —> distribute by + sort by
因此,如果分桶和sort字段是同一个时,此时, cluster by = distribute by + sort by
分桶表的作用:最大的作用是用来提高join操作的效率;
思考这个问题: select a.id,a.name,b.addr from a join b on a.id = b.id;
如果a表和b表已经是分桶表,而且分桶的字段是id字段 做这个join操作时,还需要全表做笛卡尔积吗?
全表查询
select * from score;
选择特定列
select s_id ,c_id from score;
列别名 1)重命名一个列。 2)便于计算。 3)紧跟列名,也可以在列名和别名之间加入关键字‘AS’
select s_id as myid ,c_id from score;
(count) select count(1) from score;
(max) select max(s_score) from score;
(min) select min(s_score) from score;
(sum) select sum(s_score) from score;
(avg) select avg(s_score) from score;
典型的查询会返回多行数据。LIMIT子句用于限制返回的行数。
select * from score limit 3;
select * from score where s_score > 60;
select * from score where s_score = 80;
select * from score where s_score between 80 and 100;
select * from score where s_score is null;
select * from score where s_score in(80,90);
% 代表零个或多个字符(任意个字符)。
_ 代表一个字符。
select * from score where s_score like '8%';
select * from score where s_score like '_9%';
select * from score where s_score rlike '[9]'; # like '%9%'
select * from score where s_score >80 and s_id = '01';
select * from score where s_score > 80 or s_id = '01';
select * from score where s_id not in ('01','02');
GROUP BY语句通常会和聚合函数一起使用,按照一个或者多个列队结果进行分组,然后对每个组执行聚合操作。
select s_id ,avg(s_score) from score group by s_id;
select s_id ,max(s_score) from score group by s_id;
select s_id ,avg(s_score) from score group by s_id;
select s_id ,avg(s_score) avgscore from score group by s_id having avgscore > 85;