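I have been using Hive joins a lot recently, so here is a summary of the common join types, to see whether Hive joins behave any differently from ordinary SQL joins. Create two files, ta.txt and tb.txt, and load the data: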
hive (cfpd_ods_safe)> load data local inpath '/data/bdp/bdp_etl_deploy/hduser06/jaysonding/ta.txt' into table ta;
hive (cfpd_ods_safe)> load data local inpath '/data/bdp/bdp_etl_deploy/hduser06/jaysonding/tb.txt' into table tb;
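The load statements assume that tables ta and tb already exist. The session does not show how they were created; a minimal sketch, assuming a single uid column stored as plain text (the column type, delimiter, and storage format are assumptions, not taken from the original):

```sql
-- Hypothetical DDL: the original session does not show the table definitions.
-- uid is assumed to be a string; int would work for this data too.
create table if not exists ta (uid string)
row format delimited fields terminated by '\t'
stored as textfile;

create table if not exists tb (uid string)
row format delimited fields terminated by '\t'
stored as textfile;
```

The column headers such as ta.uid in the query output are printed by the Hive CLI when hive.cli.print.header is set to true.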
Query the data:
hive (cfpd_ods_safe)> select * from ta;
OK
ta.uid
1111
2222
3333
4444
Time taken: 0.087 seconds, Fetched: 4 row(s)
hive (cfpd_ods_safe)> select * from tb;
OK
tb.uid
1111
2222
5555
Time taken: 0.183 seconds, Fetched: 3 row(s)
Now let's try the joins.
(1) Plain comma join:
hive (cfpd_ods_safe)> select * from ta,tb;
ta.uid tb.uid
1111 1111
1111 2222
1111 5555
2222 1111
2222 2222
2222 5555
3333 1111
3333 2222
3333 5555
4444 1111
4444 2222
4444 5555
Time taken: 21.328 seconds, Fetched: 12 row(s)
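As shown, a plain comma join without a condition yields the Cartesian product: 4 rows in ta times 3 rows in tb gives 12 rows. Now with a condition: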
hive (cfpd_ods_safe)> select * from ta,tb where ta.uid=tb.uid;
ta.uid tb.uid
1111 1111
2222 2222
Time taken: 23.147 seconds, Fetched: 2 row(s)
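The unconditioned comma join can also be written with the cross join keyword, which makes the Cartesian product explicit. A sketch against the same ta and tb tables, not run in the original session; depending on the environment, some configurations (for example hive.mapred.mode=strict, or hive.strict.checks.cartesian.product=true in newer versions) may reject Cartesian products:

```sql
-- Explicit Cartesian product; should return the same 12 rows as "select * from ta,tb".
select * from ta cross join tb;

-- Adding the condition in a where clause narrows it back to the matching rows.
select * from ta cross join tb where ta.uid = tb.uid;
```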
(2) Inner join (inner join):
hive (cfpd_ods_safe)> select * from ta inner join tb on ta.uid=tb.uid;
ta.uid tb.uid
1111 1111
2222 2222
Time taken: 21.597 seconds, Fetched: 2 row(s)
So inner join gives the same result as the comma join with a where condition.
(3) Left join (left join):
hive (cfpd_ods_safe)> select * from ta left join tb on ta.uid=tb.uid;
ta.uid tb.uid
1111 1111
2222 2222
3333 NULL
4444 NULL
Time taken: 22.921 seconds, Fetched: 4 row(s)
(4) Left outer join (left outer join):
hive (cfpd_ods_safe)> select * from ta left outer join tb on ta.uid=tb.uid;
ta.uid tb.uid
1111 1111
2222 2222
3333 NULL
4444 NULL
Time taken: 22.637 seconds, Fetched: 4 row(s)
(5) Full join (full join):
hive (cfpd_ods_safe)> select * from ta full join tb on ta.uid=tb.uid;
ta.uid tb.uid
1111 1111
2222 2222
3333 NULL
4444 NULL
NULL 5555
Time taken: 19.39 seconds, Fetched: 5 row(s)
(6) Full outer join (full outer join):
hive (cfpd_ods_safe)> select * from ta full outer join tb on ta.uid=tb.uid;
ta.uid tb.uid
1111 1111
2222 2222
3333 NULL
4444 NULL
NULL 5555
Time taken: 20.414 seconds, Fetched: 5 row(s)
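Conclusions:
(1) inner join has the same effect as the comma join; a comma join with a where condition is effectively shorthand for inner join.
(2) Any join written without a condition produces a Cartesian product.
(3) left join and left outer join are the same, as are full join and full outer join. The same holds for right join and right outer join.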
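The conclusion mentions right joins, but the session above does not include one. As a sketch against the same ta and tb tables (not run in the original session), a right join keeps every row of tb, so the result should contain the two matching pairs plus one row with NULL for ta.uid and 5555 for tb.uid:

```sql
-- Right join: every row of tb is kept; unmatched ta columns become NULL.
select * from ta right join tb on ta.uid = tb.uid;

-- Same query with the optional outer keyword.
select * from ta right outer join tb on ta.uid = tb.uid;
```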