I. What is Hive?
Hive is a data warehouse tool built on Hadoop (for offline/batch workloads). It maps structured data files to database tables and provides SQL-like query capabilities.
Features:
Hive clusters can be scaled out freely, and in most cases no service restart is required.
Hive supports user-defined functions, so users can implement functions tailored to their own needs.
Good fault tolerance: a SQL query can still complete even when a node fails.
II. Installing Hive
1: Upload and extract the Hive tarball
2: Install MySQL and enable it to start on boot
3: Configure hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
<description>password to use against metastore database</description>
</property>
</configuration>
4: Copy the MySQL JDBC driver jar (mysql*.jar) into $HIVE_HOME/lib
5: Add Hive's bin directory to the PATH environment variable
III. Hive Concepts and Operations
1: Start Hive as a background service
nohup hiveserver2 1>/dev/null 2>&1 &
2: Ways to connect
1: beeline, then connect from inside it:
!connect jdbc:hive2://hadoop01:10000
root   (the username entered at the prompt)
2: beeline -u jdbc:hive2://hadoop01:10000 -n root
3. Basic statements
Create a table and specify the field delimiter:
create table t_user(id string,name string) row format delimited fields terminated by ',';
View a table's structure: desc tablename;
4. Creating an external table
create external table t_access(ip String,url String,accesstime String )
row format delimited fields terminated by ','
location '/data/acc';
What is the difference between an external table and an internal (managed) table?
An internal table's data is stored under /user/hive/warehouse, and dropping the table deletes the data as well.
An external table's directory is specified by the user, and dropping the table does not delete the data. The sketch below shows the difference.
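A minimal sketch of what drop table does in each case (t_inner and t_outer are hypothetical managed and external tables):
drop table t_inner;  -- managed: removes the metadata AND the files under /user/hive/warehouse/t_inner
drop table t_outer;  -- external: removes only the metadata; the files at its location stay on HDFS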
5. Partitioned tables
create table t_accc(id String,url String,access_time String)
partitioned by(dt String)
row format delimited
fields terminated by ',';
Load data into a partition. Note: a partition column must not be one of the columns already defined in the table, and each partition's data is stored in its own directory (the sketch after the load statement shows the layout).
load data local inpath "/root/access.txt" into table t_accc partition(dt="2018-8-28");
Multiple partition columns:
create table t_user(id String,name String,age String)
partitioned by(birthday String,sex String)
row format delimited
fields terminated by ',';
load data local inpath "/root/test/user.txt" into table t_user partition(birthday="1949-10-1",sex="man");6.load数据 1)手动用hdfs命令,将文件放入表目录 2)本地(linux)加载数据:
load data local inpath "/root/test" into table t_user;3)从HDFS上加载数据
load data inpath "/" into table t_user;7.导出数据 1)将hive表中的数据导入HDFS的文件
insert overwrite directory '/root/access-data'
row format delimited fields terminated by ','
select * from t_access;
2) Export data from a Hive table to files on the local disk:
insert overwrite local directory '/root/access-data'
row format delimited fields terminated by ','
select * from t_access limit 100000;
8. CTAS table-creation syntax
1) Create a table with the same schema as an existing table:
create table t_user2 like t_user;
2) Create a table and populate it from a query in one statement:
create table t_access2
as
select ip,url from t_access;
9. Storage formats
stored as textfile;      -- the default: plain text
stored as sequencefile;
stored as parquet;       -- a parquet sketch follows the textfile example below
create table t_table(ip string,url string,access_time string)
row format delimited fields terminated by ','
stored as textfile;
insert into t_table select * from t_access;
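A parquet version of the same idea, via CTAS (t_access_parquet is a hypothetical table name):
create table t_access_parquet
stored as parquet
as select * from t_access;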
10. Complex data types
1) array
create table t_movie(movie_name string,actors array<string>,first_data string)
row format delimited
fields terminated by ","
collection items terminated by ":";
load data local.............
select * from t_movie;
select movie_name,actors[0] from t_movie;
select movie_name,actors from t_movie where array_contains(actors,'吴刚');
select movie_name,size(actors) from t_movie;
2) map
create table t_person(id int,name string,family_members map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
select * from t_person;
-- get the value of a specific key from the map column
select id,name,family_members['father'] as father from t_person;
-- get all keys of the map column
select id,name,map_keys(family_members) as relation from t_person;
-- get all values of the map column
select id,name,map_values(family_members) from t_person;
select id,name,map_values(family_members)[0] from t_person;
-- putting it together: find users who have a brother
select id,name,brother
from
(select id,name,family_members['brother'] as brother from t_person) tmp
where brother is not null;
3) struct
create table t_person_struct(id int,name string,info struct<age:int,sex:string,addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';
select id,name,info.age from t_person_struct;
11. Local-mode MapReduce for small queries
set hive.exec.mode.local.auto=true;
12. Joins
select * from t_a a join t_b b on a.name=b.name;
Supported variants: join (inner), left join, right join, full join, left semi join.
13. Grouping: group by ... having ...
When a filter condition uses an aggregate function, it can only go in the having clause, as in the sketch below.
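A minimal group-by sketch against the t_access table above:
select ip,count(*) as cnt
from t_access
group by ip
having count(*) > 10;  -- aggregate conditions belong in HAVING, not WHERE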
14. Subqueries: treat a query's result set as a table and query it again
select id,name,brother from
(select id,name,family_members['brother'] as brother from t_person) tmp
where brother is not null;
15. Sorting: order by ... [desc], as in the sketch below.
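A minimal ordering sketch on the t_access table above:
select ip,url,accesstime
from t_access
order by accesstime desc;  -- a global sort over the full result set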
select cast("5" as int) ;
select cast("2017-08-03" as date) ;
select cast(current_timestamp as date);
2) Math functions
select round(5.4);       -- 5, round half up
select round(5.1345,3);  -- 5.135
select ceil(5.4);        -- 6, round up (ceiling(5.4) is equivalent)
select floor(5.4);       -- 5, round down
select abs(-5.4);        -- 5.4, absolute value
select greatest(3,5,6);  -- 6, the largest of the inputs
select least(3,5,6);     -- 3, the smallest of the inputs
3) String functions
substr(string str, int start)                 -- take a substring
substring(string str, int start)
substr(string, int start, int len)
substring(string, int start, int len)
concat(string A, string B...)                 -- concatenate strings
concat_ws(string SEP, string A, string B...)  -- concatenate with a separator
length(string A)                              -- string length
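A quick usage sketch for a few of these (the input values are arbitrary):
select substr('hello',2,3);         -- ell (positions are 1-based)
select concat('a','b','c');         -- abc
select concat_ws('-','a','b','c');  -- a-b-c
select length('hello');             -- 5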
4) Date/time functions
select current_timestamp;  -- the current timestamp (full date and time)
select current_date;       -- the current date
-- get the current unix timestamp in seconds (seconds since 1970-01-01 00:00:00 GMT)
select unix_timestamp();
-- convert a unix timestamp to a string
from_unixtime(bigint unixtime[, string format])
Example: select from_unixtime(unix_timestamp());
select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");
-- convert a string to a unix timestamp
unix_timestamp(string date, string pattern)
Example: select unix_timestamp("2017-08-10 17:50:30");
select unix_timestamp("2017-08-10 17:50:30","yyyy-MM-dd HH:mm:ss");
-- convert a string to a date
select to_date("2017-09-17 16:58:32");
5) Conditional functions
5.1 case when: case when <condition> then <value> [when ... then ...] else <value> end, as in the sketch below
5.2 if
select id,name,age,if(age>40,'old','young') from t_user;
6) Collection functions
6.1 array_contains(Array<T>, value)
6.2 sort_array(): sort an array (see the sketch below)
6.3 size(Array<T>): return the number of elements in a collection
6.4 map_keys(Map<K,V>) and map_values(Map<K,V>)
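A quick sort_array sketch on the t_movie table above:
select movie_name, sort_array(actors) from t_movie;  -- actors sorted ascending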
7) Table-generating functions
7.1 Row-to-rows function: explode()
select distinct tmp.sub from
(select explode(subjects) as sub from t_student) as tmp;
7.2 Table-generating function: lateral view
Intuition: lateral view works like a join between two tables. The left table is the original table; the right table is the one produced by explode(<some collection column>); and the join only matches rows generated from the same original row.
select id,name,a.sub from
(select id,name,tmp.sub as sub from t_student lateral view explode(subjects) tmp as sub)a
where sub = '生物';
8) Parsing JSON
select json_tuple(json,'movie','rate','timeStamp','uid') as (movie,rate,ts,uid) from t_json;
Extract the minute from the timestamp:
minute(from_unixtime(cast(ts as bigint))) as minute,
9) Analytic functions: row_number() over() for per-group top-N
select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber;
17. User-defined functions (UDFs)
1. Write the function in Java: the class extends UDF and implements an evaluate() method
2. Package it as a .jar file and upload it to the Linux machine
3. In the Hive command line, run: add jar /root/test/hql.jar;
4. Create the Hive function bound to the Java class: create temporary function get_max as 'Test.getMaxIndex'; (the quoted string is the fully-qualified class name)
5. Run it: select get_max(1,2,3);
18. Sqoop
1) Import from MySQL into HDFS:
sqoop import --connect jdbc:mysql://192.168.1.3:3306/default --username root --password root --table t_user --columns 'id, name, age'
Specify the target HDFS directory:
sqoop import --connect jdbc:mysql://192.168.1.3:3306/default --username root --password root --table T_user --target-dir '/sqoop/td' --fields-terminated-by '\t'
Specify the number of map tasks by appending -m 2 at the end.
2) Export data from HDFS into a database (don't forget to specify the delimiter):
sqoop export --connect jdbc:mysql://192.168.1.3:3306/default --username root --password root --export-dir '/td3' --table td_bak -m 1 --fields-terminated-by ','
19. Operating Hive from Java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcTest {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // load the Hive JDBC driver
        Connection conn = DriverManager.getConnection("jdbc:hive2://hadoop01:10000", "root", ""); // connect to HiveServer2
        Statement statement = conn.createStatement(); // object for executing SQL statements
        ResultSet set = statement.executeQuery("select * from t_user");
        while (set.next()) {
            String str = set.getString(2); // read the second column (name)
            System.out.println(str);
        }
        set.close();
        statement.close();
        conn.close();
    }
}