RegexSerDe 可以从 Hive 两个jar文件的类中获取,hive-serde-<version>.jar
中的 org.apache.hadoop.hive.contrib.serde2.RegexSerDe
以及 hive-contrib-<version>.jar
中的 org.apache.hadoop.hive.serde2.RegexSerDe
。
下面这种格式是 Apache 的打出的 Web 日志文件格式。包含我们想要获取的两个字段信息,一个是日志时间,一个是日志Json:
[2018-06-04 00:00:09 INFO price:335] {"os":"adr","phone":"187xxxx3617", "business":"train", "price":"198"}
我们使用 RegexSerDe 类作为 SERDE 在正则表达式的帮助下处理上面日志:
CREATE EXTERNAL TABLE IF NOT EXISTS adv_push_price (
time string,
line string
)
PARTITIONED BY(
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES(
'input.regex' = '\\[(\\d*-\\d*-\\d*\\s+\\d*:\\d*:\\d*)\\s+INFO.*\\]\\s+(.*)',
'input.regex.case.insensitive' = 'false',
'output.format.string' = '%1$s %2$s'
)
LOCATION '/user/xiaosi/log/price';
上面是一个外表,从 /user/xiaosi/log/price
路径下加载数据,并经正则表达式的处理,对应到 time
和 line
两个字段上,现在我们查看一下Hive表中的数据:
hive> select * from adv_push_price limit 10;
OK
2018-06-04 00:00:00 {"os":"adr","phone":"187xxxx3617", "business":"train", "price":"198"} 20180604
从上面输出中我们可以看到数据符合我们的预期。
我们也可以使用 org.apache.hadoop.hive.contrib.serde2.RegexSerDe
:
CREATE EXTERNAL TABLE IF NOT EXISTS adv_push_price (
time string,
line string
)
PARTITIONED BY(
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES(
'input.regex' = '\\[(\\d*-\\d*-\\d*\\s+\\d*:\\d*:\\d*)\\s+INFO.*\\]\\s+(.*)',
'input.regex.case.insensitive' = 'false',
'output.format.string' = '%1$s %2$s'
)
LOCATION '/user/xiaosi/log/price';
在运行过程中我们遇到如下错误:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:329)
at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:364)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:106)
... 22 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
at org.apache.hadoop.hive.ql.plan.PartitionDesc.getDeserializer(PartitionDesc.java:175)
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:295)
... 24 more
上面的意思很明确,我们找不到 org.apache.hadoop.hive.contrib.serde2.RegexSerDe
类, hive-contrib-<version>.jar
可以在 $HIVE_HOME/lib
文件夹中找到,但是仍需要添加到环境变量中,将 hive-contrib-<version>.jar
配置到 Hive 会话中。如下所示将 hive-contrib-<version>.jar
添加到 HIVE_AUX_JARS_PATH
环境变量,将此 jar 永久添加到Hive会话中。在 conf/hive-site.xml
添加如下配置:
<property>
<name>hive.aux.jars.path</name>
<value>file:///home/q/hive/hive-2.1.0/lib/hive-contrib-2.1.0.jar</value>
</property>
org.apache.hadoop.hive.serde2.RegexSerDe
对应的 hive-serde-<version>.jar
默认包含在 hive 执行路径中,而 org.apache.hadoop.hive.contrib.serde2.RegexSerDe
对应的 hive-serde-<version>.jar
却不包含在 hive 执行路径中。
如果表中和数据中定义的列数不匹配,那么我们会遇到下面的错误消息: Diagnostic Messages for this Task: Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable 64.242.88.10 - - [07/Mar/2014:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846 at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:185) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable 64.242.88.10 - - [07/Mar/2014:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846 at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:501) at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:176) ... 8 more Caused by: org.apache.hadoop.hive.serde2.SerDeException: Number of matching groups doesn't match the number of columns at org.apache.hadoop.hive.serde2.RegexSerDe.deserialize(RegexSerDe.java:180) at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:136) at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(MapOperator.java:100) at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:492) ... 9 more FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
查看Hive表中声明的列数及其数据类型,以及正则表达式及其输出中的字段.format.string应包含相同数量的列。