Updating Data in Multiple Hive Partitions at Once

awwewwbbb

Published 2022-09-16 12:29:34

Part of the column: chaplinthink's column

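Scenario

Business tables such as order data carry state that keeps changing, for example order status and logistics status, so rows created long ago still have to be synced to Hive. How can a single sync operation in Hive update data in multiple partitions at once?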

Hive Operations

Enable Hive dynamic partitioning:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
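
Two related limits can matter when one insert fans out into many partitions. The sketch below raises them. The values are illustrative assumptions, not settings from the original article; Hive's documented defaults are 1000 and 100 respectively, and suitable numbers depend on the cluster and on how many distinct dates a batch can contain.

```sql
-- Upper bound on dynamic partitions that one statement may create in total (illustrative value)
SET hive.exec.max.dynamic.partitions=2000;
-- Upper bound on dynamic partitions that each mapper or reducer may create (illustrative value)
SET hive.exec.max.dynamic.partitions.pernode=500;
```

If a statement exceeds either limit, Hive fails the job rather than silently truncating the output, so raising the limits is only needed when that error actually appears.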

Create the partitioned tables.

Source table:

CREATE TABLE `ods_binlog_person`(
  `binlog_id` bigint,
  `binglog_es` bigint,
  `binlog_ts` bigint,
  `binlog_type` string,
  `id` bigint,
  `name` string,
  `score` int,
  `created_at` string,
  `updated` string)
PARTITIONED BY (`dt` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load the data:

load data inpath '/camus/exec/binlog/person/pt_hour=2022072400' into table ods_binlog_person partition (dt='2022072400');
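
Note that `load data inpath` without `local` moves the files from the HDFS path into the table's directory rather than copying them. A quick check after loading might look like the following sketch; the `limit 10` sample size is an arbitrary choice.

```sql
-- Confirm that the new partition was registered in the metastore
SHOW PARTITIONS ods_binlog_person;

-- Spot-check a few rows to see whether the tab-delimited fields line up with the schema
SELECT * FROM ods_binlog_person WHERE dt = '2022072400' LIMIT 10;
```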

Target table:

CREATE TABLE IF NOT EXISTS temp_partition_table(
  id string comment "field id",
  name string comment "field comment",
  `score` int,
  `created_at` string,
  `updated` string
) COMMENT "partitioned table"
PARTITIONED BY(`dt` string)
STORED AS ORC
TBLPROPERTIES("orc.compress"="SNAPPY");

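Insert data into the partitions:

-- The last column of the select supplies the dynamic partition value:
-- created_at reformatted as yyyyMMdd.
insert overwrite table temp_partition_table partition(dt)
select
  id,
  name,
  score,
  created_at,
  updated,
  from_unixtime(unix_timestamp(created_at,'yyyy-MM-dd HH:mm:ss'), 'yyyyMMdd') as dt
from ods_binlog_person where dt = '2022072400';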
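
Before running the insert, it can help to preview which target partitions the batch will produce and how many rows each will receive. The query below is a sketch that reuses the same date expression; the aliases `target_dt` and `row_count` are names chosen here for illustration.

```sql
-- Preview the partition value each source row would be routed to
SELECT
  from_unixtime(unix_timestamp(created_at, 'yyyy-MM-dd HH:mm:ss'), 'yyyyMMdd') AS target_dt,
  count(*) AS row_count
FROM ods_binlog_person
WHERE dt = '2022072400'
GROUP BY from_unixtime(unix_timestamp(created_at, 'yyyy-MM-dd HH:mm:ss'), 'yyyyMMdd');
```

A row whose `created_at` is NULL or does not match the pattern yields a NULL `target_dt`. With dynamic partitioning such rows are written to Hive's default partition (`__HIVE_DEFAULT_PARTITION__` unless configured otherwise), so a NULL group in this preview is worth investigating first.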

List the partitions:

show partitions temp_partition_table;


OK
dt=20220717
dt=20220720
Time taken: 0.175 seconds, Fetched: 2 row(s)
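
One caveat: `insert overwrite` with dynamic partitions replaces the entire contents of every partition that receives rows, while partitions that receive no rows are left untouched. If a target partition already holds rows from an earlier load, those rows are lost unless they are carried forward. The following is a minimal sketch of one way to merge instead, not the article's method. It assumes that `id` uniquely identifies a record, that `updated` sorts chronologically as a string, and that delete events in the binlog need no special handling; none of these is confirmed by the source.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

WITH incoming AS (
  -- New rows from the binlog batch, with the target partition value derived from created_at
  SELECT
    CAST(id AS string) AS id,
    name,
    score,
    created_at,
    updated,
    from_unixtime(unix_timestamp(created_at, 'yyyy-MM-dd HH:mm:ss'), 'yyyyMMdd') AS dt
  FROM ods_binlog_person
  WHERE dt = '2022072400'
),
existing AS (
  -- Rows already stored in the partitions that this batch is about to overwrite
  SELECT t.id, t.name, t.score, t.created_at, t.updated, t.dt
  FROM temp_partition_table t
  LEFT SEMI JOIN (SELECT DISTINCT dt FROM incoming) p ON t.dt = p.dt
),
merged AS (
  SELECT id, name, score, created_at, updated, dt FROM incoming
  UNION ALL
  SELECT id, name, score, created_at, updated, dt FROM existing
)
INSERT OVERWRITE TABLE temp_partition_table PARTITION (dt)
SELECT id, name, score, created_at, updated, dt
FROM (
  -- Keep only the most recently updated version of each id
  SELECT m.*, row_number() OVER (PARTITION BY id ORDER BY updated DESC) AS rn
  FROM merged m
) ranked
WHERE rn = 1;
```

When an incoming row and an existing row share the same `updated` value, `row_number()` picks one arbitrarily; adding a tiebreaker such as the binlog timestamp would make the choice deterministic.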

View the table definition:

show create table temp_partition_table;

or

desc temp_partition_table;

Once the data has been loaded into the target table, the source partition can be dropped to avoid keeping redundant data:

alter table ods_binlog_person drop partition(dt='2022072400');
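
Because `ods_binlog_person` is a managed table, dropping a partition also deletes its files, so a cautious sequence checks the load first. The sketch below is one possible ordering under that assumption; comparing the counts is a manual step, and the totals only match directly when the target partitions contain nothing but this batch.

```sql
-- Rows in the source partition that was loaded
SELECT count(*) AS source_rows FROM ods_binlog_person WHERE dt = '2022072400';

-- Rows now present in each target partition
SELECT dt, count(*) AS row_count FROM temp_partition_table GROUP BY dt;

-- Drop the source partition; IF EXISTS makes the statement safe to rerun
ALTER TABLE ods_binlog_person DROP IF EXISTS PARTITION (dt = '2022072400');
```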

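Conclusion

With Hive dynamic partitioning, the target table's partitions are generated from the business time in the source table, and each row is loaded into its matching partition. The corresponding source partition is then dropped, which avoids redundant data and saves space.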

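This article is part of the Tencent Cloud self-media sync exposure program and is shared from the author's personal site/blog.

Originally published: 2022-07-24. In case of infringement, contact cloudcommunity@tencent.com for removal.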
