首页
学习
活动
专区
圈层
工具
发布
首页
学习
活动
专区
圈层
工具
MCP广场
社区首页 >问答首页 >如何仅在单元格中不存在分区时插入覆盖分区?

如何仅在单元格中不存在分区时插入覆盖分区?
EN

Stack Overflow用户
提问于 2019-09-06 03:33:35
回答 2查看 3K关注 0票数 3

如何仅在单元格中不存在分区时插入覆盖分区?

就像头衔。我一直在做一些需要重写蜂巢表的事情。我有具有多个分区的表,我只希望在更改后重新运行代码时插入新分区而不更改现有分区。

EN

回答 2

Stack Overflow用户

回答已采纳

发布于 2019-09-06 06:07:42

您可以加入现有的分区列表,并将其添加到空条件的位置(而不是仅加入)。此外,您还可以使用NOT存在(它将生成与Hive中的左联接相同的计划),如下所示:

代码语言:javascript
运行
复制
   insert overwrite table target_table partition (partition_key)
    select col1, ... coln, s.partition_key
      from source s 
           left join (select distinct partition_key --existing partitions
                       from target_table
                     ) t on s.partition_key=t.partition_key
     where t.partition_key is NULL; --no partitions exists in the target
票数 2
EN

Stack Overflow用户

发布于 2019-09-06 07:54:25

一种选择是连接(分区列上的左联接作为键)、从目标表中带有不同分区列的源数据集,并筛选出共同的分区。您知道我的意思;您的蜂巢查询应该如下所示:

代码语言:javascript
运行
复制
insert overwrite table target_table partition (partition_column1, partition_column2, ..., partition_columnN)
select
   src.column1,
   src.column2,
   ....,
   src.columnN,
   src.partition_column1,
   src.partition_column2,
   ....,
   src.partition_columnN
from
   source src 
   left join
      (
         select distinct
            partition_column1,
            partition_column2,
            ....,
            partition_columnN
         from
            target
      )
      tgt 
      on src.partition_column1 = tgt.partition_column1 
      and src.partition_column1 = tgt.partition_column1
      ...
      src.partition_columnN = tgt.partition_columnN 
where
   tgt.partition_column1 is null 
   or tgt.partition_column2 is null
   ...
   tgt.partition_columnN is null;

下面给出了这个逻辑的简单演示:

让我们创建两个名为orders的表和orders_source。order表将是目标表,而orders_source是源表。为了简单起见,我对这两个表都使用了类似的模式。

代码语言:javascript
运行
复制
CREATE TABLE `orders`(
  `id` int, 
  `customer_id` int, 
  `shipper_id` int)
PARTITIONED BY ( 
  `state` string,
  `order_date` date)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='id,customer_id', 
  'orc.compress'='SNAPPY', 
  'orc.compress.size'='262144', 
  'orc.create.index'='true', 
  'orc.row.index.stride'='3000', 
  'orc.stripe.size'='268435456');

CREATE TABLE `orders_source`(
  `id` int, 
  `customer_id` int, 
  `shipper_id` int)
PARTITIONED BY ( 
  `state` string,
  `order_date` date)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='id,customer_id', 
  'orc.compress'='SNAPPY', 
  'orc.compress.size'='262144', 
  'orc.create.index'='true', 
  'orc.row.index.stride'='3000', 
  'orc.stripe.size'='268435456');

接下来,插入一些用于测试逻辑的示例记录:

代码语言:javascript
运行
复制
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

insert overwrite table orders partition (state, order_date) 
select
   orde.id,
   orde.customer_id,
   orde.shipper_id,
   orde.state,
   orde.order_date 
from
   (
      select
         10240 as id,
         20480 as customer_id,
         30720 as shipper_id,
         'CA' as state,
         '2019-09-01' as order_date 
      union all
      select
         10241 as id,
         20481 as customer_id,
         30721 as shipper_id,
         'GA' as state,
         '2019-09-01' as order_date
   )
   orde;

insert overwrite table orders_source partition (state, order_date) 
select
   orso.id,
   orso.customer_id,
   orso.shipper_id,
   orso.state,
   orso.order_date 
from
   (
      select
         10240 as id,
         20480 as customer_id,
         30720 as shipper_id,
         'CA' as state,
         '2019-09-01' as order_date 
      union all
      select
         10242 as id,
         20482 as customer_id,
         30722 as shipper_id,
         'CA' as state,
         '2019-09-02' as order_date 
      union all
      select
         10243 as id,
         20483 as customer_id,
         30723 as shipper_id,
         'FL' as state,
         '2019-09-02' as order_date 
      union all
      select
         10244 as id,
         20484 as customer_id,
         30724 as shipper_id,
         'TX' as state,
         '2019-09-02' as order_date
   )
   orso;

现在,让我们在运行实际的业务逻辑之前检查我们在两个表中插入的数据:

代码语言:javascript
运行
复制
hive (default)> select * from orders_source;
OK
orders_source.id    orders_source.customer_id   orders_source.shipper_id    orders_source.state orders_source.order_date
10240   20480   30720   CA  2019-09-01
10242   20482   30722   CA  2019-09-02
10243   20483   30723   FL  2019-09-02
10244   20484   30724   TX  2019-09-02
Time taken: 0.085 seconds, Fetched: 4 row(s)

hive (default)> select * from orders;
OK
orders.id   orders.customer_id  orders.shipper_id   orders.state    orders.order_date
10240   20480   30720   CA  2019-09-01
10241   20481   30721   GA  2019-09-01
Time taken: 0.073 seconds, Fetched: 2 row(s)

接下来,执行我们的逻辑,从源表中选择记录并插入到目标表中。您可以运行以下查询:

代码语言:javascript
运行
复制
hive (default)> select
   orso.id,
   orso.customer_id,
   orso.shipper_id,
   orso.state,
   orso.order_date 
from
   orders_source orso 
   left join
      (
         select distinct
            state,
            order_date 
         from
            orders
      )
      orde 
      on orso.state = orde.state 
      and orso.order_date = orde.order_date 
where
   orde.state is null 
   or orde.order_date is null;
OK
orso.id orso.customer_id    orso.shipper_id orso.state  orso.order_date
10243   20483   30723   FL  2019-09-02
10244   20484   30724   TX  2019-09-02
10242   20482   30722   CA  2019-09-02
Time taken: 11.113 seconds, Fetched: 3 row(s)

您可以看到上面的结果。

最后,通过发出以下查询将记录插入目标表:

代码语言:javascript
运行
复制
insert overwrite table orders partition (state, order_date)
select
   orso.id,
   orso.customer_id,
   orso.shipper_id,
   orso.state,
   orso.order_date 
from
   orders_source orso 
   left join
      (
         select distinct
            state,
            order_date 
         from
            orders
      )
      orde 
      on orso.state = orde.state 
      and orso.order_date = orde.order_date 
where
   orde.state is null 
   or orde.order_date is null;

现在,让我们在insert操作之后验证目标表中的数据。

代码语言:javascript
运行
复制
hive (default)> select * from orders;
OK
orders.id   orders.customer_id  orders.shipper_id   orders.state    orders.order_date
10240   20480   30720   CA  2019-09-01
10242   20482   30722   CA  2019-09-02
10243   20483   30723   FL  2019-09-02
10241   20481   30721   GA  2019-09-01
10244   20484   30724   TX  2019-09-02
Time taken: 0.074 seconds, Fetched: 5 row(s)

就这样。你已经准备好了!

票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/57815367

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档