

在大数据分析中,去重计算(如 Count Distinct)是一个常见但计算开销极高的操作,尤其在高基数和高并发场景下,常常成为查询性能的瓶颈。以用户访问行为为例,同一用户一天内多次访问页面时,PV 会累加,而 UV 应仅记一次,这种典型的去重统计在实际业务中非常常见。然而,直接执行去重操作在数据量大、查询频繁的条件下,往往效率低下。
针对这一挑战,StarRocks 提供了多种优化策略,常见做法包括:以精度换取性能、将高成本的数据类型(如 String)转为低成本类型(如 Int/BigInt)、利用 Bitmap 或 HLL 等高效的数据结构,以及通过物化视图实现预计算。接下来,本文将围绕这些方案展开详细分析,并结合实际应用场景,评估它们在性能、精度与易用性之间的权衡。
比如基于 SSB 中的lineorder表作为示例,如何加速计算基于lineorder的一些去重计算呢?
CREATE TABLE IF NOT EXISTS `lineorder` (
`lo_orderkey` int(11) NOT NULL COMMENT "",
`lo_linenumber` int(11) NOT NULL COMMENT "",
`lo_custkey` int(11) NOT NULL COMMENT "",
`lo_partkey` int(11) NOT NULL COMMENT "",
`lo_suppkey` int(11) NOT NULL COMMENT "",
`lo_orderdate` datetime NOT NULL COMMENT "",
`lo_orderpriority` varchar(16) NOT NULL COMMENT "",
`lo_shippriority` int(11) NOT NULL COMMENT "",
`lo_quantity` int(11) NOT NULL COMMENT "",
`lo_extendedprice` int(11) NOT NULL COMMENT "",
`lo_ordtotalprice` int(11) NOT NULL COMMENT "",
`lo_discount` int(11) NOT NULL COMMENT "",
`lo_revenue` int(11) NOT NULL COMMENT "",
`lo_supplycost` int(11) NOT NULL COMMENT "",
`lo_tax` int(11) NOT NULL COMMENT "",
`lo_commitdate` int(11) NOT NULL COMMENT "",
`lo_shipmode` varchar(11) NOT NULL COMMENT ""
) DUPLICATE KEY(`lo_orderkey`)
PARTITION BY date_trunc('day', `lo_orderdate`);
-- Q1: How to speed up this query?
SELECT
lo_orderdate,
lo_custkey,
lo_orderkey,
count (distinct lo_orderkey) as ndv10,
count(distinct case when lo_revenue > 1 then lo_orderkey else 0 end) as ndv12
FROM
lineorder
GROUP BY 1, 2, 3 ORDER BY 1, 2, 3 LIMIT 10;
-- Q2: How to speed up this query?
SELECT
lo_orderdate,
count(distinct lo_shipmode) as ndv4
FROM
lineorder
GROUP BY 1 ORDER BY 1 LIMIT 10;;
-- Q3: How to speed up this query?
SELECT
count(distinct case when lo_discount > 1 then lo_orderkey else 0 end) as ndv11,
count(distinct case when lo_revenue > 1 then lo_orderkey else 0 end) as ndv12
FROM
lineorder;
-- Q4: How to speed up this query?
SELECT
count(distinct case when lo_discount > 1 then lo_orderkey else 0 end) as ndv11,
count(distinct case when lo_revenue > 1 then lo_orderkey else 0 end) as ndv12
FROM
lineorder
WHERE lo_orderdate >= 19930101 and lo_orderdate <= 19950101;优化 Count Distinct 操作有多种方法可选,但如何选择合适的方案呢?

精度越高,性能越差。精度排序(从高到低)如下:

优点:
缺点:
https://docs.starrocks.io/docs/using_starrocks/distinct_values/Using_HLL/
优点:
缺点:
https://docs.starrocks.io/zh/docs/using_starrocks/distinct_values/Using_bitmap/

根据预估的基数选择合适的类型,以避免哈希冲突。这可以与生成列结合使用。举例来说:
CREATE TABLE dest_hash_ge (
imsi STRING,
imsi_hash BIGINT AS xx_hash3_64(imsi)
);
select count(distinct imsi_hash) from dest_hash_ge;优点:
直观且易于使用,无需外部依赖。
缺点:
精度可能会受到基数的影响而降低。
https://docs.starrocks.io/docs/category/hash/
整体方法:使用字典表将字符串转换为整数,然后将整数插入目标表。这样,统计不同整数的速度比统计不同字符串快得多。
优点:
缺点:
https://docs.starrocks.io/docs/using_starrocks/query_acceleration_with_auto_increment/
在物化视图构建的过程中,我们需要考虑以下几个关键点:
4.1.1 数据分布
物化视图的数据分布是构建时首要考虑的因素,对查询性能和刷新稳定性影响显著。
4.1.2 索引
可在物化视图之上添加必要的索引,以提升带谓词的查询性能。类似于 StarRocks 内表,可以针对过滤列添加适当的 Bitmap 或 BloomFilter 索引,以减少磁盘 I/O,从而优化查询效率。(https://docs.starrocks.io/docs/table_design/indexes/#bitmap-indexes)。
4.1.3 通用性
用户通常期望构建的物化视图具备一定的通用性,以便在性能与维护成本之间取得平衡,减少物化视图数量并降低维护代价。为了实现这一目标,可采用以下策略:
4.1.4 数据刷新
当前所有物化视图均采用分区设计,并统一设置为 Manual(手动)刷新方式,用于同步物化视图与基表数据。之所以选择手动刷新,是因为基表经常发生删表重建操作,如果采用自动刷新,将可能导致物化视图数据过期并触发全量重刷,造成大量资源与时间浪费。
考虑到基表数据量较大,且查询通常只涉及最近 6 个月的数据,在刷新过程中会通过手动指定刷新范围,仅同步最近分区。例如:
REFRESH MATERIALIZED VIEW mv1 PARTITION START ("2023-04-01") END ("2023-09-30");后续可通过设置 partition_ttl 或 partition_ttl_number 参数,自动控制物化视图仅保留最近时间范围或指定数量的分区。
4.1.5 时效性问题
为保障Query在物化视图改写的过程中的严格一致,目前物化视图改写默认只对已经刷新的数据改写。但如果基表有更新,目前可以通过如下方式:
为保障在物化视图改写过程中的查询严格一致性,StarRocks 默认仅对已刷新完成的物化视图进行改写。但在基表发生更新时,可以通过以下方式调整:
query_rewrite_consistency : 调整为loose ,不校验物化视图的时效性,计算结果以物化视图为准,而不再以查询本身为准;mv_rewrite_staleness_second : 调整物化视图改写可以接受的最大延迟时间(单位:秒);Union Rewrite :对于必须保证严格一致性的查询,可在改写过程中将基表数据与物化视图结果通过 Union 合并的方式进行改写,提升查询性能。针对前文中的几个 Query,可根据上述物化视图构建策略,在不同场景下(精确去重或非精确去重)构建合适的物化视图,以加速计算。
4.2.1 精确去重
精确去重旨在确保基于物化视图计算的结果与直接执行 COUNT(DISTINCT) 查询的结果完全一致。
基于 Bitmap+MV 加速精确去重
若无法转换,则这些值会被视为 NULL。
CREATE MATERIALIZED VIEW `test_mv1`
PARTITION BY (`lo_orderdate`)
DISTRIBUTED BY HASH(lo_orderdate, lo_custkey, lo_orderkey)
REFRESH DEFERRED MANUAL
PROPERTIES (
"replication_num" = "1"
)
AS
SELECT
lo_orderdate,
lo_custkey,
lo_orderkey,
-- ndv: multi count-distinct metrics/columns
bitmap_union(to_bitmap(lo_orderkey)) as ndv10,
bitmap_union(to_bitmap(lo_linenumber)) as ndv2,
bitmap_union(to_bitmap(lo_orderpriority)) as ndv3,
-- DANGER: lo_shipmode is string type only works if lo_shipmode can cast to bigint implicitly,
-- otherwise null for non-casted rows.
bitmap_union(to_bitmap(lo_shipmode)) as ndv4,
-- ndv: the same count distinct column with different conditions
bitmap_union(to_bitmap(case when lo_discount > 1 then lo_orderkey else 0 end)) as ndv11,
bitmap_union(to_bitmap(case when lo_revenue > 1 then lo_orderkey else 0 end)) as ndv12
FROM
lineorder
GROUP BY 1, 2, 3;基于 ArrayAgg + MV 加速精确去重
CREATE MATERIALIZED VIEW `test_mv2`
PARTITION BY (`lo_orderdate`)
DISTRIBUTED BY HASH(lo_orderdate, lo_custkey, lo_orderkey)
REFRESH DEFERRED MANUAL
PROPERTIES (
"replication_num" = "1"
)
AS
SELECT
lo_orderdate,
lo_custkey,
lo_orderkey,
-- ndv: multi count-distinct metrics/columns with numeric types
array_distinct(array_agg(lo_orderkey)) as ndv10,
array_distinct(array_agg(lo_linenumber)) as ndv2,
array_distinct(array_agg(lo_orderpriority)) as ndv3,
-- ndv: input is string type
array_distinct(array_agg(lo_shipmode)) as ndv4,
-- ndv: the same count distinct column with different conditions
array_distinct(array_agg(case when lo_discount > 1 then lo_orderkey else 0 end)) as ndv11,
array_distinct(array_agg(case when lo_revenue > 1 then lo_orderkey else 0 end)) as ndv12
FROM
lineorder
GROUP BY 1, 2, 3;4.2.2 非精确去重
精确去重无法保证基于物化视图计算的结果依然是精确的,与直接使用 count distinct 相比,结果可能存在误差。因此,在实际使用中需要根据具体业务场景权衡选择。
基于 Bitmap + MV 加速近似去重
bitmap_hash 代替to_bitmap, 但由于其实现原理不同,存在因 hash 冲突导致去重结果不精确问题;hash 值(取值范围为 0 ~ 2^32 区间的 unsigned int),再将该 hash 值记录到Bitmap 中;CREATE MATERIALIZED VIEW `test_mv3`
PARTITION BY (`lo_orderdate`)
DISTRIBUTED BY HASH(lo_orderdate, lo_custkey, lo_orderkey)
REFRESH DEFERRED MANUAL
PROPERTIES (
"replication_num" = "1"
)
AS
SELECT
lo_orderdate,
lo_custkey,
lo_orderkey,
-- ndv: multi count-distinct metrics/columns
bitmap_union(bitmap_hash(lo_orderkey)) as ndv10,
bitmap_union(bitmap_hash(lo_linenumber)) as ndv2,
bitmap_union(bitmap_hash(lo_orderpriority)) as ndv3,
-- For string type input, use bitmap_hash to compute its hash(0~max(uint64) which
-- may be hash-conflict and store the hash value into bitmap
bitmap_union(bitmap_hash(lo_shipmode)) as ndv3,
-- ndv: the same count distinct column with different conditions
bitmap_union(bitmap_hash(case when lo_discount > 1 then lo_orderkey else 0 end)) as ndv11,
bitmap_union(bitmap_hash(case when lo_revenue > 1 then lo_orderkey else 0 end)) as ndv12
FROM
lineorder
GROUP BY 1, 2, 3;基于 HLL+MV 加速精确去重
hyperloglog 格式进行存储;CREATE MATERIALIZED VIEW `test_mv4`
PARTITION BY (`lo_orderdate`)
DISTRIBUTED BY HASH(lo_orderdate, lo_custkey, lo_orderkey)
REFRESH DEFERRED MANUAL
PROPERTIES (
"replication_num" = "1"
)
AS
SELECT
lo_orderdate,
lo_custkey,
lo_orderkey,
-- ndv: multi count-distinct metrics/columns
hll_union(hll_hash(lo_orderkey)) as ndv10,
hll_union(hll_hash(lo_linenumber)) as ndv2,
hll_union(hll_hash(lo_orderpriority)) as ndv3,
-- ndv: column with string type
hll_union(hll_hash(lo_shipmode)) as ndv4,
-- ndv: the same count distinct column with different conditions
hll_union(hll_hash(case when lo_discount > 1 then lo_orderkey else 0 end)) as ndv11,
hll_union(hll_hash(case when lo_revenue > 1 then lo_orderkey else 0 end)) as ndv12
FROM
lineorder
GROUP BY 1, 2, 3;在创建完上述任一物化视图后,即可对相关查询实现透明改写与加速。但在实际使用中,需关注以下要点:
通过合理选型与参数配置,既可以大幅提升查询性能,又能在精度与资源之间找到最佳平衡点。希望本文的内容能为你在复杂去重计算场景中提供清晰的技术参考与实操思路。
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。