1. Incremental tables and MERGE
Updating tables efficiently and correctly matters. The ideal scenario is when your transactions have a primary key that is a unique, auto-incrementing integer. In that case the table update is simple:
insert target_table (transaction_id)
select transaction_id from source_table
where
transaction_id > (select max(transaction_id) from target_table)
;
This is not always the case when you work with denormalized star-schema datasets in a data warehouse. You might be tasked to create sessions with SQL and/or incrementally update a dataset with just a portion of the data. A transaction_id might not exist; instead you will have to deal with a data model where the unique key depends on the latest transaction_id (or timestamp) known. For example, user_id in a last_online dataset depends on the latest known connection timestamp. In this case you would want to update existing users and insert the new ones.
You can use MERGE, or you can split the operation in two: one to update existing records with new ones, and one to insert completely new records that don't exist yet (the LEFT JOIN case).
MERGE is a statement commonly used in relational databases. The Google BigQuery MERGE command is one of the data manipulation language (DML) statements. It is typically used to perform three main functions atomically in one single statement: UPDATE, INSERT, and DELETE. This means the MERGE command lets you merge BigQuery data by updating, inserting, and deleting rows in a BigQuery table.
Consider this SQL:
create temp table last_online as ( -- "last online" table
select 1 as user_id
, timestamp('2000-10-01 00:00:01') as last_online
)
;
create temp table connection_data ( -- partitioned connections table
user_id int64
,timestamp timestamp
)
PARTITION BY DATE(_PARTITIONTIME)
;
insert connection_data (user_id, timestamp)
select 2 as user_id
, timestamp_sub(current_timestamp(),interval 28 hour) as timestamp
union all
select 1 as user_id
, timestamp_sub(current_timestamp(),interval 28 hour) as timestamp
union all
select 1 as user_id
, timestamp_sub(current_timestamp(),interval 20 hour) as timestamp
union all
select 1 as user_id
, timestamp_sub(current_timestamp(),interval 1 hour) as timestamp
;
merge last_online t
using (
select
user_id
, last_online
from
(
select
user_id
, max(timestamp) as last_online
from
connection_data
where
date(_partitiontime) >= date_sub(current_date(), interval 1 day)
group by
user_id
) y
) s
on t.user_id = s.user_id
when matched then
update set last_online = s.last_online, user_id = s.user_id
when not matched then
insert (last_online, user_id) values (s.last_online, s.user_id)
;
select * from last_online
;
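As mentioned above, instead of MERGE you could split this into two statements: an UPDATE for users that already exist, and an INSERT for brand-new ones (the LEFT JOIN case). A minimal sketch against the same temp tables, assuming the MERGE above has not already been run:

```sql
-- 1) Update existing users with their latest known connection timestamp:
update last_online t
set last_online = s.last_online
from (
  select user_id, max(timestamp) as last_online
  from connection_data
  group by user_id
) s
where t.user_id = s.user_id
;
-- 2) Insert users that don't exist in the target yet (the LEFT JOIN case):
insert last_online (user_id, last_online)
select s.user_id, s.last_online
from (
  select user_id, max(timestamp) as last_online
  from connection_data
  group by user_id
) s
left join last_online t on t.user_id = s.user_id
where t.user_id is null   -- anti-join: keep only brand-new users
;
```

The trade-off is two table scans and two DML jobs instead of one, which is why MERGE is usually preferable.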
Here we performed both the UPDATE and the INSERT in a single operation.
Running UNNEST() and checking whether a word you need is in a list can be useful in many scenarios, e.g., sentiment analysis:
with titles as (
select 'Title with word foo' as title union all
select 'Title with word bar'
)
, data as (
select
title,
split(title, ' ') as words
from
titles
)
select * from data, unnest(words) words
where
words in ('bar')
;
This gives us an opportunity to save a few lines of code and be more eloquent code-wise. Normally you would put this into a sub-query and add a filter in the where clause, but instead you can do this:
with daily_revenue as (
select
current_date() as dt
, 100 as revenue
union all
select
date_sub(current_date(), interval 1 day) as dt
, 100 as revenue
)
select
*
from daily_revenue
where
if(revenue >101,1,0) = 1
;
Another example is how NOT to use it with partitioned tables. Don't do this. This is a bad example: because the matching table suffixes are probably determined dynamically (based on something in your table), you will be charged for a full table scan.
SELECT *
FROM `firebase.events`
WHERE IF(condition,
_TABLE_SUFFIX BETWEEN '20170101' AND '20170117',
_TABLE_SUFFIX BETWEEN '20160101' AND '20160117')
;
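By contrast, keeping the `_TABLE_SUFFIX` predicate static lets BigQuery prune the wildcard tables, so you only pay for the range you actually scan. A sketch of the safer pattern (the date range is illustrative):

```sql
-- Static suffix bounds: BigQuery can prune tables outside the range
-- before scanning, so only ~17 daily tables are billed.
SELECT *
FROM `firebase.events`
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170117'
;
```

If the range truly must be dynamic, compute it in a script variable first rather than inside an IF() over `_TABLE_SUFFIX`.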
The ROLLUP function is used to perform aggregation at multiple levels. It is useful when you have to work with dimension graphs.
The following query returns the total credit spend per day by transaction type (is_gift) specified in the where clause, and it also shows the total spend for each day as well as the total spend across all dates available.
with data as (
select
current_timestamp() as ts
,'stage' as context_type
,1 as user_id
,100 as credit_value
, true as is_gift
union all
select
timestamp_sub(current_timestamp(), interval 24 hour) as ts
,'user' as context_type
,1 as user_id
,200 as credit_value
,false as is_gift
union all
select
timestamp_sub(current_timestamp(), interval 24*2 hour) as ts
,'user' as context_type
,3 as user_id
,300 as credit_value
,true as is_gift
)
, results as (
select
date(ts) as date
,context_type
,sum(credit_value)/100 as daily_credits_spend
from data
group by rollup(1, context_type)
order by 1
)
select
date
,if(context_type is null, 'total', context_type) as context_type
,daily_credits_spend
from results
order by date
;
Imagine you need to convert your table into a JSON object where each record is an element of a nested array. This is where the to_json_string() function becomes useful:
with mytable as (
select 1 as x, 'foo' as y, true as z union all
select 2, 'bar', false
)
select
concat("{", "\"MyTable\":", "[", string_agg(to_json_string(t), ","), "]", "}")
from mytable as t
;
Then you can use it anywhere: dates, marketing funnels, indices, histograms, etc.
Given user_id, date, and total_cost columns: for each date, how do you show the total revenue value per customer while keeping all the rows? You can achieve it like this:
select
date
,user_id
,total_cost
,sum(total_cost) over (partition by date,user_id) as revenue_per_day
from production.payment_transaction
;
Very often BI developers are tasked with adding a moving average to their reports and fabulous dashboards. This might be a 7-, 14-, or 30-day/month or even yearly MA line chart. So how do we do it?
with dates as (
select
dt
from
unnest(generate_date_array(date_sub(current_date(), interval 90 day), current_date(), interval 1 day)) as dt
)
, data as (
select dt
, CEIL(RAND()*1000) as revenue -- just some random data.
from
dates
)
select
dt
, revenue
, AVG(revenue) OVER(ORDER BY unix_date(dt) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) as seven_day_moving_average
from data
;
It becomes really handy when you work with user retention or want to check a dataset for missing values, i.e. dates. BigQuery has a function called GENERATE_DATE_ARRAY:
select
dt
from
unnest(generate_date_array('2019-12-04', '2020-09-17', interval 1 day)) as dt
;
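For example, to spot missing dates in a dataset you can left-join your data onto a generated calendar. A sketch with mocked data (the daily_sales CTE and the gap on 2020-01-03 are illustrative):

```sql
with calendar as (
  select dt
  from unnest(generate_date_array('2020-01-01', '2020-01-05', interval 1 day)) as dt
)
, daily_sales as (  -- mock data with a missing date
  select date '2020-01-02' as dt, 100 as revenue union all
  select date '2020-01-04', 200
)
select
  c.dt
  , s.revenue   -- NULL here reveals a date with no data
from calendar c
left join daily_sales s on s.dt = c.dt
order by c.dt
;
```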
This is useful to get the most recent record from your data, i.e. the latest updated record, etc., or even to remove duplicates:
SELECT *
FROM table_a a
LEFT JOIN table_b b
ON a.id = b.id
WHERE b.date < a.date
QUALIFY ROW_NUMBER() OVER (PARTITION BY b.id ORDER BY b.date desc) = 1
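As a minimal deduplication sketch, keeping only the latest row per id (the table and columns here are just mock data):

```sql
with raw_events as (
  select 1 as id, date '2022-01-01' as updated_at union all
  select 1, date '2022-01-03' union all
  select 2, date '2022-01-02'
)
select *
from raw_events
where true   -- BigQuery requires a WHERE, GROUP BY, or HAVING alongside QUALIFY
qualify row_number() over (partition by id order by updated_at desc) = 1
;
```

Compared with the classic GROUP BY or sub-query approaches, QUALIFY filters on the window function directly and keeps whole rows.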
These are also called ranking functions. Tend to use DENSE_RANK as the default ranking function, as it doesn't skip the next available rank whereas RANK would: it returns consecutive rank values. You can use it with a partition, which divides the results into distinct buckets. If rows in a partition have the same values, they receive the same rank. Example:
with top_spenders as (
select 1 as user_id, 100 as total_spend, 11 as reputation_level union all
select 2 as user_id, 250 as total_spend, 11 as reputation_level union all
select 3 as user_id, 250 as total_spend, 11 as reputation_level union all
select 4 as user_id, 300 as total_spend, 11 as reputation_level union all
select 11 as user_id, 1000 as total_spend, 22 as reputation_level union all
select 22 as user_id, 1500 as total_spend, 22 as reputation_level union all
select 33 as user_id, 1500 as total_spend, 22 as reputation_level union all
select 44 as user_id, 2500 as total_spend, 22 as reputation_level
)
select
user_id
, rank() over(partition by reputation_level order by total_spend desc) as rank
, dense_rank() over(partition by reputation_level order by total_spend desc) as dense_rank
from
top_spenders
;
Another example with product prices:
with products as (
select
2 as product_id
, 'premium_account' as product_type
, 100 as total_cost
union all
select
1 as product_id
, 'premium_group' as product_type
, 200 as total_cost
union all
select
111 as product_id
, 'bots' as product_type
, 300 as total_cost
union all
select
112 as product_id
, 'bots' as product_type
, 400 as total_cost
union all
select
113 as product_id
, 'bots' as product_type
, 500 as total_cost
union all
select
213 as product_id
, 'bots' as product_type
, 300 as total_cost
)
select * from (
select
product_id
, product_type
, total_cost as product_price
, dense_rank () over (
partition by product_type
order by total_cost desc
) price_rank
from
products
) t
where price_rank < 3
;
Pivot changes rows to columns. That's all it does. Unpivot does the opposite. And it is finally available in BigQuery:
select * from
(
-- #1 from_item
select
extract(month from dt) as mo
,product_type
,revenue
from (
select
date(current_date()) as dt
, 'premium_account' as product_type
, 100 as revenue
union all
select
date_sub(current_date(), interval 1 month) as dt
, 'premium_group' as product_type
, 200 as revenue
union all
select
date_sub(current_date(), interval 2 month) as dt
, 'bots' as product_type
, 300 as revenue
)
)
pivot
(
-- #2 aggregate
avg(revenue) as avg_revenue_
-- #3 pivot_column
for product_type in ('premium_account', 'premium_group')
)
;
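UNPIVOT, mentioned above, does the reverse: it folds columns back into rows. A minimal sketch on mock data (the column names are illustrative):

```sql
-- Turn two revenue columns into (product_type, revenue) rows.
select product_type, revenue
from (
  select 100 as premium_account, 200 as premium_group
)
unpivot
(
  revenue for product_type in (premium_account, premium_group)
)
;
```

This yields one row per original column: ('premium_account', 100) and ('premium_group', 200).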
13. First_value / last_value
This is another useful function that helps to get a delta for each row relative to the first/last value in that particular partition.
with top_spenders as (
select 1 as user_id, 100 as total_spend, 11 as reputation_level union all
select 2 as user_id, 150 as total_spend, 11 as reputation_level union all
select 3 as user_id, 250 as total_spend, 11 as reputation_level union all
select 11 as user_id, 1000 as total_spend, 22 as reputation_level union all
select 22 as user_id, 1500 as total_spend, 22 as reputation_level union all
select 33 as user_id, 2500 as total_spend, 22 as reputation_level
)
, data as (
select
user_id
,total_spend
,reputation_level
,first_value(total_spend)
over (partition by reputation_level order by total_spend desc
rows between unbounded preceding and unbounded following) as top_spend
from top_spenders
)
select
user_id
,reputation_level
,total_spend
,top_spend as top_spend_by_rep_level
,total_spend - top_spend as delta_in_usd
from data
;
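last_value works the same way: with the same frame clause (unbounded preceding to unbounded following) it returns the other end of each partition, i.e. the lowest spend here. A sketch on the same kind of mock data:

```sql
with top_spenders as (
  select 1 as user_id, 100 as total_spend, 11 as reputation_level union all
  select 2 as user_id, 150 as total_spend, 11 as reputation_level union all
  select 11 as user_id, 1000 as total_spend, 22 as reputation_level union all
  select 22 as user_id, 1500 as total_spend, 22 as reputation_level
)
select
  user_id
  ,reputation_level
  ,total_spend
  ,last_value(total_spend)
     over (partition by reputation_level order by total_spend desc
           rows between unbounded preceding and unbounded following) as bottom_spend
from top_spenders
;
```

Note the explicit frame clause: without it, last_value would use the default frame ending at the current row and simply return each row's own value.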
This is very useful when you need to apply a user-defined function (UDF) with some complex logic to each row or to a table. You can always consider your table as an array of TYPE STRUCT objects and then pass each one of them to a UDF. It depends on your logic. For example, I use it to calculate purchase expire times:
select
target_id
,product_id
,product_type_id
,production.purchase_summary_udf(
ARRAY_AGG(
STRUCT(
target_id
, user_id
, product_type_id
, product_id
, item_count
, days
, expire_time_after_purchase
, transaction_id
, purchase_created_at
, updated_at
)
order by purchase_created_at
)
) AS processed
from new_batch
;
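production.purchase_summary_udf above is the author's own function. Purely as an illustration, a simplified SQL UDF over an array of structs could look like this (the field list and the expiry logic here are hypothetical):

```sql
-- Hypothetical sketch: derive an expiry timestamp for each purchase
-- by adding its `days` duration to its creation timestamp.
create temp function purchase_summary_udf(
  purchases array<struct<product_id int64, days int64, purchase_created_at timestamp>>
)
returns array<struct<product_id int64, expire_time_after_purchase timestamp>>
as ((
  select array_agg(
    struct(
      p.product_id
      , timestamp_add(p.purchase_created_at, interval p.days day) as expire_time_after_purchase
    )
  )
  from unnest(purchases) p
));
```

The real UDF would carry whatever business logic your purchase model needs; the point is the shape: an ARRAY of STRUCTs in, an ARRAY of STRUCTs out.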
In a similar way you can create tables without using UNION ALL. For example, I use it to mock some test data for unit tests. You can do this really fast using Alt+Shift+Down in your editor.
Marketing funnels are a good example. Your dataset might contain continuously repeating events of the same type, but ideally you want to chain each event with the next event of a different type. This can be useful when you need to get a list of something, i.e. events, purchases, etc., in order to build a funnels dataset. Working with PARTITION BY gives you the opportunity to group all the following events, no matter how many of them exist in each partition.
with d as (
select * from unnest([
struct('0003f' as user_pseudo_id, 12322175 as user_id, timestamp '2020-10-10 16:46:59.878 UTC' as event_timestamp, 'join_group' as event_name),
('0003',12,timestamp '2022-10-10 16:50:03.394 UTC','set_avatar'),
('0003',12,timestamp '2022-10-10 17:02:38.632 UTC','set_avatar'),
('0003',12,timestamp '2022-10-10 17:09:38.645 UTC','set_avatar'),
('0003',12,timestamp '2022-10-10 17:10:38.645 UTC','join_group'),
('0003',12,timestamp '2022-10-10 17:15:38.645 UTC','create_group'),
('0003',12,timestamp '2022-10-10 17:17:38.645 UTC','create_group'),
('0003',12,timestamp '2022-10-10 17:18:38.645 UTC','in_app_purchase'),
('0003',12,timestamp '2022-10-10 17:19:38.645 UTC','spend_virtual_currency'),
('0003',12,timestamp '2022-10-10 17:19:45.645 UTC','create_group'),
('0003',12,timestamp '2022-10-10 17:20:38.645 UTC','set_avatar')
]
) as t)
, event_data as (
SELECT
user_pseudo_id
, user_id
, event_timestamp
, event_name
, ARRAY_AGG(
STRUCT(
event_name AS event_name
, event_timestamp AS event_timestamp
)
)
OVER(PARTITION BY user_pseudo_id ORDER BY event_timestamp ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING ) as next_events
FROM d
WHERE
DATE(event_timestamp) = "2022-10-10"
)
select
user_pseudo_id
, user_id
, event_timestamp
, event_name
, (SELECT
event_name FROM UNNEST(next_events) next_event
WHERE t.event_name != event_name
ORDER BY event_timestamp LIMIT 1
-- change to ORDER BY event_timestamp desc if prev event needed
) next_event
, (SELECT
event_timestamp FROM UNNEST(next_events) next_event
WHERE t.event_name != event_name
ORDER BY event_timestamp LIMIT 1
-- change to ORDER BY event_timestamp desc if prev event needed
) next_event_ts
from event_data t
;
You would use it if you need to extract something from unstructured data, i.e. fx rates, custom groupings, etc.
Consider this example with exchange rates data:
-- One or more digits (\d+), optional period (\.?), zero or more digits (\d*).
with object as
(select '{"aed":3.6732,"afn":78.45934,"all":110.586428}' as rates)
, data as (
select "usd" as base_currency,
regexp_extract_all(rates, r'"[^"]+":\d+\.?\d*') as pair
from object
)
, splits as (
select base_currency, pair, split(pair, ':') positions
from data cross join unnest (pair) as pair
)
select base_currency, pair, positions[offset(0)] as rate_currency, positions[offset(1)] as rate
from splits
;
Sometimes you might want to use regexp to get the major, release, or mod version of your app and create custom reports:
with events as (
select 'open_chat' as event_name, '10.1.0' as app_display_version union all
select 'open_chat' as event_name, '10.1.9' as app_display_version union all
select 'open_chat' as event_name, '9.1.4' as app_display_version union all
select 'open_chat' as event_name, '9.0.0' as app_display_version
)
select
app_display_version
,REGEXP_EXTRACT(app_display_version, '^[^.^]*') main_version
,safe_cast(REGEXP_EXTRACT(app_display_version, '[0-9]+.[0-9]+') as float64) release_version
,safe_cast(REGEXP_EXTRACT(app_display_version, r"^[a-zA-Z0-9_.+-]+.[a-zA-Z0-9-]+\.([a-zA-Z0-9-.]+$)") as int64) as mod_version
from events
;
SQL is a powerful tool that helps to manipulate data. Hopefully these SQL use cases from digital marketing will be useful to you; they can help with many projects. These SQL snippets made my life a lot easier, and I use them at work almost every day. Moreover, SQL and modern data warehouses are essential tools for data science. Their robust dialect features allow modeling and visualizing data with ease. And because SQL is the language that data warehouse and business intelligence professionals speak, it's an excellent choice if you want to share data with them.