我想知道如何最好地在BigQuery standardSQL (而不是旧版)中做一个翻滚和滑动的窗口。
Apache使这非常容易--对于长度为10分钟的‘滑动’窗口,每5分钟滑动一次,groupBy(“10分钟”,"5分钟“)和一个翻滚的窗口groupBy(”10分钟“)。
考虑到我有一个简化的订单行,其中包括:
orderId,
orderPlacedTimestamp,
orderTotals.grandTotalNet,
orderTotals.grandTotalGross (请注意总数上的结构)
我需要两个总订单销售价值如下:
我是从使用SQL over开始的,但在分区、时间戳和获得正确的开始和结束窗口方面有点混乱。对于每个结果集,应该是:
windowStartTime,
windowEndTime,
windowTotalAmount因此,窗口具有起始和结束时间(这对于滑动至关重要,因为窗口10的长度与滑动持续时间5不同,因此给定的行可以在多个窗口中)。
我如何在BigQuery中做到这一点?
使用当前SQL更新11/06:
SELECT
TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min*5*60) OVER(w)) as startWindowTime,
TIMESTAMP_SECONDS(LAST_VALUE(ts_5min*5*60) OVER(w)) as endWindowTime,
SUM(orderTotalNetConverted) OVER(ORDER BY ts_5min RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) as windowSalesTotal
FROM (
SELECT
CAST(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE))/60/5 AS INT64)
AS ts_5min,
orderTotalNetConverted
FROM orders
)
WINDOW w AS (ORDER BY ts_5min RANGE BETWEEN 1 PRECEDING AND CURRENT ROW)
ORDER BY startWindowTime desc发布于 2017-06-10 16:05:10
滚动窗口更容易完成,它只是一个正常的每小时一次的GROUP BY:
SELECT
TIMESTAMP_TRUNC(orderPlacedTimestamp, HOUR),
SUM(orderTotals.grandTotalNet)
FROM T
GROUP BY 1对于滑动窗口,我首先使用以下方法将时间戳标准化为5分钟一次:
TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE)到达极小边界UNIX_SECONDS转换为自纪元以来的秒数CAST(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE))/60/5 AS INT64)
现在您可以使用标准的OVER()子句来获得10分钟窗口,这意味着一次有两个这样的间隔,为了获得开始时间,可以使用FIRST_VALUE解析函数:
SELECT
orderId,
TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min*5*60) OVER(w)) startWindowTime,
TIMESTAMP_ADD(TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min*5*60) OVER(w)),
INTERVAL 10 MINUTE) endWindowTime,
SUM(grandTotalNet) OVER(w)
FROM (
SELECT
*,
CAST(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE))/60/5 AS INT64)
AS ts_5min
FROM t
)
WINDOW w AS (ORDER BY ts_5min RANGE BETWEEN 1 PRECEDING AND CURRENT ROW)发布于 2017-06-10 16:55:17
下面是用于sliding 10 MINUTE / 5 MINUTE window的
#standardSQL
WITH starts AS (
SELECT TIMESTAMP_ADD(TIMESTAMP_TRUNC(first, HOUR), INTERVAL step MINUTE) AS start
FROM
(SELECT MIN(orderPlacedTimestamp) AS first, MAX(orderPlacedTimestamp) AS last FROM YourTable),
UNNEST(GENERATE_ARRAY(0, TIMESTAMP_DIFF(last, TIMESTAMP_TRUNC(first, HOUR), MINUTE) , 5)) AS step
)
SELECT
start,
SUM(orderTotals.grandTotalNet) AS net,
SUM(orderTotals.grandTotalGross) AS gross
FROM starts AS s JOIN YourTable AS t
ON t.orderPlacedTimestamp BETWEEN s.start AND TIMESTAMP_ADD(start, INTERVAL 10 MINUTE)
GROUP BY start
ORDER BY start 正如Mosha在他的回答中所提到的,每小时滚动窗口很容易,但是如果您需要不同的窗口侧,则上述方法更灵活(我认为),并且对于任何窗口大小都很容易调整,可以通过在查询代码中分别更改5和10来输入。
但是总的来说,想要提到的是,使用解析函数比加入更理想。
https://stackoverflow.com/questions/44472364
复制相似问题