需求背景:
在理财 APP 中,素材、广告位、产品、策略有时候是多对多的关系。比如,在内容中台,一个素材可能关联理财、基金、存款某些产品,那我们统计该素材的好不好,转化率好不好,该归属于哪些业务?再进而计算某些业务的贡献,就可能需要用到数组。
还是不怎么看文档,因为文档的例子不够直观。
在https://community.cloud.databricks.com/ 上创建表的方法,可以参考文档,https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#examples
创表
CREATE TABLE student (name STRING, courses STRING)
插入数据
INSERT INTO student (name, courses) VALUES ('Alice', 'Math'), ('Alice', 'Science'), ('Bob', 'English'), ('Bob', 'History'), ('Bob', 'Art'), ('Charlie', 'Math'), ('Charlie', 'Art'), ('David', 'Science'), ('Emma', 'Math'), ('Emma', 'English'), ('Emma', 'Science');
数据如下:
name | courses |
---|---|
Alice | Math |
Alice | Science |
Bob | English |
Bob | History |
Bob | Art |
Charlie | Math |
Charlie | Art |
David | Science |
Emma | Math |
Emma | English |
Emma | Science |
现在需要将 student 的数据变成一个学生一行数据,分别将课程拼接成字符串和组成数组类型。代码如下:
select
name
,concat_ws(', ', collect_list(courses)) as courses
from
student
group by
name
name | courses |
---|---|
Charlie | Math, Art |
Bob | English, History, Art |
Alice | Math, Science |
Emma | Math, English, Science |
David | Science |
select
name,array_agg(courses) as courses
from
student
group by
name;
name | courses |
---|---|
Charlie | ["Math","Art"] |
Bob | ["English","History","Art"] |
Alice | ["Math","Science"] |
Emma | ["Math","English","Science"] |
David | ["Science"] |
select name, collect_list(courses) as courses
from student
group by name
name | courses |
---|---|
Charlie | ["Math","Art"] |
Bob | ["English","History","Art"] |
Alice | ["Math","Science"] |
Emma | ["Math","English","Science"] |
David | ["Science"] |
-- chat GPT 说这样也可以,但是我选择的版本不支持。
-- STRING_AGG 函数是 SQL:2016 标准中新增的函数,不是所有的数据库管理系统都支持该函数。
-- Spark 3.0 中,STRING_AGG 函数被引入作为 SQL:2016 标准的一部分。你可以使用 STRING_AGG 函数将每个分组的数据拼接成一个字符串。
select name, string_agg(courses, ',') as courses
from student
group by name;
其实我先是在 Excel 中自己弄成了 ,结果没有注意,courses2是字符串类型。而修改字段类型比较麻烦。表名是temp。
name | courses2 | courses |
---|---|---|
Alice | ['Math', 'Science'] | Math, Science |
Bob | ['English', 'History', 'Art'] | English, History, Art |
Charlie | ['Math', 'Art'] | Math, Art |
David | ['Science'] | Science |
Emma | ['Math', 'English', 'Science'] | Math, English,Science |
select
name
,array(collect_list(courses)) as courses
from
student
group by
name
name | courses |
---|---|
Charlie | [["Math","Art"]] |
Bob | [["English","History","Art"]] |
Alice | [["Math","Science"]] |
Emma | [["Math","English","Science"]] |
David | [["Science"]] |
如果本身表就是踩坑1的,表名是 temp,对字符串类型的courses展开,变成一行数据是每一个同学和一个科目。
select
name
,explode(split(courses, ', ')) as course
from
temp
name | course |
---|---|
Alice | Math |
Alice | Science |
Bob | English |
Bob | History |
Bob | Art |
Charlie | Math |
Charlie | Art |
David | Science |
Emma | Math |
Emma | English,Science |
如果对数组类型的courses展开(需求1中第二段代码),变成一行数据是每一个同学和一个科目。
select name,course
from
(
select
name
,array_agg(courses) as courses
from
student
group by
name
)
lateral view explode(courses) exploded_courses as course;
name | course |
---|---|
Charlie | Math |
Charlie | Art |
Bob | English |
Bob | History |
Bob | Art |
Alice | Math |
Alice | Science |
Emma | Math |
Emma | English |
Emma | Science |
David | Science |
需要统计每门课有多少同学选修?数据是学生和课程组合起来的。
select
course
,count(distinct name) as student_count
from
(
-- 踩坑1中的 temp 表,数据如需求2。
select
name,explode(split(courses, ', ')) as course
from
temp
) subquery
-- where course in ('math', 'english')
group by course;
course | student_count |
---|---|
Science | 2 |
Art | 2 |
Math | 3 |
English | 1 |
History | 1 |
English,Science | 1 |
select course,count(distinct name) as student_count
from
(
select
name
,explode(courses) as course
from
(
select
name
,array_agg(courses) as courses
from
student
group by
name
)
) as temp
group by course;
course | student_count |
---|---|
Science | 3 |
Art | 2 |
Math | 3 |
English | 2 |
History | 1 |
直接在数组类型的courses,查询选修数据的同学的所有选修课程,结果中的选修课程是数组类型
-- 创建表的第二种形式,student_copy 是
create table student_copy as
select name, collect_list(courses) as courses
from student
group by name
name | courses |
---|---|
Charlie | ["Math","Art"] |
Bob | ["English","History","Art"] |
Alice | ["Math","Science"] |
Emma | ["Math","English","Science"] |
David | ["Science"] |
select * from student_copy where array_contains(courses, 'math')
name | courses |
---|---|
Charlie | ["Math","Art"] |
Alice | ["Math","Science"] |
Emma | ["Math","English","Science"] |
直接在数组类型的courses,查询有哪些不同的课程?
select array_distinct(courses) from student_copy
-- array_distinct(x )→ x:去重:删除数组x中重复元素。 仅对一行的数组去重,不是对整个数组字段去重
-- https://docs.ucloud.cn/usql/common/func/array
array_distinct(courses) |
---|
["Math","Art"] |
["English","History","Art"] |
["Math","Science"] |
["Math","English","Science"] |
["Science"] |
逐步靠近答案。
select collect_set(course) as courses
from (
select explode(courses) as course
from student_copy
) subquery;
courses |
---|
["Math","History","Art","Science","English"] |
select concat_ws('\n', collect_set(course)) as courses
from (
select explode(courses) as course
from student_copy
) subquery;
courses |
---|
Math History Art Science English |
select course
from (
select explode(courses) as course
from student_copy
) subquery
group by course;
course |
---|
Science |
Art |
Math |
English |
History |
一张表是同学上学期所学课程的数据,另外一种表是下学期的课程,来看下所有同学一个学期共学习了什么课程。
select
t1.name,array_append(t1.courses,t2.courses) as courses
from
student_copy as t1
left join
(SELECT * FROM student_copy WHERE array_contains(courses, 'Math')) as t2
on t1.name = t2.name
Error in SQL statement: AnalysisException: [DATATYPE_MISMATCH.ARRAY_FUNCTION_DIFF_TYPES] Cannot resolve "array_append(courses, courses)" due to data type mismatch: 错误在SQL语句:分析异常: [DATATYPE_MISMATCH.ARRAY_FUNCTION_DIFF_TYPESJ由于数据类型不匹配,无法解析array_append(课程、课程)
select
t1.name,array_append(t1.courses,t2.courses) as courses
from
student_copy as t1
left join
( SELECT name, courses FROM temp) as t2
on t1.name = t2.name
name | courses |
---|---|
Charlie | ["Math","Art","Math, Art"] |
Bob | ["English","History","Art","English, History, Art"] |
Alice | ["Math","Science","Math, Science"] |
Emma | ["Math","English","Science","Math, English,Science"] |
David | ["Science","Science"] |
但是没有去重,发现 "Math" ,"Art" ,"Math, Art" 不同。
select
a.name,array_distinct(a.courses) as courses
from
(
select
t1.name,array_append(t1.courses,t2.courses) as courses
from
student_copy as t1
left join
( SELECT name, courses FROM temp) as t2
on t1.name = t2.name
) as a
name | array_distinct(courses) |
---|---|
Charlie | ["Math","Art","Math, Art"] |
Bob | ["English","History","Art","English, History, Art"] |
Alice | ["Math","Science","Math, Science"] |
Emma | ["Math","English","Science","Math, English,Science"] |
David | ["Science"] |
如何排序呢?
select
t1.name,array_sort(t1.courses) as courses
from
(
select
name,array_agg(courses) as courses
from
students
group by
name
) as t1
t1的数据是:
name | courses |
---|---|
Charlie | ["Math","Art"] |
Bob | ["English","History","Art"] |
Alice | ["Math","Science"] |
Emma | ["Math","English","Science"] |
David | ["Science"] |
查出来的数据:
name | courses |
---|---|
Charlie | ["Art","Math"] |
Bob | ["Art","English","History"] |
Alice | ["Math","Science"] |
Emma | ["English","Math","Science"] |
David | ["Science"] |
总结:
基本总结了数组的合并展开,分组统计。
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。