Cloud Database Advanced, Part 1: Aggregation Operations

大帅老猿

Published 2022-04-13 19:54:50


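Aggregation operations in the cloud database

Sometimes we need to analyze data, for example to compute statistics or to join collections. Simple queries cannot handle these requirements, so aggregation operations are used instead.

Getting an aggregation instance for a collection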

db.collection('scores').aggregate();

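Note: when using aggregation in a cloud function, never reuse an aggregate instance; doing so easily causes bugs.

Both of the following patterns are wrong.

Wrong example 1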

const db = uniCloud.database()
const collection = db.collection('test')
const aggregate = collection.aggregate() // When the cloud function instance is reused, this aggregate instance is reused too, which causes bugs

exports.main = async function(){
  const res = await aggregate.match({a:1}).end()
  return {res}
}

Wrong example 2

const db = uniCloud.database()
const collection = db.collection('test')

exports.main = async function(){
  const aggregate = collection.aggregate() // This aggregate instance is used for two separate requests, which causes bugs
  const res1 = await aggregate.match({a:1}).end()
  const res2 = await aggregate.match({a:2}).end()
  return {res1, res2}
}
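
To see why reuse goes wrong, here is a minimal local sketch. FakeAggregate is a hypothetical stand-in written for this illustration, not the uniCloud implementation; it assumes that each stage call appends to state held by the instance, which is consistent with the warning above but is not taken from the SDK source.

```javascript
// Hypothetical stand-in for an aggregate instance (NOT the uniCloud SDK).
// Assumption: every stage call appends to a list kept on the instance.
class FakeAggregate {
  constructor() {
    this.stages = []
  }
  match(condition) {
    this.stages.push({ match: condition })
    return this
  }
  // The real end() is asynchronous and returns query results; this one
  // simply returns a copy of the accumulated stages so they can be inspected.
  end() {
    return this.stages.slice()
  }
}

const fakeCollection = { aggregate: () => new FakeAggregate() }

// Wrong: one instance shared by two queries, so the second query
// still carries the first query's match stage.
const shared = fakeCollection.aggregate()
const sharedFirst = shared.match({ a: 1 }).end()
const sharedSecond = shared.match({ a: 2 }).end()

// Right: call aggregate() again for every query.
const freshFirst = fakeCollection.aggregate().match({ a: 1 }).end()
const freshSecond = fakeCollection.aggregate().match({ a: 2 }).end()

console.log(sharedSecond.length, freshSecond.length)
```

In a cloud function, the safe pattern is therefore to call collection.aggregate() inside exports.main, once for every query.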

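Aggregation operations

A series of stage calls on an aggregation instance is called an aggregation operation. .end() marks the end of the chain and returns the data.

let res = await db.collection('articles')
  .aggregate()
  ... // aggregation stage
  ... // aggregation stage
  ... // aggregation stage
  .end();

Some aggregation stages resemble the basic cloud database operations but go by different names. By now everyone is familiar with the basic cloud database operations.

Below I introduce some commonly used aggregation stages and, where one exists, name the basic operation with similar functionality to make them easier to understand.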

project

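Passes the specified fields of the collection on to the next aggregation stage. A specified field can be an existing field or a newly computed one.

Similar to field() in the basic operations.

Suppose we have an articles collection containing the following document: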

{
  "_id": 666,
  "title": "This is title",
  "author": "Nobody",
  "isbn": "123456789",
  "introduction": "......"
}

Including specific fields

The following code uses project so that the output contains only the _id, title, and author fields:

let res = await db.collection('articles')
  .aggregate()
  .project({
    title: 1,
    author: 1
  })
  .end();

Output:

{ "_id" : 666, "title" : "This is title", "author" : "Nobody" }

Removing the _id field from the output

_id is included in the output by default. If you do not want it, exclude it explicitly:

let res = await db.collection('articles')
  .aggregate()
  .project({
    _id: 0,  // exclude the _id field
    title: 1,
    author: 1
  })
  .end()

Output:

{ "title" : "This is title", "author" : "Nobody" }

Removing a field other than _id

We can also exclude a field other than _id; all other fields are then output:

let res = await db.collection('articles')
  .aggregate()
  .project({
    isbn: 0,  // exclude the isbn field
  })
  .end()

The output is as follows; compared with the input, the isbn field is gone:

{
  "_id" : 666,
  "title" : "This is title",
  "author" : "Nobody",
  "introduction": "......"
}

Adding a computed field

Suppose we have a students collection containing the following document:

{
  "_id": 1,
  "name": "小明",
  "scores": {
    "chinese": 80,
    "math": 90,
    "english": 70
  }
}

In the following code, we use project to add a new field, totalScore, to the output:

const { sum } = db.command.aggregate
let res = await db.collection('students')
  .aggregate()
  .project({
    _id: 0,
    name: 1,
    totalScore: sum([
      "$scores.chinese",
      "$scores.math",
      "$scores.english"
    ])
  })
  .end()

Output:

{ "name": "小明", "totalScore": 240 }
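
To make the computed-field projection concrete, here is a plain JavaScript sketch of what this project stage works out for each record. It runs locally on an in-memory array and does not touch the database; projectTotalScore is a name made up for this illustration.

```javascript
// In-memory copy of the sample students collection.
const students = [
  { _id: 1, name: '小明', scores: { chinese: 80, math: 90, english: 70 } }
]

// Local equivalent of project({ _id: 0, name: 1, totalScore: sum([...]) }):
// keep name, drop everything else, and add the computed totalScore.
function projectTotalScore(record) {
  const { chinese, math, english } = record.scores
  return { name: record.name, totalScore: chinese + math + english }
}

const projected = students.map(projectTotalScore)
console.log(projected)
```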

Adding a new array field

Suppose we have a points collection containing the following documents:

{ "_id": 1, "x": 1, "y": 1 }
{ "_id": 2, "x": 2, "y": 2 }
{ "_id": 3, "x": 3, "y": 3 }

In the following code, we use project to put the x and y fields into a new array field, coordinate:

let res = await db.collection('points')
  .aggregate()
  .project({
    coordinate: ["$x", "$y"]
  })
  .end()

Output:

{ "_id": 1, "coordinate": [1, 1] }
{ "_id": 2, "coordinate": [2, 2] }
{ "_id": 3, "coordinate": [3, 3] }

addFields

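Adds new fields to the output records. After an addFields stage, every output record contains the fields it had on input plus the fields specified by addFields.

Notes:

addFields is equivalent to a project stage that specifies all existing fields together with the new ones.
addFields can specify multiple new fields; the value of each is determined by its expression.
If a new field has the same name as an existing field, the new value overwrites the existing one.
addFields cannot be used to append elements to an array field.

Applying addFields twice in a row

Suppose the scores collection contains the following records: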

{
  _id: 1,
  student: "Maya",
  homework: [ 10, 5, 10 ],
  quiz: [ 10, 8 ],
  extraCredit: 0
},
{
  _id: 2,
  student: "Ryan",
  homework: [ 5, 6, 5 ],
  quiz: [ 8, 8 ],
  extraCredit: 8
}

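Apply addFields twice: the first call adds two fields holding the sums of homework and quiz, and the second adds a field that sums those two totals together with extraCredit.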

const $ = db.command.aggregate

let res = await db.collection('scores').aggregate()
  .addFields({
    totalHomework: $.sum('$homework'),
    totalQuiz: $.sum('$quiz')
  })
  .addFields({
    totalScore: $.add(['$totalHomework', '$totalQuiz', '$extraCredit'])
  })
  .end()

The result is as follows:

{
  "_id" : 1,
  "student" : "Maya",
  "homework" : [ 10, 5, 10 ],
  "quiz" : [ 10, 8 ],
  "extraCredit" : 0,
  "totalHomework" : 25,
  "totalQuiz" : 18,
  "totalScore" : 43
},
{
  "_id" : 2,
  "student" : "Ryan",
  "homework" : [ 5, 6, 5 ],
  "quiz" : [ 8, 8 ],
  "extraCredit" : 8,
  "totalHomework" : 16,
  "totalQuiz" : 16,
  "totalScore" : 40
}
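
The same two steps can be sketched in plain JavaScript to show how the second stage reads the fields produced by the first. This is a local illustration on an in-memory array, not the database's implementation; sumOf is a helper introduced here.

```javascript
const scoreRecords = [
  { _id: 1, student: 'Maya', homework: [10, 5, 10], quiz: [10, 8], extraCredit: 0 },
  { _id: 2, student: 'Ryan', homework: [5, 6, 5], quiz: [8, 8], extraCredit: 8 }
]

const sumOf = (numbers) => numbers.reduce((total, n) => total + n, 0)

// First addFields: keep every existing field and add the two sums.
const withTotals = scoreRecords.map((record) => ({
  ...record,
  totalHomework: sumOf(record.homework),
  totalQuiz: sumOf(record.quiz)
}))

// Second addFields: it can read the fields added by the first stage.
const withTotalScore = withTotals.map((record) => ({
  ...record,
  totalScore: record.totalHomework + record.totalQuiz + record.extraCredit
}))

console.log(withTotalScore.map((record) => record.totalScore))
```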

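Adding a field inside a nested record

Dot notation can be used to add a field inside a nested record. Suppose the vehicles collection contains the following records: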

{ _id: 1, type: "car", specs: { doors: 4, wheels: 4 } }
{ _id: 2, type: "motorcycle", specs: { doors: 0, wheels: 2 } }
{ _id: 3, type: "jet ski" }

The following operation adds a new field, fuel_type, under the specs field and sets it to the fixed string unleaded for every record:

let res = await db.collection('vehicles').aggregate()
  .addFields({
    'specs.fuel_type': 'unleaded'
  })
  .end()

The result is as follows:

{ _id: 1, type: "car", specs: { doors: 4, wheels: 4, fuel_type: "unleaded" } },
{ _id: 2, type: "motorcycle", specs: { doors: 0, wheels: 2, fuel_type: "unleaded" } },
{ _id: 3, type: "jet ski", specs: { fuel_type: "unleaded" } }

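Setting a field to the value of another field

To set a field to the value of another field, use a string made of $ followed by the field name as the value expression.

Continuing with the records from the previous example (the output below assumes they already carry the fuel_type field added there), the following operation adds a vehicle_type field whose value is the value of the type field: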

let res = await db.collection('vehicles').aggregate()
  .addFields({
    vehicle_type: '$type'
  })
  .end()

The result is as follows:

{ _id: 1, type: "car", vehicle_type: "car",  specs: { doors: 4, wheels: 4, fuel_type: "unleaded" } },
{ _id: 2, type: "motorcycle", vehicle_type: "motorcycle",  specs: { doors: 0, wheels: 2, fuel_type: "unleaded" } },
{ _id: 3, type: "jet ski", vehicle_type: "jet ski",  specs: { fuel_type: "unleaded" } }

sample

Randomly selects the specified number of records from the collection.

sample({
    size: 10 // randomly pick 10 records
})

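This is a handy and practical aggregation stage. For example, to draw three random questions from an interview question bank, this stage does the job directly. Note, however, that calling it frequently on a collection with a large amount of data may lead to slow responses.

Suppose the users collection contains the following records: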

{ "name": "张三" }
{ "name": "李四" }

Random selection

Suppose we are running a prize draw and need to pick one lucky user. sample is then called as follows:

let res = await db.collection('users')
  .aggregate()
  .sample({
    size: 1
  })
  .end()

It returns the record of one randomly selected user, for example:

{ "_id": "696529e4-7e82-4e7f-812e-5144714edff6", "name": "李四" }
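
For intuition, picking records at random without repeats can be sketched locally with a partial Fisher-Yates shuffle. How the database itself implements sample is not described here, so treat this purely as an illustration; sampleRecords is a hypothetical helper name.

```javascript
// Pick `size` distinct records at random using a partial Fisher-Yates shuffle.
function sampleRecords(records, size) {
  const pool = records.slice() // copy so the input is left untouched
  const count = Math.min(size, pool.length)
  for (let i = 0; i < count; i++) {
    // Choose a random index from the not-yet-selected part of the pool.
    const j = i + Math.floor(Math.random() * (pool.length - i))
    const picked = pool[j]
    pool[j] = pool[i]
    pool[i] = picked
  }
  return pool.slice(0, count)
}

const users = [{ name: '张三' }, { name: '李四' }]
const winner = sampleRecords(users, 1)
console.log(winner)
```

Asking for more records than exist simply returns all of them in random order.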

skip

Skips the specified number of records and outputs the rest.

Similar to skip() in the basic operations.

let res = await db.collection('users')
  .aggregate()
  .skip(5)
  .end()

This code skips the first 5 records found and outputs the remaining ones.

geoNear

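Outputs records ordered from nearest to farthest from a given point.

Property	Type	Required	Description
near	GeoPoint	Yes	GeoJSON Point; the point from which distances are measured
spherical	true	Yes	Required; the value must be true
maxDistance	number	No	Maximum distance
minDistance	number	No	Minimum distance
query	Object	No	Records must also satisfy this condition (same syntax as where)
distanceMultiplier	number	No	The distance is multiplied by this number in the result
distanceField	string	Yes	Name of the output field that stores the distance; dot notation can be used for a nested field
includeLocs	string	No	Names the field to use for the distance calculation; useful when a record has several geolocation fields
key	string	No	Selects the geolocation index to use. If the collection has several geolocation indexes, one must be specified by naming its field

Notes:

geoNear must be the first aggregation stage.
A geolocation index is required. If there are several, the key parameter must specify which one to use.

Suppose the attractions collection contains the following records: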

{
  "_id": "geoNear.0",
  "city": "Guangzhou",
  "docType": "geoNear",
  "location": {
    "type": "Point",
    "coordinates": [
      113.30593,
      23.1361155
    ]
  },
  "name": "Canton Tower"
},
{
  "_id": "geoNear.1",
  "city": "Guangzhou",
  "docType": "geoNear",
  "location": {
    "type": "Point",
    "coordinates": [
      113.306789,
      23.1564721
    ]
  },
  "name": "Baiyun Mountain"
},
{
  "_id": "geoNear.2",
  "city": "Beijing",
  "docType": "geoNear",
  "location": {
    "type": "Point",
    "coordinates": [
      116.3949659,
      39.9163447
    ]
  },
  "name": "The Palace Museum"
},
{
  "_id": "geoNear.3",
  "city": "Beijing",
  "docType": "geoNear",
  "location": {
    "type": "Point",
    "coordinates": [
      116.2328567,
      40.242373
    ]
  },
  "name": "Great Wall"
}

Using the geoNear stage:

const $ = db.command.aggregate
let res = await db.collection('attractions').aggregate()
  .geoNear({
    distanceField: 'distance', // the distance field of each output record holds the distance to the given point
    spherical: true,
    near: new db.Geo.Point(113.3089506, 23.0968251),
    query: {
      docType: 'geoNear',
    },
    key: 'location', // can be omitted if location is the only field with a geolocation index
    includeLocs: 'location', // can be omitted if location is the only geolocation field
  })
  .end()

The result is as follows:

{
  "_id": "geoNear.0",
  "location": {
    "type": "Point",
    "coordinates": [
      113.30593,
      23.1361155
    ]
  },
  "docType": "geoNear",
  "name": "Canton Tower",
  "city": "Guangzhou",
  "distance": 4384.68131486958
},
{
  "_id": "geoNear.1",
  "city": "Guangzhou",
  "location": {
    "type": "Point",
    "coordinates": [
      113.306789,
      23.1564721
    ]
  },
  "docType": "geoNear",
  "name": "Baiyun Mountain",
  "distance": 6643.521654040738
},
{
  "_id": "geoNear.2",
  "docType": "geoNear",
  "name": "The Palace Museum",
  "city": "Beijing",
  "location": {
    "coordinates": [
      116.3949659,
      39.9163447
    ],
    "type": "Point"
  },
  "distance": 1894750.4414538583
},
{
  "_id": "geoNear.3",
  "docType": "geoNear",
  "name": "Great Wall",
  "city": "Beijing",
  "location": {
    "type": "Point",
    "coordinates": [
      116.2328567,
      40.242373
    ]
  },
  "distance": 1928300.3308822548
}
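
To get a feel for where the distance values come from, the sketch below computes great-circle distances locally with the haversine formula and sorts by them. Both the formula and the Earth radius of 6378.1 km are assumptions chosen for this illustration, not something stated by the database documentation; haversineMeters is a name made up here.

```javascript
// Great-circle distance in meters between two [longitude, latitude] points.
// Assumption: Earth radius of 6378.1 km, chosen for this sketch.
const EARTH_RADIUS_METERS = 6378100

function haversineMeters([lng1, lat1], [lng2, lat2]) {
  const toRadians = (degrees) => (degrees * Math.PI) / 180
  const dLat = toRadians(lat2 - lat1)
  const dLng = toRadians(lng2 - lng1)
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLng / 2) ** 2
  return 2 * EARTH_RADIUS_METERS * Math.asin(Math.sqrt(a))
}

const origin = [113.3089506, 23.0968251]
const attractions = [
  { name: 'Baiyun Mountain', coordinates: [113.306789, 23.1564721] },
  { name: 'The Palace Museum', coordinates: [116.3949659, 39.9163447] },
  { name: 'Canton Tower', coordinates: [113.30593, 23.1361155] }
]

// Attach the distance and order from nearest to farthest, as geoNear does.
const byDistance = attractions
  .map((place) => ({ ...place, distance: haversineMeters(origin, place.coordinates) }))
  .sort((a, b) => a.distance - b.distance)

console.log(byDistance.map((place) => [place.name, Math.round(place.distance)]))
```

For the two Guangzhou records, this comes out close to the distance values in the sample output above, which suggests the returned distance is in meters.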

group

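Similar to GROUP BY in SQL (with no accumulators, the effect resembles DISTINCT). Groups the input records by the given expression; each output record represents one group, and its _id is the key that distinguishes the groups. Output records can also include accumulated values: setting an output field to an accumulator computes that value over the group.

Note: the group stage has a 100 MB memory limit.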

group({
  _id: <expression>,
  <field1>: <accumulator1>,
  ...
  <fieldN>: <accumulatorN>
})

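The _id parameter is required; if it is a constant, there is only one group. The other fields are optional and are accumulated values computed with accumulators such as $.sum (where const $ = db.command.aggregate), though other expressions can also be used.

The accumulator must be one of the following operators:

Operator	Description
addToSet	Adds values to an array; if the value is already in the array, nothing happens
avg	Returns the average of the values of the specified field within a group
sum	Computes and returns the sum of all numeric values of a field within a group
first	Returns the value of the specified field in the first record of a group. Only meaningful when the group has been sorted in some defined order (sort).
last	Returns the value of the specified field in the last record of a group. Only meaningful when the group has been sorted in some defined order (sort).
max	Returns the maximum of a group of values
min	Returns the minimum of a group of values
push	In the group stage, returns an array made up of the values of the specified expression for every record in the group
stdDevPop	Returns the standard deviation of the values of a field within a group
stdDevSamp	Computes the sample standard deviation of the input values. If the input values represent the entire population, or are not meant to generalize to more data, use db.command.aggregate.stdDevPop instead
mergeObjects	Merges multiple documents into a single document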

Grouping by field value

Suppose the avatar collection contains the following records:

{
  _id: "1",
  alias: "john",
  region: "asia",
  scores: [40, 20, 80],
  coins: 100
},
{
  _id: "2",
  alias: "arthur",
  region: "europe",
  scores: [60, 90],
  coins: 20
},
{
  _id: "3",
  alias: "george",
  region: "europe",
  scores: [50, 70, 90],
  coins: 50
},
{
  _id: "4",
  alias: "john",
  region: "asia",
  scores: [30, 60, 100, 90],
  coins: 40
},
{
  _id: "5",
  alias: "george",
  region: "europe",
  scores: [20],
  coins: 60
},
{
  _id: "6",
  alias: "john",
  region: "asia",
  scores: [40, 80, 70],
  coins: 120
}

const $ = db.command.aggregate

let res = await db.collection('avatar').aggregate()
  .group({
    _id: '$alias',
    num: $.sum(1)
  })
  .end()

The result is as follows:

{
  "_id": "john",
  "num": 3
},
{
  "_id": "arthur",
  "num": 1
},
{
  "_id": "george",
  "num": 2
}
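
The counting done by this group stage can be sketched locally with a Map. This is only an in-memory illustration of the grouping logic, with the sample records reduced to the one field that matters here.

```javascript
// Only the field used for grouping is kept from the sample avatar records.
const aliases = ['john', 'arthur', 'george', 'john', 'george', 'john']

// Local equivalent of group({ _id: '$alias', num: $.sum(1) }):
// every record adds 1 to the counter of its group key.
const countsByAlias = new Map()
for (const alias of aliases) {
  countsByAlias.set(alias, (countsByAlias.get(alias) || 0) + 1)
}

const grouped = Array.from(countsByAlias, ([alias, num]) => ({ _id: alias, num }))
console.log(grouped)
```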

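Grouping by multiple values

Passing a record as _id groups by multiple values. Using the same sample data, group the records by region (region) and by their highest score (the maximum of scores), and compute the total coins (coins) of each group: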

const $ = db.command.aggregate

let res = await db.collection('avatar').aggregate()
  .group({
    _id: {
      region: '$region',
      maxScore: $.max('$scores')
    },
    totalCoins: $.sum('$coins')
  })
  .end()

The result is as follows:

{
  "_id": {
    "region": "asia",
    "maxScore": 80
  },
  "totalCoins": 220
},
{
  "_id": {
    "region": "asia",
    "maxScore": 100
  },
  "totalCoins": 40
},
{
  "_id": {
    "region": "europe",
    "maxScore": 90
  },
  "totalCoins": 70
},
{
  "_id": {
    "region": "europe",
    "maxScore": 20
  },
  "totalCoins": 60
}
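
Grouping by a composite key can be sketched locally by serializing the key object. The code below is an in-memory illustration only; using JSON.stringify as the Map key is a choice made for this sketch.

```javascript
// Sample avatar records, reduced to the fields used by this grouping.
const avatarRecords = [
  { region: 'asia', scores: [40, 20, 80], coins: 100 },
  { region: 'europe', scores: [60, 90], coins: 20 },
  { region: 'europe', scores: [50, 70, 90], coins: 50 },
  { region: 'asia', scores: [30, 60, 100, 90], coins: 40 },
  { region: 'europe', scores: [20], coins: 60 },
  { region: 'asia', scores: [40, 80, 70], coins: 120 }
]

const totals = new Map()
for (const record of avatarRecords) {
  // The group key is an object, so serialize it to use it as a Map key.
  const key = { region: record.region, maxScore: Math.max(...record.scores) }
  const serialized = JSON.stringify(key)
  const entry = totals.get(serialized) || { _id: key, totalCoins: 0 }
  entry.totalCoins += record.coins
  totals.set(serialized, entry)
}

const groupedByRegionAndMax = Array.from(totals.values())
console.log(groupedByRegionAndMax)
```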

match

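Filters documents by the given conditions and passes the matching documents on to the next pipeline stage.

Similar to where() in the basic operations.

The query conditions are the same as in ordinary queries, and ordinary query operators can be used. Note that, unlike other aggregation stages, the match stage cannot use aggregation operators, only query operators.

// Use a string directly
match({
  name: 'Tony Stark'
})

// Use an operator
const dbCmd = db.command

match({
  age: dbCmd.gt(18)
})

Suppose the articles collection contains the following records: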

{ "_id" : "1", "author" : "stark",  "score" : 80 }
{ "_id" : "2", "author" : "stark",  "score" : 85 }
{ "_id" : "3", "author" : "bob",    "score" : 60 }
{ "_id" : "4", "author" : "li",     "score" : 55 }
{ "_id" : "5", "author" : "jimmy",  "score" : 60 }
{ "_id" : "6", "author" : "li",     "score" : 94 }
{ "_id" : "7", "author" : "justan", "score" : 95 }

Matching

Here is a direct-match example:

let res = await db.collection('articles')
  .aggregate()
  .match({
    author: 'stark'
  })
  .end()

This code looks for all articles whose author field is stark, so the matches are:

{ "_id" : "1", "author" : "stark", "score" : 80 }
{ "_id" : "2", "author" : "stark", "score" : 85 }

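Counting

After match has filtered the documents, it can be combined with other pipeline stages.

In the following example, we pair it with group to count the documents whose score field is greater than 80: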

const dbCmd = db.command;
const $ = dbCmd.aggregate;

let res = await db.collection('articles')
  .aggregate()
  .match({
    score: dbCmd.gt(80)
  })
  .group({
    _id: null,
    count: $.sum(1)
  })
  .end();

Return value:

{ "_id" : null, "count" : 3 }
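
The order of the two stages matters: match narrows the records first, and group then counts what is left. A local sketch of the same pipeline on an in-memory copy of the sample data (an illustration only, not the database's implementation):

```javascript
const articles = [
  { _id: '1', author: 'stark', score: 80 },
  { _id: '2', author: 'stark', score: 85 },
  { _id: '3', author: 'bob', score: 60 },
  { _id: '4', author: 'li', score: 55 },
  { _id: '5', author: 'jimmy', score: 60 },
  { _id: '6', author: 'li', score: 94 },
  { _id: '7', author: 'justan', score: 95 }
]

// match stage: keep only the records whose score is greater than 80.
const matched = articles.filter((article) => article.score > 80)

// group stage with a constant _id: a single group that counts its records.
const countResult = { _id: null, count: matched.length }
console.log(countResult)
```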

limit

Limits the number of records passed on to the next stage.

Similar to limit() in the basic operations.
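
limit is often combined with skip for pagination. The helper below is a hypothetical sketch that turns a 1-based page number into the two values; the commented chain shows where they would go, assuming limit takes a number in the same way skip does.

```javascript
// Convert a 1-based page number and a page size into skip/limit values.
function pageWindow(page, pageSize) {
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError('page must be an integer >= 1')
  }
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError('pageSize must be an integer >= 1')
  }
  return { skip: (page - 1) * pageSize, limit: pageSize }
}

const thirdPage = pageWindow(3, 10)
console.log(thirdPage)

// Intended use inside a cloud function (not executed here):
// const res = await db.collection('users')
//   .aggregate()
//   .skip(thirdPage.skip)
//   .limit(thirdPage.limit)
//   .end()
```

For stable pages, the records should normally be sorted before skip and limit are applied.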

count

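Counts the records passed from the previous aggregation stage into this stage and outputs a single record in which the specified field holds that count.

Similar to count() in the basic operations.

sort

Sorts the input records by the specified fields.

Similar to orderBy() in the basic operations.

Suppose we have a students collection containing the following data: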

{ "_id": "1", "author": "stark",  "score": 80, "age": 18 }
{ "_id": "2", "author": "bob",    "score": 60, "age": 18 }
{ "_id": "3", "author": "li",     "score": 55, "age": 19 }
{ "_id": "4", "author": "jimmy",  "score": 60, "age": 22 }
{ "_id": "5", "author": "justan", "score": 95, "age": 33 }

Sort it with sort, first by the age field in descending order and then by the score field in descending order:

let res = await db.collection('students')
  .aggregate()
  .sort({
      age: -1,
      score: -1
  })
  .end()

The returned records are:

{ "_id": "5", "author": "justan", "score": 95, "age": 33 }
{ "_id": "4", "author": "jimmy",  "score": 60, "age": 22 }
{ "_id": "3", "author": "li",     "score": 55, "age": 19 }
{ "_id": "1", "author": "stark",  "score": 80, "age": 18 }
{ "_id": "2", "author": "bob",    "score": 60, "age": 18 }
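
The second sort key only matters when the first one ties, as with the two 18-year-olds above. A local sketch of the same ordering with a plain JavaScript comparator (an in-memory illustration, not the database's implementation):

```javascript
const studentRecords = [
  { _id: '1', author: 'stark', score: 80, age: 18 },
  { _id: '2', author: 'bob', score: 60, age: 18 },
  { _id: '3', author: 'li', score: 55, age: 19 },
  { _id: '4', author: 'jimmy', score: 60, age: 22 },
  { _id: '5', author: 'justan', score: 95, age: 33 }
]

// Equivalent of sort({ age: -1, score: -1 }): compare by age descending,
// and fall back to score descending only when the ages are equal.
const sortedStudents = studentRecords
  .slice() // copy, because Array.prototype.sort sorts in place
  .sort((a, b) => (b.age - a.age) || (b.score - a.score))

console.log(sortedStudents.map((student) => student._id))
```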

sortByCount

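Groups the incoming collection (group), counts the records in each group, sorts the groups by that count, and returns the sorted result.

Note that the expression takes the form $ + field name. Be careful not to omit the $ sign.

Counting primitive values

Suppose the passages collection contains the following records: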

{ "category": "Web" }
{ "category": "Web" }
{ "category": "Life" }

The following code tallies the article categories and counts the articles in each, that is, it runs the sortByCount aggregation on the category field.

let res = await db.collection('passages')
  .aggregate()
  .sortByCount('$category')
  .end()

The result is shown below: there are 2 articles in the Web category and 1 in the Life category.

{ "_id": "Web", "count": 2 }
{ "_id": "Life", "count": 1 }

Unwinding array values

Suppose the passages collection contains the following records, where the value of the tags field is an array.

{ "tags": [ "JavaScript", "C#" ] }
{ "tags": [ "Go", "C#" ] }
{ "tags": [ "Go", "Python", "JavaScript" ] }

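How do we tally the article tags and count how often each one occurs? Because the tags field holds an array, we first use unwind to expand the tags field and then call sortByCount.

The following code does this: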

let res = await db.collection('passages')
  .aggregate()
  .unwind(`$tags`)
  .sortByCount(`$tags`)
  .end()

The result is shown below:

{ "_id": "Go", "count": 2 }
{ "_id": "C#", "count": 2 }
{ "_id": "JavaScript", "count": 2 }
{ "_id": "Python", "count": 1 }
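
The two stages can be sketched locally as flatten, count, then sort. This is an in-memory illustration only; in particular, the relative order of tags with equal counts is whatever this sketch happens to produce and should not be read as a guarantee about the database.

```javascript
const passages = [
  { tags: ['JavaScript', 'C#'] },
  { tags: ['Go', 'C#'] },
  { tags: ['Go', 'Python', 'JavaScript'] }
]

// unwind: one entry per array element.
const unwoundTags = passages.flatMap((passage) => passage.tags)

// sortByCount: count the occurrences of each value...
const tagCounts = new Map()
for (const tag of unwoundTags) {
  tagCounts.set(tag, (tagCounts.get(tag) || 0) + 1)
}

// ...then order the groups by count, largest first.
const sortedTags = Array.from(tagCounts, ([tag, count]) => ({ _id: tag, count }))
  .sort((a, b) => b.count - a.count)

console.log(sortedTags)
```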

unwind

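Splits each record using the elements of the specified array field. After splitting, one record becomes one or more records, one for each element of the array.

unwind can be used in two forms:

The argument is a field name

unwind(<field name>)

The argument is an object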

unwind({
  path: <field name>,
  includeArrayIndex: <string>,
  preserveNullAndEmptyArrays: <boolean>
})

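Field	Type	Description
path	string	Name of the array field to split; must start with $.
includeArrayIndex	string	Optional. A new field name; the array index is stored in this field. The new field name must not start with $.
preserveNullAndEmptyArrays	boolean	If true, the document is still output when the path field is null, an empty array, or missing; if false, unwind does not output such records. Defaults to false.

Splitting an array

Suppose we have a products collection containing the following data: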

{ "_id": "1", "product": "tshirt", "size": ["S", "M", "L"] }
{ "_id": "2", "product": "pants", "size": [] }
{ "_id": "3", "product": "socks", "size": null }
{ "_id": "4", "product": "trousers", "size": ["S"] }
{ "_id": "5", "product": "sweater", "size": ["M", "L"] }

We split these records by the size field:

let res = await db.collection('products')
  .aggregate()
  .unwind('$size')
  .end()

Output:

{ "_id": "1", "product": "tshirt", "size": "S" }
{ "_id": "1", "product": "tshirt", "size": "M" }
{ "_id": "1", "product": "tshirt", "size": "L" }
{ "_id": "4", "product": "trousers", "size": "S" }
{ "_id": "5", "product": "sweater", "size": "M" }
{ "_id": "5", "product": "sweater", "size": "L" }

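Keeping the original array index after splitting

After splitting the records by the size field, we want to keep the original array index in a new index field.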

let res = await db.collection('products')
  .aggregate()
  .unwind({
    path: '$size',
    includeArrayIndex: 'index'
  })
  .end()

Output:

{ "_id": "1", "product": "tshirt", "size": "S", "index": 0 }
{ "_id": "1", "product": "tshirt", "size": "M", "index": 1 }
{ "_id": "1", "product": "tshirt", "size": "L", "index": 2 }
{ "_id": "4", "product": "trousers", "size": "S", "index": 0 }
{ "_id": "5", "product": "sweater", "size": "M", "index": 0 }
{ "_id": "5", "product": "sweater", "size": "L", "index": 1 }

Keeping records whose field is empty

Note that our collection contains two special rows with empty values:

...
{ "_id": "2", "product": "pants", "size": [] }
{ "_id": "3", "product": "socks", "size": null }
...

To keep documents in the output whose size is an empty array, null, or missing, use the preserveNullAndEmptyArrays parameter:

let res = await db.collection('products')
  .aggregate()
  .unwind({
    path: '$size',
    preserveNullAndEmptyArrays: true
  })
  .end()

Output:

{ "_id": "1", "product": "tshirt", "size": "S" }
{ "_id": "1", "product": "tshirt", "size": "M" }
{ "_id": "1", "product": "tshirt", "size": "L" }
{ "_id": "2", "product": "pants", "size": null }
{ "_id": "3", "product": "socks", "size": null }
{ "_id": "4", "product": "trousers", "size": "S" }
{ "_id": "5", "product": "sweater", "size": "M" }
{ "_id": "5", "product": "sweater", "size": "L" }
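
The three behaviors above can be summarized in one local sketch. unwindRecords is a hypothetical helper that works on an in-memory array and takes the field name without the leading $; it mirrors the sample outputs shown above, including showing null for the preserved records, and is not the database's implementation.

```javascript
// Local sketch of unwind on an in-memory array (not the database implementation).
function unwindRecords(records, field, options = {}) {
  const { includeArrayIndex, preserveNullAndEmptyArrays = false } = options
  const output = []
  for (const record of records) {
    const value = record[field]
    const isEmpty = value === null || value === undefined ||
      (Array.isArray(value) && value.length === 0)
    if (isEmpty) {
      // Mirrors the sample output above: kept records show the field as null.
      if (preserveNullAndEmptyArrays) output.push({ ...record, [field]: null })
      continue
    }
    // Assumption: a non-array value is handled like a one-element array
    // (index handling for that case is not modeled here).
    const elements = Array.isArray(value) ? value : [value]
    elements.forEach((element, index) => {
      const row = { ...record, [field]: element }
      if (includeArrayIndex) row[includeArrayIndex] = index
      output.push(row)
    })
  }
  return output
}

const products = [
  { _id: '1', product: 'tshirt', size: ['S', 'M', 'L'] },
  { _id: '2', product: 'pants', size: [] },
  { _id: '3', product: 'socks', size: null },
  { _id: '4', product: 'trousers', size: ['S'] },
  { _id: '5', product: 'sweater', size: ['M', 'L'] }
]

const plain = unwindRecords(products, 'size')
const indexed = unwindRecords(products, 'size', { includeArrayIndex: 'index' })
const preserved = unwindRecords(products, 'size', { preserveNullAndEmptyArrays: true })
console.log(plain.length, indexed.length, preserved.length)
```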

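Summary

The cloud database aggregation operations cover a fairly large number of topics, and this article has already left out some of the less commonly used ones. One remaining operation, the join query, is relatively complex yet comes up often in day-to-day work, so the next section is devoted to join queries in aggregation.

This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.

Originally published: 2022-03-30. In case of infringement, contact cloudcommunity@tencent.com for removal.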
