# MongoDB Aggregation Examples

## 1. Basics

• Unset Alternative. For the pipeline to be consistent with earlier examples in this book, it could have included an additional $unset stage to exclude the _id field. However, partly to show another way, the example pipeline used here marks the _id field for exclusion in the $set stage by assigning it the $$REMOVE variable.

## 2. Data Joins

### (1) One-to-One Join

You want to generate a report listing all purchases made in 2020, showing each order's product name and category rather than its product ID. To do this, you need to take the customer orders collection and join each order record to its corresponding product in the products collection. There is a many:1 relationship between the two collections, so matching orders to products produces a 1:1 join. The join uses a single-field comparison on each side, based on the product's id.

The source data comprises two collections, products and orders, spanning 2019-2021.

After executing the aggregation pipeline, three documents should be returned, representing the three orders placed in 2020, but with each order's product_id field replaced by two newly looked-up fields, product_name and product_category.

• Single Field Match. The pipeline includes a lookup join between a single field from each collection. For an illustration of performing a join based on two or more matching fields, see the Multi-Field Join & One-to-Many example.

• First Element Assumption. In this particular data model example, the join between the two collections is 1:1. Therefore the returned array of joined elements coming out of the lookup stage always contains precisely one array element. As a result, the pipeline extracts the data from this first array element only, using the first operator. For an illustration of performing a 1:many join instead, see the Multi-Field Join & One-to-Many example.
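As an illustration, a pipeline along the following lines could produce that result. This is a minimal sketch rather than the chapter's verbatim code; the field names (products.id, name, category, orders.product_id, orderdate) are assumptions inferred from the description:

```javascript
// Minimal sketch of the one-to-one join (field names are assumptions)
db.orders.aggregate([
  // Only look at orders placed in 2020
  {"$match": {
    "orderdate": {
      "$gte": ISODate("2020-01-01T00:00:00Z"),
      "$lt": ISODate("2021-01-01T00:00:00Z"),
    },
  }},

  // Join each order to its product record on one field from each side
  {"$lookup": {
    "from": "products",
    "localField": "product_id",
    "foreignField": "id",
    "as": "product_mapping",
  }},

  // The join is 1:1, so extract the single matched array element
  {"$set": {"product_mapping": {"$first": "$product_mapping"}}},

  // Surface the two looked-up fields; exclude _id via $$REMOVE
  // (the "Unset Alternative" technique described in the Basics section)
  {"$set": {
    "product_name": "$product_mapping.name",
    "product_category": "$product_mapping.category",
    "_id": "$$REMOVE",
  }},
  {"$unset": ["product_id", "product_mapping"]},
]);
```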
### (2) Multi-Field Join & One-to-Many

You want to generate a report listing, for each product, all the orders placed for it in 2020. To do this, you need to join each product in the shop's products collection to its corresponding orders, which live in the orders collection. There is a 1:many relationship between the two collections, based on a match of two fields on each side. Rather than joining on a single field like product_id (which doesn't exist in this data set), the join needs to use two common fields: product_name and product_variation.

Note: performing a 1:many join does not require joining on multiple fields on each side; it was simply considered helpful to show both aspects in one place in this example.

The source data comprises two collections, products and orders, spanning 2019-2021.

After executing the aggregation pipeline, two documents should be returned, representing the two products that had one or more orders in 2020, with each product's orders shown as an array.

• Multiple Join Fields. To perform a join of two or more fields between the two collections, you need to use a let parameter rather than specifying the localField and foreignField parameters used in a single field join. With a let parameter, you bind multiple fields from the first collection into variables ready to be used in the joining process. You use an embedded pipeline inside the lookup stage to match the _bind_ variables with fields in the second collection's records. In this instance, because the expr operator performs an equality comparison specifically (as opposed to a range comparison), the aggregation runtime can employ an appropriate index for this match.

• Reducing Array Content. The presence of an embedded pipeline in the lookup stage provides an opportunity to filter out three unwanted fields brought in from the second collection. Instead, you could use an unset stage later in the top-level pipeline to project out these unwanted array elements. If you need to apply more complex array content filtering rules, you can use the approach described in section _2. Avoid Unwinding & Regrouping Documents Just To Process Array Elements_ of the _Pipeline Performance Considerations_ chapter.
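A sketch of how such a multi-field, one-to-many join could look follows. The field names (products.name, variation, orders.product_name, product_variation, orderdate) are assumptions for illustration:

```javascript
// Minimal sketch of the multi-field 1:many join (names are assumptions)
db.products.aggregate([
  {"$lookup": {
    "from": "orders",
    // Bind two fields from the first collection into variables
    "let": {
      "prdname": "$name",
      "prdvartn": "$variation",
    },
    // Embedded pipeline run against the second collection (orders)
    "pipeline": [
      // Equality match on both bound variables, so an index can be used
      {"$match": {
        "$expr": {
          "$and": [
            {"$eq": ["$product_name", "$$prdname"]},
            {"$eq": ["$product_variation", "$$prdvartn"]},
          ],
        },
      }},
      // Keep only 2020 orders
      {"$match": {
        "orderdate": {
          "$gte": ISODate("2020-01-01T00:00:00Z"),
          "$lt": ISODate("2021-01-01T00:00:00Z"),
        },
      }},
      // Filter out unwanted fields brought in from the second collection
      {"$unset": ["_id", "product_name", "product_variation"]},
    ],
    "as": "orders",
  }},

  // Only keep products that had at least one order in 2020
  {"$match": {"orders": {"$ne": []}}},

  {"$unset": ["_id"]},
]);
```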
## 3. Data Type Conversion

### (1) Strongly-Typed Conversion

A set of retail orders has been imported into a MongoDB collection by a third party, but all the data types were lost along the way, so every field value is held as a string. You want to re-establish the correct types for all the documents and copy them into a new "cleaned" collection. You can include this type-conversion logic in an aggregation pipeline because you know the type each field had in the original records.

Unlike most of the examples in this book, the aggregation pipeline writes its output to a collection rather than streaming the results back to the calling application.

The orders collection contains three order documents, where every field is held as a string (note that the second document deliberately omits the reported field of the further_info sub-document).

Executing the aggregation produces a new collection named orders_typed, which should have the same number of documents and the same field structure and names as the source orders collection, but now using strongly-typed booleans, dates, integers, and decimals. A sketch of such a conversion pipeline appears after the observations below.

• Boolean Conversion. The pipeline's conversions for integers, decimals, and dates are straightforward using the corresponding expression operators, toInt, toDecimal and toDate. However, the expression operator toBool is not used for the boolean conversion. This is because toBool will convert any non-empty string to _true_ regardless of its value. As a result, the pipeline uses a switch operator to compare the lowercase version of strings with the text 'true' and 'false', returning the matching boolean.

• Preserving Non-Existence. The field further_info.reported is an optional field in this scenario. The field may not always appear in a document, as illustrated by one of the three documents in the example. If a field is not present in a document, this potentially significant fact should never be lost. The pipeline includes additional logic for the further_info.reported field to preserve this information. The pipeline ensures the field is not included in the output document if it didn't exist in the source document. An ifNull conditional operator is used, which returns the $$REMOVE marker flag if the field is missing, instructing the aggregation engine to omit it.
• Output To A Collection. The pipeline uses a $merge stage to instruct the aggregation engine to write the output to a collection rather than returning a stream of results. For this example, the default settings for $merge are sufficient. Each transformed record coming out of the aggregation pipeline becomes a new record in the target collection. The pipeline could have used a $out rather than a $merge stage. However, because $merge supports both unsharded and sharded collections, whereas $out only supports the former, $merge provides a more universally applicable example. If your aggregation needs to create a brand new unsharded collection, $out may be a little faster because the aggregation will completely replace the existing collection if it exists. Using $merge, the system has to perform more checks for every record the aggregation inserts (even though, in this case, it will be to a new collection).

• Trickier Date Conversions. In this example, the date strings contain all the date parts required by the $toDate operator to perform a conversion correctly. In some situations, this may not be the case, and a date string may be missing some valuable information (e.g. which century a 2-character year string is for, such as the century 19 or 21). To understand how to deal with these cases, see the Convert Incomplete Date Strings example chapter.
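Putting these observations together, the conversion pipeline could be sketched as follows. The field names (order_date, value, further_info.item_qty, further_info.reported) are illustrative assumptions, not the chapter's verbatim code:

```javascript
// Minimal sketch of the strongly-typed conversion (names are assumptions)
db.orders.aggregate([
  {"$set": {
    // Straightforward conversions via dedicated expression operators
    "order_date": {"$toDate": "$order_date"},
    "value": {"$toDecimal": "$value"},
    "further_info.item_qty": {"$toInt": "$further_info.item_qty"},

    // $toBool would turn ANY non-empty string into true, so compare
    // the lowercased text against 'true' and 'false' explicitly
    "further_info.reported": {"$switch": {
      "branches": [
        {"case": {"$eq": [{"$toLower": "$further_info.reported"}, "true"]},
         "then": true},
        {"case": {"$eq": [{"$toLower": "$further_info.reported"}, "false"]},
         "then": false},
      ],
      // Preserve non-existence: omit the field if it was absent
      "default": {"$ifNull": ["$further_info.reported", "$$REMOVE"]},
    }},
  }},

  // Write the transformed records out to a new collection
  {"$merge": {"into": "orders_typed"}},
]);
```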

### (2) Convert Incomplete Date Strings

In this scenario, the text date strings being converted do not themselves capture everything needed for a reliable conversion. To convert them correctly, you must already know:

• The specific century: e.g. the 1900s, the 2000s, or something else;
• The specific time zone: e.g. GMT, IST, PST, or something else;
• The specific language of the three-letter month abbreviation: e.g. whether "JAN" is French, English, or something else.

• Concatenation Explanation. In this pipeline, the text fields (e.g. '12-DEC-20 12.12.12.999000000') are each converted to date fields (e.g. 2020-12-12T12:12:12.999Z). This is achieved by concatenating together the following four example elements before passing them to the $dateFromString operator to convert to a date type (see the sketch after these observations):
  • '12-' _(day of the month from the input string + the hyphen suffix already present in the text)_
  • '12' _(replacing 'DEC')_
  • '-20' _(hard-coded hyphen + hard-coded century)_
  • '20 12.12.12.999' _(the rest of the input string apart from the last 6 nanosecond digits)_

• Further Reading. This example is based on the output of the blog post: Converting Gnarly Date Strings to Proper Date Types Using a MongoDB Aggregation Pipeline.
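A sketch of that concatenation follows, assuming the strings live in a hypothetical payment_date field and are known to be 21st-century, UTC values with English month abbreviations:

```javascript
// Minimal sketch of the date-string rebuild (payment_date is an assumption)
db.payments.aggregate([
  {"$set": {
    "payment_date": {
      "$dateFromString": {
        "format": "%d-%m-%Y %H.%M.%S.%L",
        "timezone": "UTC",
        "dateString": {"$concat": [
          // '12-' : day of month + the hyphen already in the text
          {"$substrCP": ["$payment_date", 0, 3]},
          // '12'  : 2-digit month derived from the 3-letter abbreviation
          {"$arrayElemAt": [
            ["01","02","03","04","05","06","07","08","09","10","11","12"],
            {"$indexOfArray": [
              ["JAN","FEB","MAR","APR","MAY","JUN",
               "JUL","AUG","SEP","OCT","NOV","DEC"],
              {"$substrCP": ["$payment_date", 3, 3]},
            ]},
          ]},
          // '-20' : hard-coded hyphen + hard-coded century
          "-20",
          // '20 12.12.12.999' : the rest, minus the 6 nanosecond digits
          {"$substrCP": ["$payment_date", 7, 15]},
        ]},
      },
    },
  }},
]);
```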
## 4. Trend Analysis

### (1) Faceted Classification (Faceted Search)

You want to provide faceted search capability on your retail website, enabling customers to refine their product search by selecting specific characteristics against the product results listed on the web page. It is beneficial to classify the products along different dimensions, where each dimension, or facet, corresponds to a particular field in a product record (e.g. product rating, product price). Each facet should be broken down into sub-ranges so that a customer can select a specific sub-range (e.g. 4 - 5 stars) for a particular facet (e.g. rating).

The aggregation pipeline analyses the products collection by each facet's field (rating and price) to determine the spread of values for each facet.

The products collection contains 16 documents. After executing the aggregation pipeline, a single document should be returned, containing 2 facets (keyed off by_price and by_rating respectively), where each facet shows its sub-ranges of values and the products belonging to each sub-range.

• Multiple Pipelines. The $facet stage doesn't have to be employed for you to use the $bucketAuto stage. In most _faceted search_ scenarios, you will want to understand a collection by multiple dimensions at once (_price_ & _rating_ in this case). The $facet stage is convenient because it allows you to define various $bucketAuto dimensions in one go in a single pipeline. Otherwise, a client application must invoke an aggregation multiple times, each using a new $bucketAuto stage to process a different field. In fact, each section of a $facet stage is just a regular aggregation [sub-]pipeline, able to contain any type of stage (with a few specific documented exceptions) and may not even contain $bucketAuto or $bucket stages at all.

• Single Document Result. If the result of a $facet based aggregation is allowed to be multiple documents, this will cause a problem. The results will contain a mix of records originating from different facets but with no way of ascertaining the facet each result record belongs to. Consequently, when using $facet, a single document is always returned, containing top-level fields identifying each facet. Having only a single result record is not usually a problem. A typical requirement for faceted search is to return a small amount of grouped summary data about a collection rather than large amounts of raw data from the collection. Therefore the 16MB document size limit should not be an issue.

• Spread Of Ranges. In this example, each of the two employed bucketing facets uses a different granularity number scheme for spreading out the sub-ranges of values. You choose a numbering scheme based on what you know about the nature of the facet. For instance, most of the _ratings_ values in the sample collection have scores bunched between late 3s and early 4s. If a numbering scheme is defined to reflect an even spread of ratings, most products will appear in the same sub-range bucket and some sub-ranges would contain no products (e.g. ratings 2 to 3 in this example). This wouldn't provide website customers with much selectivity on product ratings.
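As a rough sketch, the faceted classification could be expressed like this. The bucket counts, granularity schemes, and the name field pushed into each bucket are illustrative assumptions:

```javascript
// Minimal sketch of a two-facet bucketing pipeline (values are assumptions)
db.products.aggregate([
  {"$facet": {
    // Spread prices across auto-sized ranges, with boundaries rounded
    // to a "nice" 1-2-5 number series
    "by_price": [
      {"$bucketAuto": {
        "groupBy": "$price",
        "buckets": 3,
        "granularity": "1-2-5",
        "output": {
          "count": {"$sum": 1},
          "products": {"$push": "$name"},
        },
      }},
    ],
    // Spread ratings across auto-sized ranges using a different scheme
    "by_rating": [
      {"$bucketAuto": {
        "groupBy": "$rating",
        "buckets": 5,
        "granularity": "R20",
        "output": {
          "count": {"$sum": 1},
          "products": {"$push": "$name"},
        },
      }},
    ],
  }},
]);
```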
### (2) Largest Graph Network

Based on a social-network database similar to Twitter, your organisation wants to know the best targets for a new marketing campaign. You want to search a collection of social network users, where each user document holds a username and the names of the people who follow that user. You will execute an aggregation pipeline that walks each user record's followed_by array to determine which user has the largest _network reach_, i.e. the widest circle of direct and indirect followers.

Note that this example uses a simple data model for brevity. However, this is unlikely to be an optimum data model for large-scale use of $graphLookup, for example for social network users with many followers, or when running in a sharded environment. For more guidance on such issues, see Socialite.

• Following Graphs. The $graphLookup stage helps you traverse relationships between records, looking for patterns that aren't necessarily evident from looking at each record in isolation. In this example, by looking at _Paul's_ record in isolation, it is evident that _Paul_ has no _friends_ and thus has the lowest network reach. However, it is not obvious that _Carol_ has the greatest network reach just by looking at the number of people _Carol_ is directly followed by, which is two. _David_, for example, is followed by three people (one more than _Carol_). However, the executed aggregation pipeline can deduce that _Carol_ has the most extensive network reach.

• Index Use. The $graphLookup stage can leverage the index on the field name for each of its connectToField hops.
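A sketch of such a traversal follows; the collection name users and its name and followed_by fields are assumptions consistent with the description above:

```javascript
// Minimal sketch of computing each user's network reach
db.users.aggregate([
  // Recursively walk who each user is followed by, hop by hop
  {"$graphLookup": {
    "from": "users",
    "startWith": "$followed_by",
    "connectFromField": "followed_by",
    "connectToField": "name",   // each hop can use an index on 'name'
    "as": "extended_network",
  }},

  // Network reach = size of the transitively collected network
  {"$set": {
    "network_reach": {"$size": "$extended_network"},
    "extended_connections": "$extended_network.name",
  }},

  {"$unset": ["_id", "followed_by", "extended_network"]},

  // Widest reach first
  {"$sort": {"network_reach": -1}},
]);
```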

### (3) Incremental Analytics

You have a collection of shop orders accumulated over many years, and you want to maintain a separate summary collection of daily order totals. Rather than reprocessing the entire order history each time, you want to bring the summary up to date incrementally by summarising only each newly completed trading day.

• Merging Results. The pipeline uses a $merge stage to instruct the aggregation engine to write the output to a collection rather than returning a stream of results. In this example, with the options you provide to $merge, the aggregation inserts a new record in the destination collection if a matching one doesn’t already exist. If a matching record already exists, it replaces the previous version.
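To make that behaviour concrete, here is a rough sketch. The collection and field names (daily_orders_summary, orderdate, value) are illustrative assumptions, not the chapter's verbatim code:

```javascript
// $merge matching on a field other than _id requires a unique index on it
db.daily_orders_summary.createIndex({"day": 1}, {"unique": true});

// Minimal sketch: summarise one trading day and upsert the result
db.orders.aggregate([
  // Only the day being summarised
  {"$match": {
    "orderdate": {
      "$gte": ISODate("2021-02-01T00:00:00Z"),
      "$lt": ISODate("2021-02-02T00:00:00Z"),
    },
  }},

  // One summary record for the whole day
  {"$group": {
    "_id": {"$dateToString": {"format": "%Y-%m-%d", "date": "$orderdate"}},
    "total_value": {"$sum": "$value"},
    "total_orders": {"$sum": 1},
  }},
  {"$set": {"day": "$_id", "_id": "$$REMOVE"}},

  // Insert the day's summary if new; replace the previous version if re-run
  {"$merge": {
    "into": "daily_orders_summary",
    "on": "day",
    "whenMatched": "replace",
    "whenNotMatched": "insert",
  }},
]);
```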

• Incremental Updates. The example illustrates just two days of shop orders, albeit with only a few orders, to keep the example simple. At the end of each new trading day, you run the aggregation pipeline to generate the current day’s summary only. Even after the source collection has increased in size over many years, the time it takes you to bring the summary collection up to date again stays constant. In a real-world scenario, the business might expose a graphical chart showing the changing daily orders trend over the last rolling year. This charting dashboard is not burdened by the cost of periodically regenerating values for all days in the year. There could be hundreds of thousands of orders received per day for real-world retailers, especially large ones. A day’s summary may take many seconds to generate in that situation. Without an _incremental analytics_ approach, if you need to generate a year’s worth of daily summaries every time, it would take hours to refresh the business dashboard.

• Idempotency. If a retailer is aggregating tens of thousands of orders per day, then during end-of-day processing, it may choose to generate 24 hourly summary records rather than a single daily record. This provides the business with finer granularity to understand trends better. As with any software process, when generating hourly results into the summary collection, there is the risk of not fully completing if a system failure occurs. If an in-flight aggregation terminates abnormally, it may not have written all 24 summary collection records. The failure leaves the summary collection in an indeterminate and incomplete state for one of its days. However, this isn't a problem because of the way the aggregation pipeline uses the $merge stage. When an aggregation fails to complete, it can just be re-run. When re-run, it will regenerate all the results for the day, replacing existing summary records and filling in the missing ones. The aggregation pipeline is idempotent, and you can run it repeatedly without damaging the summary collection. The overall solution is self-healing and naturally tolerant of inadvertently aborted aggregation jobs.

• Retrospective Changes. Sometimes, an organisation may need to go back and correct records from the past, as illustrated in this example. For instance, a bank may need to fix a past payment record due to a settlement issue that only comes to light weeks later. With the approach used in this example, it is straightforward to re-execute the aggregation pipeline for a prior date, using the updated historical data. This will correctly update the specific day's summary data only, to reflect the business's current state.
## 5. Data Security

### (1) Redacted View

You have a persons collection, and a particular client application should not be allowed to see sensitive information within it. Consequently, you will provide a read-only view exposing only a filtered subset of each person's data. In a real-world situation, you would also use MongoDB's Role-Based Access Control (RBAC) to limit the client application to accessing only the view and not the original collection.

You will use the adults view to restrict the client application's access to personal data in two ways:

1. Only show people aged 18 and over (by checking each person's dateofbirth field);
2. Exclude each person's social_security_num field from the results.

Essentially, this is an illustration of achieving record-level access control in MongoDB. A sketch of such a view definition appears after the observations below.

The persons collection contains 5 records. First, execute the defined aggregation pipeline directly against the collection (and observe its explain plan) to test it before creating the view. Then create the adults view, which automatically applies the aggregation pipeline whenever anyone queries it. Run a regular MQL query against the view with no filter criteria, and then a query specifying a filter to return only female adults, observing the explain plan for each and noting how the gender filter affects it.

Running either an aggregate() or a find() command against the view should produce the same result: 3 documents, representing the 3 people older than 18, but without showing their actual dates of birth. Running the find() command with the "gender": "FEMALE" filter against the view should return only the 2 female records, because the male records are excluded.

• Expr & Indexes. The "NOW" system variable used here returns the current system date-time. However, you can only access this system variable via an aggregation expression and not directly via the regular MongoDB query syntax used by MQL and $match. You must wrap an expression using $$NOW inside an expr operator. As described in the section _Restrictions When Using Expressions_ in an earlier chapter, if you use an expr query operator to perform a range comparison, you can't make use of an index in versions of MongoDB earlier than 5.0. Therefore, in this example, unless you use MongoDB 5.0, the aggregation will not take advantage of an index on dateofbirth. For a view, because you specify the pipeline earlier than it is ever run, you cannot obtain the current date-time at runtime by other means.

• View Finds & Indexes. Even for versions of MongoDB before 5.0, the explain plan for the _gender query_ run against the view shows an index has been used (the index defined for the gender field). At runtime, a view is essentially just an aggregation pipeline defined "ahead of time". When db.adults.find({"gender": "FEMALE"}) is executed, the database engine dynamically appends a new match stage to the end of the pipeline for the gender match. It then optimises the pipeline by moving the new match stage to the pipeline's start. Finally, it adds the filter extracted from the new match stage to the aggregation's initial query, and hence it can then leverage an index containing the gender field. An explain plan captured on a MongoDB version before 5.0 shows how the filter on gender and the filter on dateofbirth combine at runtime, and how the index for gender is used to avoid a full collection scan. In MongoDB 5.0, the explain plan will show the aggregation runtime executing the pipeline more optimally by entirely using the compound index based on both the fields gender and dateofbirth.

• Further Reading. The ability for _find_ operations on a view to automatically push filters into the view's aggregation pipeline, and then be further optimised, is described in the blog post: Is Querying A MongoDB View Optimised?
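A minimal sketch of such a view definition follows, assuming a simple milliseconds-based approximation of "18 years ago" (the book's exact age arithmetic may differ):

```javascript
// Minimal sketch of the adults view (age arithmetic is an approximation)
const adultsPipeline = [
  // Only people whose dateofbirth is at least ~18 years before "now";
  // $$NOW is only accessible inside an aggregation expression ($expr)
  {"$match": {
    "$expr": {
      "$lt": [
        "$dateofbirth",
        // ~18 years in milliseconds (ignores exact leap-day handling)
        {"$subtract": ["$$NOW", 18 * 365.25 * 24 * 60 * 60 * 1000]},
      ],
    },
  }},

  // Never expose these sensitive fields through the view
  {"$unset": ["_id", "dateofbirth", "social_security_num"]},
];

// Test the pipeline directly against the collection first
db.persons.aggregate(adultsPipeline);

// Create the view, which applies the pipeline to every query against it
db.createView("adults", "persons", adultsPipeline);

// Consumers then query the view like a normal collection
db.adults.find({"gender": "FEMALE"});
```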
### (2) Mask Sensitive Fields

Because the $rand operator is used, this example requires MongoDB 4.4 at a minimum.

You want to perform irreversible masking on the sensitive fields of a collection of credit card payments, ready to provide the output data set to a third party for analysis, without exposing sensitive information to that third party. The specific changes you need to make to the payment fields are:

• Partially obfuscate the cardholder's name;
• Obfuscate the first 12 digits of the card number, retaining only the final 4 digits;
• Adjust the card's expiry date-time by adding or subtracting a random amount of up to 30 days (~1 month);
• Replace the card's 3-digit security code with a random set of 3 digits;
• Adjust the transaction's amount by adding or subtracting a random amount of up to 10% of the original amount;
• Flip the boolean value of the reported field in roughly 20% of the records;
• Exclude the whole customer_info sub-document if its nested category field is set to RESTRICTED.

The payments collection contains 2 credit card payment documents containing sensitive data. After executing the aggregation pipeline, 2 documents corresponding to the source documents should be returned, but with many of their fields redacted and obfuscated, and with the customer_info field omitted where it was marked RESTRICTED. A sketch of such a masking pipeline appears after the observations below.

• Targeted Redaction. The pipeline uses a cond operator to return the $$REMOVE marker variable if the category field is equal to RESTRICTED. This informs the aggregation engine to exclude the whole customer_info sub-document from the stage's output for the record. Alternatively, the pipeline could have used a $redact stage to achieve the same. However, $redact typically has to perform more processing work due to needing to check every field in the document. Hence, if a pipeline only needs to redact out one specific sub-document, use the approach outlined in this example.
• Regular Expression. For masking the card_name field, a regular expression operator is used to extract the last word of the field’s original value. $regexFind returns metadata into the stage’s output records, indicating if the match succeeded and what the matched value is. Therefore, an additional $set stage is required later in the pipeline to extract the actual matched word from this metadata and prefix it with some hard-coded text.

• Meaningful Insight. Even though the pipeline is irreversibly obfuscating fields, it doesn’t mean that the masked data is useless for performing analytics to gain insight. The pipeline masks some fields by fluctuating the original values by a small but limited random percentage (e.g. card_expiry, transaction_amount), rather than replacing them with completely random values (e.g. card_sec_code). In such cases, if the input data set is sufficiently large, then minor variances will be equalled out. For the fields that are only varied slightly, users can derive similar trends and patterns from analysing the masked data as they would the original data.

• Further Reading. This example is based on the output of two blog posts: 1) MongoDB Irreversible Data Masking, and 2) MongoDB Reversible Data Masking.
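Drawing these observations together, the masking pipeline could be sketched like this. The field names (card_name, card_num, card_expiry, card_sec_code, transaction_amount, reported, customer_info) follow the requirements list above but should be treated as assumptions:

```javascript
// Minimal sketch of the irreversible masking rules (names are assumptions)
db.payments.aggregate([
  {"$set": {
    // Capture the last word of the cardholder's name (match metadata)
    "card_name": {"$regexFind": {"input": "$card_name", "regex": /(\S+)$/}},

    // Mask the first 12 digits, retaining only the final 4
    "card_num": {"$concat": [
      "XXXXXXXXXXXX",
      {"$substrCP": ["$card_num", 12, 4]},
    ]},

    // Shift expiry by a random amount of up to +/- 30 days
    "card_expiry": {"$add": [
      "$card_expiry",
      {"$floor": {"$multiply": [
        {"$subtract": [{"$rand": {}}, 0.5]},
        2 * 30 * 24 * 60 * 60 * 1000,   // 30 days in milliseconds, each way
      ]}},
    ]},

    // Replace the security code with 3 random digits
    "card_sec_code": {"$concat": [
      {"$toString": {"$floor": {"$multiply": [{"$rand": {}}, 10]}}},
      {"$toString": {"$floor": {"$multiply": [{"$rand": {}}, 10]}}},
      {"$toString": {"$floor": {"$multiply": [{"$rand": {}}, 10]}}},
    ]},

    // Vary the amount by a random amount of up to +/- 10% of its value
    "transaction_amount": {"$add": [
      "$transaction_amount",
      {"$multiply": [
        {"$subtract": [{"$rand": {}}, 0.5]}, 0.2, "$transaction_amount",
      ]},
    ]},

    // Flip 'reported' in roughly 20% of the records
    "reported": {"$cond": {
      "if": {"$lte": [{"$rand": {}}, 0.8]},
      "then": "$reported",
      "else": {"$not": ["$reported"]},
    }},

    // Omit the whole sub-document when it is marked RESTRICTED
    "customer_info": {"$cond": {
      "if": {"$eq": ["$customer_info.category", "RESTRICTED"]},
      "then": "$$REMOVE",
      "else": "$customer_info",
    }},
  }},

  // Extract the matched word from the $regexFind metadata and prefix it
  {"$set": {
    "card_name": {"$concat": [
      "Mx. ", {"$ifNull": ["$card_name.match", "Anonymous"]},
    ]},
  }},

  {"$unset": ["_id"]},
]);
```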

## 6. Time Series

### (1) IoT Power Consumption

• Integral Trapezoidal Rule. As documented in the MongoDB Manual, $integral _”returns an approximation for the mathematical integral value, which is calculated using the trapezoidal rule”_. For non-mathematicians, this explanation may be hard to understand. You may find it easier to comprehend the behaviour of the $integral operator from the explanation and worked example that follow:

Essentially, the trapezoidal rule determines the area of a region between two points under a graph by matching the region with a trapezoid shape that approximately fits this region and then calculating the area of this trapezoid. Picture a set of points on a graph, with a matched trapezoid shape underneath each pair of points. For this IoT Power Consumption example, the points on the graph represent an air-conditioning unit's power readings captured every 30 minutes. The Y-axis is the _power rate_ in Kilowatts, and the X-axis is _time_, indicating when the device captured each reading. Consequently, for this example, the energy consumed by the air-conditioning unit over a given hour's span is the area of that hour's section under the graph, which is approximately the combined area of its two trapezoids. Using the $integral operator for the window of time you define in the $setWindowFields stage, you are asking for this approximate area to be calculated, which is the Kilowatt-hours consumed by the air-conditioning unit in one hour.
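As a concrete illustration of the arithmetic (using made-up readings, not the example's actual data):

```javascript
// Illustrative trapezoidal-rule arithmetic with made-up numbers
const powerAtStart = 8;   // kW reading at the start of the interval
const powerAtEnd   = 9;   // kW reading 30 minutes later
const widthInHours = 0.5; // the interval's width on the time (X) axis

// Trapezoid area = average height x width = energy consumed
const kilowattHours = ((powerAtStart + powerAtEnd) / 2) * widthInHours;
console.log(kilowattHours); // 4.25 kWh for this half-hour slice

// An hour's consumption is then the sum of its two half-hour trapezoids,
// which is what $integral approximates over the defined window
```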

• Window Range Definition. For every captured document representing a device reading, this example’s pipeline identifies a window of _1-hour_ of previous documents relative to this _current_ document. The pipeline uses this set of documents as the input for the $integral operator. It defines this window range in the setting range: [-1, "current"], unit: "hour". The pipeline assigns the output of the $integral calculation to a new field called consumedKilowattHours.

• One Hour Range Vs Hours Output. The fact that the $setWindowFields stage in the pipeline defines unit: "hour" in two places may appear redundant at face value. However, this is not the case, and each serves a different purpose. As described in the previous observation, unit: "hour" for the "window" option helps dictate the size of the window of the previous number of documents to analyse. However, unit: "hour" for the $integral operator defines that the output should be in hours (“Kilowatt-hours” in this example), yielding the result consumedKilowattHours: 8.5 for one of the processed device readings. However, if the pipeline defined this $integral parameter to be "unit": "minute" instead, which is perfectly valid, the output value would be 510 Kilowatt-minutes (i.e. 8.5 x 60 minutes).

• Optional Time Series Collection. This example uses a time series collection to store sequences of device measurements over time efficiently. Employing a time series collection is optional, as shown in the NOTE Javascript comment in the example code. The aggregation pipeline does not need to be changed and achieves the same output if you use a regular collection instead. However, when dealing with large data sets, the aggregation will complete quicker by employing a time series collection.

• Index for Partition By & Sort By. In this example, you define the index {deviceID: 1, timestamp: 1} to optimise the use of the combination of the partitionBy and sortBy parameters in the $setWindowFields stage. This means that the aggregation runtime does not have to perform a slow in-memory sort based on these two fields, and it also avoids the pipeline stage memory limit of 100 MB. It is beneficial to use this index regardless of whether you employ a regular collection or adopt a time series collection.
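Pulling these observations together, the core stage could be sketched as follows. The collection name device_readings and the field powerKilowatts are assumptions; deviceID, timestamp, the window definition, and the index come from the observations above:

```javascript
// Index supporting partitionBy + sortBy (avoids a slow in-memory sort)
db.device_readings.createIndex({"deviceID": 1, "timestamp": 1});

// Minimal sketch of the windowed $integral calculation
db.device_readings.aggregate([
  {"$setWindowFields": {
    // One independent stream of windows per device
    "partitionBy": "$deviceID",
    "sortBy": {"timestamp": 1},
    "output": {
      "consumedKilowattHours": {
        // Approximate area under the power curve, with output in hours
        "$integral": {"input": "$powerKilowatts", "unit": "hour"},
        // ...for the window spanning the previous hour up to "now"
        "window": {"range": [-1, "current"], "unit": "hour"},
      },
    },
  }},
]);
```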