Before we jump into the aggregation pipeline and the group stage, we need some data to work with. I’m taking an example
Movies collection for understanding the concept here. Again, there’ll be links to the playground for each query throughout the article.
Movies collection with only 5 documents containing random data:
Now that we have our sample collection, it’s time to explore the
To find the distinct items in a collection we can use the group stage on any field that we want to group by. This field will be unique in the output. Let’s group the movies by their release year:
Here’s the output of the above query. Note that we only got unique release year values in the output.
Similar to grouping by a single field, we might want to group the data with more than one field as per our use case. MongoDB aggregation pipeline allows us to group by as many fields as we want.
Whatever we put inside the
_id field is used to group the documents ie, it returns all the fields present inside the
_id field and groups by all of them.
Let’s group the movies by their release year and their runtime:
Grouping by release year and their runtime gives us this output:
Instead of using a single field to group by, we are using multiple fields in the above scenario. The combination of release year and runtime acts as the unique identifier for each document.
There are a lot of accumulator functions available in the group stage which can be used to aggregate the data. They help us carry out some of the most common operations on the grouped data. Let’s take a look at some of them:
It is used to count the number of documents in the group. This can be combined with our group-by query to get the total number of documents in the group.
Let’s apply this to our movies collection:
We’ll get the total movies released each year:
We can use the
$sum accumulator to add up all the values in a field. Let’s group the movies by their rating and sum up the reviews to understand if there’s a correlation between movie rating and the number of reviews.
And here we can see that there is a slight correlation between the number of reviews and the movie rating. They seem to be loosely directly proportional for our sample movies collection:
We might want to examine which year has the highest average movies rating for analytical purposes. Let’s see how we can get those stats from our data using the
We are first grouping the movies by the release year and then calculating the average rating for each release year. Here’s the output of the above query:
We want to look at all the ratings for every release year. We can use this data to analyze which year had major variations in the movie ratings. Let’s use the
$push accumulator for this job:
All the movie ratings for each release year are pushed into an array:
You can consider this to be like the
$addToSet only adds the value to the array if it doesn’t already exist. This is the only difference between the two accumulators.
Let’s group the movies by their ratings and collect all the unique release years against each rating:
We get ratings along with their unique release years:
Let’s say we want to find out successful release years for the movies. A year is considered successful if all the movies released during that year have a rating greater than 7. Let’s use the
$min accumulator to get the successful years:
- We have grouped the movies collection using the
minRatingfield maintains the minimum rating for each release year.
- Finally, a
$matchstage to filter out the years which don’t have a minimum rating greater than 7.
This accumulator is slightly different from the
$first array operator which gives the first element in an array. For each grouped document,
$first accumulator gives us the first one.
Let’s fetch the highest-rated movie for every release year. Since we want to get the highest-rated document from each group, we need to sort the documents before passing them to the group stage.
We are sorting using two fields here,
rating. Let’s understand the output of the
$sort first stage:
The output is first sorted on the basis of ascending release year and then for each year, the movies are sorted in descending order of rating.
This sorted output is then passed to the
$group stage that groups the documents by their release year. For example, the
$group stage is working with two documents for the release year 2005:
Let’s call these shortlisted documents for the release year 2005. This happens for all unique release years. The
$group stage picks the first element from these shortlisted documents (which has the highest rating because ratings are sorted in descending order).
$group stages, here’s the final output of the query:
NOTE: Passing sorted documents to the
$groupstage does not guarantee that the order will be preserved.
Combine $group stage with $project
The movie rating is a floating-point number. We’ll round that off to the nearest integer to get the movie rating as a whole number. Finally, we’ll group movies by their modified ratings.
The query gives us the number of movies with a particular (modified) rating:
- We used
$projectstage to round off the rating to the nearest integer.
- Then, a
$groupstage to group the movies by their modified rating.
Here’s the output of the above query:
The possibilities are endless. You can combine many other stages, perform some filters, put conditions, or even
$$REMOVE the documents.
Sort the results with $sort
The year with the highest movie minutes might give us some insights on movies production and its correlation with audience attention spans over the years. So let’s understand how to achieve that with the following query:
We are fetching the total runtime of all the movies released in a particular year and then sorting them in descending order with the help of
It is evident from this query that the attention spans of the target audience have been decreasing in a non-uniform way over the years.
$group vs $project stage
We have an n:1 relationship between input and output documents in the
$group stage. But, we have a 1:1 relationship in the
$group stage, we usually get a count, sum, average of documents based on the grouping key (or
_id), or even build an array. All of these operations take n number of documents and the output of the group is a single document with the aggregated values.
On the other hand, we include/exclude fields, perform field transformations within a single document in case of project stage in the aggregation pipeline:
$group stage has a limit of 100 megabytes of RAM
If you’re working with a massive dataset and you receive an error during group stage execution, you might be hitting the memory limit. If you want to increase it, use the
allowDiskUse option to enable the
$group stage to write to temporary files on disk.
The reason for this issue is very well stated in the official MongoDB documentation:
“Pipeline stages operate on streams of documents with each pipeline stage taking in documents, processing them, and then outputting the resulting documents.
Some stages can’t output any documents until they have processed all incoming documents. These pipelines stages must keep their stage output in RAM until all incoming documents are processed. As a result, these pipeline stages may require more space than the 100 MB limit.” — MongoDB docs
That’s it! This was the introduction to how the group stage works in the MongoDB aggregation pipeline. We looked at how we can group data using single fields (distinct count), multiple fields, sort them, how we can carry out complex computations by adding conditions to the
$group stage, and the subtle difference between
Finally, I’d like to thank you for taking the time to read and understand the concept. I hope this helps deepen your understanding of how group aggregation works in MongoDB.
Want to Connect? Here's an excerciseTo solidify your understanding, I have curated a couple of questions related to what we’ve learned in this article. You can download the exercise PDF below. It contains MongoDB playground links containing the answers to all the questions:5 quick questions on the group stage with answers