The EDB Blog

Comparison of JOINS: MongoDB vs. PostgreSQL

March 31, 2020

 

This is the second in a continuing series comparing MongoDB capabilities with those of Postgres. The first post covered “Schema Later Considered Harmful.”

In this blog post we review the JOIN capabilities in both MongoDB and Postgres. The conclusion is that MongoDB joins are very brittle (when things change, application programs must be extensively recoded), and often MongoDB offers very poor performance, relative to Postgres. We begin in Section 2 with MongoDB support for joins, then continue in Section 3 with the corresponding capabilities in Postgres. In Section 4 we show why MongoDB joins are brittle, and finally in Section 5 we consider join performance in both systems.  

Throughout this blog post, we will use the well-known example of employees and departments with the relational schema in Table 1. Here, employees have an ename, an age, a salary and are in a single department.  In turn, departments have a dname, a floor and a budget.  Note that there is a department (Candy) with no employees.

employee

ename

age

salary

dname

 

Bill

36

10000

Shoe

 

Sam

27

15000

Toy

 

Fred

29

12000

Shoe

 

department

dname

floor

budget

 

Shoe

1

1200

 

Toy

2

1400

 

Candy

1

900

Example Database Table 1

At some later time, management may decide that Bill can split his time between multiple departments. In this case the schema in Table 1 is no longer valid, and the data must be altered to that in Table 2. Notice that we must add a table called works_in with a dedication_pct field to indicate Bill’s time split between multiple departments.

employee

ename

age

salary

 

Bill

36

10000

 

Sam

27

15000

 

Fred

29

12000

 

department

dname

floor

budget

 

Shoe

1

1200

 

Toy

2

1400

 

Candy

1

900

 

works_in

ename

dname

dedication_pct

 

Bill

Toy

60

 

Bill

Shoe

40

 

Sam

Toy

100

 

Fred

Shoe

100

Revised Example Database Table 2 

 

JOINS in MongoDB

In MongoDB, there are two main ways to express a relationship, namely “embedded” and “reference.” Using the embedded approach one must decide which document is the “outer” and which is the “inner”. It is natural to make the favored document the outer one, and we might choose Employee for that role.

{
    "_id": "1",
    "ename": "Bill",
    "age": 36,
    "salary": 10000,
    "department": {
        "dname": "Shoe",
        "floor": 1,
        "budget": 1200
    }
}

In other words, the department information is stored embedded in each employee document.  Basically, this is an “inline” representation.  In document applications, this representation may make some sense, but in structured data, it has two major drawbacks. 

First, department information is repeated for each employee in the department.  Since Bill and Fred are both in the Shoe department, information will be replicated.  When Shoe information gets updated, say the budget is adjusted, all copies must be found and correctly updated.  If even one replica is omitted, then an inconsistent (corrupted) data base results.  Worse, this multi-record update operation is either non-atomic (by default in MongoDB); or requires MongoDB’s 4.0+ multi-document transactions, which have several limitations and incur a performance hit.  In the first case, a corrupted database is possible; in the second case poorer performance will occur.  Either situation is problematic. 

Second, there is no place to put Candy information, because there is no employee in this department right now. Because of these two limitations, we will not consider this representation in this blog post.

The alternative representation is “reference.” In this case, Bill’s record would be stored as:

{
    "_id": "1",
    "ename": "Bill",
    "age": 36,
    "salary": 10000,
    "department": "1001"
}

and there would be a separate department document to store the attributes of departments:

{
    "_id": "1001",
    "dname": "Shoe",
    "floor": 1,
    "budget": 1200
}

Effectively, this looks a lot like the relational representation in Table 1. However, the programmer must know that the department field refers to a document in the department collection.  In other words, there is no notion of foreign keys to assist the programmer in specifying the join.

Now suppose management decides that employees can split their time between multiple departments. There is no standard way to represent this situation, but typically an array of references would be used.  In order to record the information about the dedication_pct in a given department, we could transform the department field into an array (techically a collection) of objects, which contain both the reference to the department collection and the dedication_pct, as noted below for the employee Bill:

Collection department:

{
    "_id": "1001",
    "dname": "Shoe",
    "floor": 1,
    "budget": 1200
},
{
    "_id": "1002",
    "dname": "Toy",
    "floor": 2,
    "budget": 1400
}

Collection employee:

{
    "_id": "1",
    "ename": "Bill",
    "age": 36,
    "salary": 10000,
    "departments": [
      { "dept": "1002", "dedication_pct": 60 },
      { "dept": "1001", "dedication_pct": 40 }
    ]
}

When one wants to obtain joint (“related”) information, she may either do the join manually in the application (which is error-prone and puts the burden into the developer) or use the $lookup aggregation operation (which has its own limitations, for example not being able to $lookup from a sharded collection, or not supporting right or full outer joins).

JOINS in Postgres

In theory, one can use an embedded representation in Postgres, and the Postgres jsonb datatype allows this. However, we have very rarely seen this in practice because of the drawbacks noted above.  Instead, one typically uses the representation in Table 1, which corresponds to the “reference” case in MongoDB.  

Using standard SQL, one can find a list of all departments and the total salary of their employees following the data model of Table 1 as (click here for full source code for Postgres data and examples): 

select          dname, sum(salary)

from            employee as e

                        inner join

                        department as d

                        on e.department = d.dname

group by        dname

;

┌───────┬───────┐

│ dname │  sum │

├───────┼───────┤

│ Shoe  │ 22000 │

│ Toy   │ 15000 │

└───────┴───────┘

When employees can work in multiple departments, one would typically use the representation in Table 2. The query above can be readily converted to: 

select          dname, sum(salary * dedication_pct * 0.01)

from            employee

                        inner join

                        works_in

                        using (ename)

group by        dname

;

┌───────┬──────────┐

│ dname │   sum │

├───────┼──────────┤

│ Shoe  │ 16000.00 │

│ Toy   │ 21000.00 │

└───────┴──────────┘

Notice that one merely needs to add dedication_pct to the aggregate clause and replace employee by works_in changing the attribute in the join condition to move from the first join to the second.

JOINS are Brittle in MongoDB

Now reconsider the case where the semantics of the data obeys Table 1, i.e. the join between employees and departments is 1:N. To construct a document containing the total salary for each department, the code in MongoDB is: (see full source code here):

db.employee.aggregate([
  {
    $lookup: {
      from: "department",
      localField: "department",
      foreignField: "_id",
      as: "dept"
    }
  },
  {
    $unwind: "$dept"
  },
  {
    $group: {
      "_id": "$dept.dname",
      "salary": { "$sum": "$salary" },
    }
  }
]);

Result:

{ "_id" : "Shoe", "totalsalary" : 22000 }
{ "_id" : "Toy", "totalsalary" : 15000 }

Notice that the code is a great deal more complex than the Postgres code, because MongoDB doesn’t have relational join notions and is in a lower level language than SQL. Also, it requires the programmer to algorithmically construct a query plan for the join.  In this case, a combination of $unwind, $lookup and $group to extract the desired information. More ominously, when we move to the semantics of Table 2, (adding the dedication_pct field to allow employees to be in multiple departments) the MongoDB join code must be significantly rewritten to add two more aggregation stages to “unwind” the “works_in” implicit relationship:

db.employee.aggregate([
  {
    $unwind: "$departments"
  },
  {
    $project: {
      "_id": 0,
      "salary": 1,
      "department": "$departments.dept",
      "dedication_pct": "$departments.dedication_pct"
    }
  },
  {
    $lookup:
    {
      from: "department",
      localField: "department",
      foreignField: "_id",
      as: "dept"
    }
  },
  {
    $unwind: "$dept"
  },

  {
    $group:
    {
      _id: "$dept.dname",
      totalsalary: { $sum: { $multiply: [ "$salary", "$dedication_pct", 0.01 ] } }
    }
  }
]);

Result:
{ "_id" : "Shoe", "totalSalary" : 16000 }
{ "_id" : "Toy", "totalSalary" : 21000 }

Hence, on a change to the semantics of the join from 1:N to M:N, the application (or applications) must be significantly rewritten. In contrast, in Postgres, the queries remain nearly unaltered and simple.
All previous results are, however, a bit misleading, because the Candy department has no employees and does not appear in the join. Let’s assume that the user actually wants to see all three departments with their total salaries. In other words, she wishes to see the Candy department with a total salary of zero in addition to the other two departments. In Postgres, this requires a simple change to the query, namely adding a right outer join to the department table:

select          dname, sum(coalesce(salary * dedication_pct * 

 0.01, 0))

from            employee

                        inner join

                        works_in

                        using (ename)

                                right outer join

                                department

                                using (dname)

group by        dname

;

┌───────┬──────────┐

│ dname │   sum │

├───────┼──────────┤

│ Shoe  │ 16000.00 │

│ Toy   │ 21000.00 │

│ Candy │        0 │

└───────┴──────────┘

However, in MongoDB there is no support for right outer joins. Hence, you would need to add the “0” manually in your application. Of course, this is a burden to the developer and is cumbersome and error prone.

It is well known (and presented in every DBMS textbook) that RDBMSs present superior data independence relative to lower level solutions like MongoDB. Because databases last a very long time, and the semantics can be expected to change from time to time, superior data independence is a very desirable feature, present in Postgres but not in MongoDB. As such MongoDB joins are a brittle solution.

Performance Comparison of PostgreSQL vs. MongoDB

In this section, we report on the performance of the two queries in the previous section, namely to find the total salary of each department, with or without the departments with no employees.

All of the source code in this section appears here, where the interested reader can find a loader program that generates synthetic data in Postgres, and then using Postgres JSON functions to export the data in a format suitable for importing into Mongo. Then, we run the two queries over both data sets and compare execution times.

The benchmark has been tested on AWS, using an EC2 i3.xlarge instance (4 cores, 32 GB RAM), on a local NVMe disk formatted with XFS. Basic tuning was performed on the Postgres instance, and Mongo production best practices were followed. The benchmark was performed using 4000 departments and 20M employees, with a given employee working in between one and three departments. Data size was 6.1 GB in Postgres and 1.6GB in Mongo (using default compression). The total execution time is shown in the tables below.

(less is better)

Postgres (s)

Mongo (s)

MongoDB / Postgres

1:N case

9.97

1,162.91

129.61

1:N Performance Comparison between MongoDB and PostgreSQL Table 3

MongoDB is 130 times slower than Postgres because the only join tactic available is to iterate over employees, for each one performing a lookup in the department table. In contrast, Postgres can use this tactic (called iterative substitution) as well as merge join and hash join, and the Postgres query optimizer will pick the expected best strategy. MongoDB is constrained to a single strategy. Whenever (as is almost always the case) this single strategy is inferior, poor performance will result. 

Changing the MongoDB query execution strategy either involves restructuring the database (and thereby requires rewriting all of the queries) or implementing a query optimizer in the application (which is a huge amount of work).  

Things are also bad in the M:N case. Again, MongoDB has a single strategy, hardcoded into the application. In contrast Postgres can pick from all of the available options.  Table 4 shows the resulting performance comparison. 

(less is better)

Postgres (s)

Mongo (s)

MongoDB / Postgres

M:N case

54.09

2,706.19

50.03

M:N Performance Comparison between MongoDB and Postgres Table 4

 

In summary, relative to Postgres, joins in MongoDB:

  • Are brittle as noted in Section 4.  If anything changes over the lifetime of the database, then MongoDB requires significant recoding, while Postgres requires more modest changes.
  • Generally execute with poor performance as noted in Section 5.  MongoDB does not have a query optimizer and the execution strategy is hard-coded into the application. Whenever merge-sort or hash-join is the best choice, Mongo performance will suffer.

Of course, one should wonder “maybe I chose the wrong document as the “outer.” What would happen if want to change to a design that makes Department the outer document. We examine this situation in our next blog post. 

This blog was co-authored by Álvaro Hernández, founder of OnGres.

Read our white paper, Performance Comparison of MongoDB vs. PostgreSQL, to learn more. 

mstonebraker's picture

Dr. Stonebraker has been a pioneer of database research and technology for more than forty years. He was the main architect of the INGRES relational DBMS, and the object-relational DBMS, POSTGRES. These prototypes were developed at the University of California at Berkeley where Stonebrake ...