Building a CDC for AWS Aurora Postgres
In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data.
Lately I’ve been working on a CDC (Change Data Capture) system, that tracks changes from Postgres-based databases and sends them to another destination.
This is a very common scenario for a CDC — syncing data between databases. Thus my research started with checking what other, including Debezium, use to capture data changes from Postgres, which resulted with WAL.
Write Ahead Logs (aka WAL), are the transaction logs of Postgres. Briefly, data changes are written only after they have been logged. This means that by following the stream of data in the WAL, you know about any data change in your tables (except truncate) and can take action. Due to some business and runtime restrictions, I chose to write the WAL listener on my own in Go.
However, when dealing with AWS managed databases I dealt with gaps in the available materials, and I hope to bridge these gaps in this post.
Let’s start with describing the planned solution: Use the pub-sub model, the Postgres would publish data changes to one or more subscribers.
The reasons for choosing the solution are:
- Monitoring data changes based on WAL is more reliable
- It is fairly easy to get started
Creating a Postgres change listener
The instructions of starting to follow WAL are simple:
- Enable Logical replication on your Postgres
- Create a replication slot
- Optionally add wal2json to receive new data as JSon messages
- Connect to the database with the url option replication=database
- Start consuming your data
Since I am writing a CDC in Go I wrote a WAL consumer similar to pg_listener (I used newer versions of the Postgres connection libraries).
At this point I was pretty happy with myself, I was able to quickly create an application that listens to Postgres changes and I could now attend to the data at its destination.
However, AWS RDS
A CDC is written to transmit data changes from a source to a destination, so it is not developed in a vacuum, the source and destination systems have their own architecture that I had to adapt to.
In some of the setups the source Postgres can be either:
- AWS RDS Postgres
- AWS RDS Aurora
AWS Relational Database Service (RDS) is a managed relational database service, one of the optional database engines to choose from is Postgres. It provides resizable capacity while automating time-consuming administration tasks.
AWS RDS Aurora is a MySQL and Postgres compatible relational database built for the cloud, it gives services such as scale, fault tolerance and more. Amazon Aurora is up to five times faster than standard MySQL databases and three times faster than standard PostgreSQL databases.
Reading many blog posts, discussions and guides I could understand that my code should not change, but I wasn’t quite sure how to make the RDS databases replicated data be available to the listener.
I found a blog post in the AWS website, showing how the technique I described in the beginning of this post can be used.
While looking for materials about RDS Aurora Postgres and its logical replication I came up with few results, so I had to rely entirely on the following guide.
The first 2 steps are:
- Enable logical replication
- Create a publication for multiple tables (one, list or all tables are supported)
However, the next step was not clear. The guide stated:
For this example, table data is replicated from an Aurora PostgreSQL database as the publisher to a PostgreSQL database as the subscriber. Note that a subscriber database can be an RDS PostgreSQL database or an Aurora PostgreSQL database. A subscriber can also be an application that uses PosgreSQL logical replication. After the logical replication mechanism is set up, changes on the publisher are continually sent to the subscriber as they occur.
From this I understood that I do not need to create a Postgres subscriber. However, since creating the subscriber also results in a replication slot, I still had to do the following on my own:
- Create a replication slot
- Add Wal2Json (because the code was already based on JSon messages)
The following command solves both problems, run it in your favourite Postgres client:
SELECT * FROM pg_create_logical_replication_slot(<slot name>, 'wal2json')
Yes, both RDS Postgres and RDS Aurora permit you to start using wal2json immediately.
- Publications and Subscriptions were introduced in Postgres 10, and are available for Aurora as of version 2.2.0
- At this point I had to change my original Go code because in that code the listener specified the list of tables to listen to
At this point I started the Go application, and data started to flow. Success!
Though I had some trouble getting started with AWS RDS, IMHO, basing the CDC on WAL remains a more reliable solution than writing my own system to detect data changes in the database.
I do recommend spending time reading about:
- Logical Replication
- WAL2JSON if you are planning to use it (and I do recommend using it)
- Postgres publications and subscriptions
Who I am
My name is Sigal Shaharabani — I am a data engineering technical leader in Tikal.