The Ultimate Guide to PostgreSQL Data Change Tracking (exaspark.medium.com)
137 points by thunderbong on Feb 24, 2024 | 25 comments


Great post! I really like the approach of adding additional context for WAL CDC. I imagine this could be implemented with something like Hasura pretty easily (you can access Hasura session data in the db).

I use both approaches in my SaaS: triggers + audit tables for user-facing activity feeds, and subscribing to WAL CDC changes for callback-like logic (e.g., sending a registration email, clearing a cache).
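
The trigger + audit table part looks roughly like this (a minimal sketch; the users table and column names are made up):

  -- One audit row per change, old and new values captured as JSONB
  CREATE TABLE users_audit (
    id         bigserial   PRIMARY KEY,
    operation  text        NOT NULL,               -- 'INSERT' | 'UPDATE' | 'DELETE'
    changed_at timestamptz NOT NULL DEFAULT now(),
    old_row    jsonb,
    new_row    jsonb
  );

  CREATE FUNCTION users_audit_fn() RETURNS trigger AS $$
  BEGIN
    INSERT INTO users_audit (operation, old_row, new_row)
    VALUES (
      TG_OP,
      CASE WHEN TG_OP IN ('UPDATE', 'DELETE') THEN to_jsonb(OLD) END,
      CASE WHEN TG_OP IN ('INSERT', 'UPDATE') THEN to_jsonb(NEW) END
    );
    RETURN NULL;  -- return value is ignored for AFTER triggers
  END;
  $$ LANGUAGE plpgsql;

  -- EXECUTE FUNCTION needs PG 11+; use EXECUTE PROCEDURE on older versions
  CREATE TRIGGER users_audit_trg
  AFTER INSERT OR UPDATE OR DELETE ON users
  FOR EACH ROW EXECUTE FUNCTION users_audit_fn();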

I'm not a fan of the pg_notify approach due to the requirement of adding triggers (performance penalty) and the ~8,000-byte payload limit (you will lose data, since anything larger gets spliced off).

Application-level tracking or callbacks make the stack dependent on the application for data integrity; I'd rather the database be the source of truth on all things data, especially in the age of microservices.

For something turn-key, Bemi looks like a really good option - especially if you need persistence of the changes!

For something very lightweight, check out WalEx (I'm the maintainer):

https://github.com/cpursley/walex


Has anyone ever implemented a schema where you define a view to return the latest instance of each resource, and it maps INSERT, UPDATE, and DELETE to an append-only table underneath?

I’ve done some of this with an event sourcing system, but not the query rewriting. All my reads use the “latest view”, but the app uses functions to write to the underlying schema rather than doing “simple” DML.

I wondered if keeping the storage as append only under the hood would keep the app layer simple.
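
Something like this sketch is what I have in mind (names illustrative; a real version would also need to handle concurrent writers racing on the version number):

  -- Append-only storage: every write is a new version row
  CREATE TABLE items_log (
    id         bigint      NOT NULL,
    version    bigint      NOT NULL,
    payload    jsonb,
    deleted    boolean     NOT NULL DEFAULT false,
    written_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (id, version)
  );

  -- The "latest view": newest version per id, minus delete tombstones
  CREATE VIEW items AS
  SELECT id, payload, written_at
  FROM (
    SELECT DISTINCT ON (id) *
    FROM items_log
    ORDER BY id, version DESC
  ) latest
  WHERE NOT deleted;

  -- INSTEAD OF trigger rewrites DML on the view into appends
  CREATE FUNCTION items_write() RETURNS trigger AS $$
  DECLARE
    next_version bigint;
  BEGIN
    IF TG_OP = 'DELETE' THEN
      SELECT COALESCE(max(version), 0) + 1 INTO next_version
      FROM items_log WHERE id = OLD.id;
      INSERT INTO items_log (id, version, deleted)
      VALUES (OLD.id, next_version, true);  -- tombstone row
      RETURN OLD;
    ELSE
      SELECT COALESCE(max(version), 0) + 1 INTO next_version
      FROM items_log WHERE id = NEW.id;
      INSERT INTO items_log (id, version, payload)
      VALUES (NEW.id, next_version, NEW.payload);
      RETURN NEW;
    END IF;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER items_dml
  INSTEAD OF INSERT OR UPDATE OR DELETE ON items
  FOR EACH ROW EXECUTE FUNCTION items_write();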


The end result is almost the same as an audit table except there's much more work to do at read time, and some queries simply cannot scale as they need to with this approach.

In general, for most apps you'd rather have two writes (one to the table and one to the audit table) than one write to the event log plus hard-to-scale reads.


Why would it keep the app layer simple? As in, insert-only? (Create = INSERT, Update = new INSERT, Delete = new empty INSERT with a deleted flag)


This article raises a very good question. Let's say I have a stock news app and want to issue an alert every time a news item is added to the database for AAPL. There could be 1,000 symbols. What is the best way to implement this alerting system in Postgres?


It depends on your reliability and performance constraints. I'd recommend just doing this via hooks in the application layer if it's a simple application. If you're building a production-scale application that requires reliability guarantees, you can do this with https://github.com/BemiHQ/bemi too (I'm one of the maintainers from the article). I saw you mentioned you're using Node.js; check out the Bemi GitHub, there are a few Node.js-compatible libraries too. Feel free to ping me any q's!


The WAL CDC approach: https://github.com/cpursley/walex?tab=readme-ov-file#publica...

CREATE PUBLICATION news_item FOR TABLE news WHERE (topic = 'AAPL');


Any suggestions for Node.js on this? This one seems specific to Elixir.


Node is not appropriate for this sort of workload. You can use WalEx and forward the events to your Node app via webhook: https://github.com/cpursley/walex?tab=readme-ov-file#webhook...


Assuming by “issue an alert” you mean calling an external service, I’d go with LISTEN/NOTIFY and use something like Graphile Worker to pick it up and process it. Or you could stay in Postgres and use an FDW with a trigger.
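
The notify side is roughly this, assuming the news table from upthread has id and topic columns (payloads over ~8 KB will error, so keep them small and have the worker re-read the row by id):

  CREATE FUNCTION notify_news() RETURNS trigger AS $$
  BEGIN
    -- Keep the payload tiny; the worker fetches the full row itself
    PERFORM pg_notify('news_inserted',
                      json_build_object('id', NEW.id, 'topic', NEW.topic)::text);
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER news_notify
  AFTER INSERT ON news
  FOR EACH ROW EXECUTE FUNCTION notify_news();

  -- The worker process just runs: LISTEN news_inserted;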


I would avoid using database triggers for business logic.

In the app (Django or whatever), after inserting the news item, call a service (async) that matches users and sends alerts.


Agreed. Triggers always feel really clever and powerful, but they live outside your regular application code and are therefore more or less invisible. This will eventually break the law of least surprise.

On top of that, triggers are hard to test.


Not if your regular application code is in the database


I generally use triggers for these kinds of things.

I’d create a message queue table and have triggers insert jobs into it according to changes in other tables. That makes it easier to handle the message queue asynchronously.
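
A rough sketch of what I mean, reusing the news example from upthread (names illustrative):

  -- Job queue table fed by a trigger
  CREATE TABLE job_queue (
    id         bigserial   PRIMARY KEY,
    job_type   text        NOT NULL,
    payload    jsonb       NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    done_at    timestamptz
  );

  CREATE FUNCTION enqueue_alert() RETURNS trigger AS $$
  BEGIN
    -- Enqueued in the same transaction as the insert: both commit or neither does
    INSERT INTO job_queue (job_type, payload)
    VALUES ('send_alert', to_jsonb(NEW));
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER news_enqueue_alert
  AFTER INSERT ON news
  FOR EACH ROW EXECUTE FUNCTION enqueue_alert();

  -- Workers claim jobs with: SELECT * FROM job_queue WHERE done_at IS NULL
  --   ORDER BY id FOR UPDATE SKIP LOCKED LIMIT 1;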

Doing it on the application side in a reliable way needs some kind of distributed transaction mechanism, which might not be feasible. I don’t want to fire an insert and then fail to send an alert and find myself within an if statement thinking “what the fuck do I do now?”

You could insert the job from the application code too. I just like triggers and abuse them, which might not be a good idea in all cases.

Depends on the requirements though.


I'm using Django and Celery, so the tasks (e.g. sending notifications) are scheduled on DB transaction commit.

The business logic is clear to read, easy to change.

I've enjoyed writing complex SQL, but when it comes to app development I want my engineers to just write Python. Triggers are mostly invisible in the codebase, and I would end up having to do all the maintenance myself.

There is of course more than one way to skin a cat, but I want our codebase to have just one single way.


Another technique not discussed is to create an md5 hash of each record and keep track of added and deleted hashes. The hash would exist as an additional column on each table being tracked.


If I understand you right, this seems like a very coarse-grained way to track changes. You can record that a change was made, but not the specific change. It seems like it'd help facilitate something like, say, an etag, but I don't think you could get any auditable data using this alone, could you? Keeping track of added/deleted hashes would probably be best handled by a trigger, unless you really want that in your app code, so this seems a lot like an audit trigger with very little audit data. Have I misunderstood?


I'm not sure I understand what you mean. Where are the hashes tracked?


Here's the article I learned this from: https://www.mssqltips.com/sqlservertip/2543/using-hashbytes-...

You'd add a hash column to the table you want to track. This would be used in a data warehouse where you want to track what has changed after a truncate and load. You'd keep additional tables on the side to track added and deleted hashes for a delta copy to downstream application tables.
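
In Postgres terms, the delta step might look like this sketch (the staging and snapshot tables, and the column list, are hypothetical):

  -- Hash each row as it lands in staging
  INSERT INTO current_hashes (id, row_hash)
  SELECT id, md5(row(sku, name, price)::text)
  FROM staging_products;

  -- Added or changed rows: hashes not in the previous snapshot
  SELECT c.id FROM current_hashes c
  LEFT JOIN previous_hashes p USING (row_hash)
  WHERE p.row_hash IS NULL;

  -- Deleted rows: hashes that disappeared
  SELECT p.id FROM previous_hashes p
  LEFT JOIN current_hashes c USING (row_hash)
  WHERE c.row_hash IS NULL;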


For tables that don't need to scale infinitely, and as a trade-off for simplicity and keeping logic in the application, I use a version column.

It's the "brute force" type approach. It doesn't scale, has performance penalties, but it's incredibly easy to use and understand, and time walking is trivial.

Relies heavily on DISTINCT ON.
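
For example, with every save inserting a new row with a bumped version (names illustrative):

  -- Latest state of every document
  SELECT DISTINCT ON (document_id) *
  FROM documents
  ORDER BY document_id, version DESC;

  -- Time walking: state as of a given moment
  SELECT DISTINCT ON (document_id) *
  FROM documents
  WHERE saved_at <= '2024-01-01'
  ORDER BY document_id, version DESC;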

On truly old tables I've added "maintenance tasks" that push old rows into an archive/audit table.

Edit/note - I realise if you suggest this at any of the big players, you'll probably be fired on the spot.


Hey I'm one of the maintainers of https://github.com/BemiHQ/bemi that was recommended in the article. Please feel free to ask me any questions!


Is the write-ahead log something new in PostgreSQL? I recall SQL Server being able to do log shipping since the '90s.


The WAL itself isn't new, but the capabilities to use it externally in a robust manner have improved in recent years via Logical Replication.


Logical decoding was added in PG 9.4, released roughly nine years ago. The WAL itself was added in 7.1, back in 2001.


There are also other ways, such as bitemporal tables.

- https://aiven.io/blog/two-dimensional-time-with-bitemporal-d...

- https://github.com/scalegenius/pg_bitemporal

4 timestamps and some ugly queries.
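
The basic shape, for the curious (a sketch; table and columns illustrative):

  CREATE TABLE prices (
    item_id       bigint      NOT NULL,
    price         numeric     NOT NULL,
    valid_from    timestamptz NOT NULL,               -- when the fact was true
    valid_to      timestamptz NOT NULL DEFAULT 'infinity',
    asserted_from timestamptz NOT NULL DEFAULT now(), -- when the db learned it
    asserted_to   timestamptz NOT NULL DEFAULT 'infinity'
  );

  -- "What did we believe on March 1 the price was on Jan 1?"
  SELECT price FROM prices
  WHERE item_id = 42
    AND valid_from    <= '2024-01-01' AND valid_to    > '2024-01-01'
    AND asserted_from <= '2024-03-01' AND asserted_to > '2024-03-01';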



