Documenting your PostgreSQL database

SugarfreeSA · on July 27, 2015

I believe that documenting anything with comments (be it code or a PostgreSQL database) should only be done for more complex pieces of code or to elaborate on something that could be misinterpreted.

Misguiding or incorrect comments caused by "Comment Rot" will actually cause more harm than having no comments at all. In fact, your code should be self explanatory (if you do it correctly) in terms of what it is doing. Adding extra comments to elaborate on "why" something is being done may be more useful - i.e. When the semantics of the code are not as easily deduced from the syntax.

For more information on "Comment Rot": http://blogs.msdn.com/b/ericlippert/archive/2004/05/04/12589...

However, I do find documenting database tables interesting and will consider doing it in the future under very specific circumstances. Thanks for sharing!

eitally · on July 27, 2015

I've worked intensively on a schema (as an end user developing reports against it) where several columns in one of the tables were flex fields whose content/use changed dependent on a different column's value. It was always a bear to train new developers on it and it wasn't documented anywhere (until I created a flat file documenting it). In a situation like that I'd have loved inline tabular comments.

Similarly, if a table or a field has become "legacy", that is, no longer actively used but kept around because there's old data in it somewhere, that should be commented, too. Or, if a column has it's datatype changed for some reason while there is data in the table, that should probably be noted, too (e.g. char --> varchar, int-->string, etc).

Schema versioning is a huge problem at many companies, especially non-tech companies where the developers are completely at the mercy of their "business stakeholders", and since full-fledged data dictionaries don't exist for almost anything, using simple comments like this could be a boon. I dunno, ymmv, but I'd have appreciated it.

SugarfreeSA · on July 27, 2015

Yes, I definitely see value of comments in a situation like flex fields because that's something that may not be straight forward or clearly apparent!

Legacy tables or fields also make sense because the code will remain the same and thus lead to no comment rot.

As long as the comments stay relevant and cause no extra confusion, then it is useful! I suppose it is also situation dependant as well as the preference of the development team at the time.

cmdkeen · on July 27, 2015

I agree that normally code should be clear and obvious and require little in the way of comments. The difference is that tables aren't code.

I work with a large Oracle DB which has a mixed attitude towards column comments. You don't have many characters to semantically name a column. If you're using table per hierarchy, or denormalised conventions then it can be entirely unclear which columns are supposed to be used for each row types.

Comment rot is a useful thing when it highlights that something is wrong - e.g. a new value in a table commented as accepting "Y" or "N", which prompts you to go an examine code elsewhere to find out what is going on.

My current thoughts are that all table and columns should have at least some form of comment.

NDizzle · on July 27, 2015

I add my comments to the commit message that contain the schema changes. If you seriously add a comment like 'required first name of user' you are wasting everyone's time.

jsnell · on July 27, 2015

Especially when at least right now one of the things being documented in the example as "first name of the user" is the last_name field!

  last_name  | character varying(50)       | ... | required first name of user

That shows two documentation antipatterns, actually. Documentation that conflicts with the code, and documentation so generic that it gets carelessly filled in with copy-paste.

cmdkeen · on July 27, 2015

I work with Oracle these days, or the ... might contain it - but is it not also missing a NOT NULL constraint?

Documentation that conflicts with the DDL makes for an easy code review at least. I look at it more like an interface specification, especially as there are several ways to write DDL - though there should really be a house style. For instance the NOT NULL constraint might be being added later on the script, so it isn't necessarily readable in the same way as normal code.

otherusername2 · on July 27, 2015

Yes, that would be a waste of time. But in the past I've gotten in the habit of commenting things in the database for larger projects, and have written a bunch of tools to extract those comments and generate database documentation from it. For large databases (table-wise, not data-wise), it's been absolutely invaluable to be able to just take the DB docs and reason about how stuff works in there. "Oh, this table is fed by a trigger from table X!", "Ah, the `system` table contains non-tenant specific global system settings". Stuff like that.

Yeah, in a perfect world everything in a database would be obvious from its name, but this is not a perfect world and naming things properly is extremely hard.

e12e · on July 27, 2015

> I add my comments to the commit message that contain the schema changes.

You're adding the information in "git diff" for the .sql-files for the commit to the comment?

NDizzle · on July 27, 2015

I have a repo for SQL changes that include synchronization files (.sql) for all of the changes I made. The diff does include a summary of the changes.

Unfortunately, it does not include any information about what the last_name field does.

hartror · on July 27, 2015

> You comment your code, why shouldn’t you comment your database?

I don't as a general rule, the code itself expresses what it is, comments explain why it is. The examples here suffer the common problem of explaining a language/tool feature with naive examples. For example the columns in the users table are self explanatory apart from the HSTORE "data" column, which itself is likely a crime in its own right.

The exception I would make is with complex queries, SQL isn't the most transparent language especially the more advanced/vendor specific techniques.

craigkerstiens · on July 27, 2015

Author of the original post here. Any time you come in new to a system and just assume the columns are self explanatory you often make a wrong assumption somewhere along the way. Being explicit is not much more effort to make it safer when analyzing or working with the data.

As for a 'data' column name for hstore. This is something I've often talked about, hstore is a key value store. The big value of it is adding and removing things fluidly over time, for that reason data can be a good a column name as any in certain cases.

sbuttgereit · on July 27, 2015

Absolutely could not agree more. The thing that even clearly named schemas, tables, columns, etc. miss are context and intent. If you have that context already you probably don't need much by way of documentation. But as someone that gets called into clients to work through their databases, I spend a lot of time trying to figure that context and intent. If it's documented, that learning curve is reduced greatly.

I usually do these comments on all objects regardless of how obvious the names and code may be. If even only for the reason of ensuring that doing this discipline remains second nature. I also like to remember that I'm not doing this for me, but rather for the client's team and others that come after me to work on things.

cmdkeen · on July 27, 2015

This. Database impact analysis is painful enough without a developer assuming 5 years ago that their table was so obvious it didn't need a comment.

ahachete · on July 27, 2015

It's always nice to have a post explaining that you can comment in SQL. Most people are not aware of this, and it definitely helps when reading a DDL.

I have to agree that sometimes there are superfluous comments that harm more than help, since they add nothing to clarify the purpose of the column but consume your time. For example, in your post, the comment "required first name of the user" for an attribute called users.first_name adds no extra information. I'd also avoid using "required" as part of the comment, as that's implied in the NOT NULL constraint. However, other comments are very valuable, like the one for "created_at" or "data".

Regarding having a last, nullable "data" hstore or jsonb column, I think it's a very good practice. A nullable column is almost always free (takes no extra disk space) and allows for storing information without a prior, clearly defined, or variable schema, without having to be ALTERing tables frequently, which usually is a challenge on its own.

narrator · on July 27, 2015

Anyone had any good experiences with tools to graphically document PostgreSQL with E/R Diagrams, etc.?

GregQuinn · on July 27, 2015

Nearly all the standard data modelling tools(ERwin,System Architect, Enterprise Architect ...) have a facility for reverse engineering a schema. This produces an ER diagram for the schema. It's worth pointing out however that this diagram gives you structure not semantics. The latter has to come from elsewhere.

ahachete · on July 27, 2015

Not an E/R diagram, but to generate table diagrams that you can navigate (in HTML), SchemaSpy (http://schemaspy.sourceforge.net/) is really good.

franch · on July 27, 2015

I use DBSchema for ER diagrams, works well. http://www.dbschema.com/screenshots.html

francisb07 · on July 27, 2015

We use lucidchart and balsamiq at times to model ERDs using the crow's foot style. There are certain gems for Ruby on Rails dev that generate ERDs from the schema,.

mmerickel · on July 27, 2015

LucidCharts has a decent import feature for a PostgreSQL schema.

teddyh · on July 27, 2015

I always did this, but maybe I’m unusual? I started with SQL by having a real project to do and by reading the PostgreSQL manual – I made extensive use of keys, triggers, functions, etc. Later I took some university classes in database design, which was interesting as the knowledge about normal forms, etc. complemented my existing experience.

I guess that if you instead started by using something like MySQL or SQLite then you would have a wholly different view of what an SQL database is for, and would have a hard time viewing SQL as something other than a data container with optional sorting.

mianos · on July 27, 2015

Not to harp on about it, but, if you use python, an ORM, like sqlalchemy can make documenting a schema much more manageable. Although it is saying the same thing, it is mainly because the relationships and derivations are more compact, in-line and not spread all over the place. I have taken to making a new git project with the just the schema and making it a dependency of the project.

fibo · on July 27, 2015

Ahah, probably if Postgres were written in python it would self document its data models

mianos · on July 27, 2015

It would not be real quick though :) SQLA, Python and Postgres the best of both worlds, and another world.

helge5 · on July 27, 2015

If it is worth commenting on, you might actually want to turn your SELECT into a VIEW:

CREATE VIEW "Names of projects with open tasks grouped by email" AS SELECT users.email, array_to_string( ...; COMMENT ON VIEW "Names of projects with open tasks grouped by email" IS 'More info on query ...';

j_s · on July 27, 2015

It is reasonable to require some info about each major stored procedure change as a comment at the top (when, who, why - something standardized).

It is nice to have this on-hand, and it should also be in source control check-in comments / project management work items or tickets too... this can be picked up auto-magically.

ExpiredLink · on July 27, 2015

You'd use a data(base) modeling tool which creates the DDL script - including comments - for you.

brightball · on July 27, 2015

Great feature not taken advantage of often enough.

sinzone · on July 27, 2015

PostgreSQL is loved by the developer community and is growing in adoption very fast. When we decided to open-source KONG [1], we went with the only supported db at that time which was Cassandra. Mainly we use Cassandra to handle rate-limiting across multi region and because it's also easy to scale horizontally. But even though is battle tested at FB or Netflix, we were surprised to see how the community immediately pushed [2] for supporting PostgreSQL as well.

It's definitely a good fit, especially for the powerful features around JSONB datatypes.

[1] https://github.com/Mashape/kong

[2] https://github.com/Mashape/kong/issues/331