The “SET NOT NULL” Downtime Trap in PostgreSQL

It’s necessary to validate existing data when trying to set a column to “not null.” The validation requires an “Access Exclusive” lock which blocks all access to the table, i.e. not only prevents insert/update/delete but also blocks select.

The duration of the access exclusive lock depends on the table size. In order to support such schema migrations without extended downtime, PostgreSQL version 12 introduced an optimization: when altering a column to “not null” the table scan validation step can be skipped if there exists a valid constraint can prove the column is indeed “not null” at runtime.
Here is a process to set a column to “not null” without downtime:

  1. create a check constraint but don’t initiate validation (no table scan if NOT VALID is specified):
    • ALTER TABLE users ADD CONSTRAINT email_not_null_chk CHECK (email IS NOT NULL) NOT VALID;
  2. validate the “not null” constraint:
    • ALTER TABLE users VALIDATE CONSTRAINT email_not_null_chk;
  3. set column “not null” when the constraint is validated successfully
    • ALTER TABLE users ALTER COLUMN email SET NOT NULL;

This process doesn’t require a downtime because “validate constraint” (2nd step) requires a much less restrictive lock SHARE UPDATE EXCLUSIVE. The lock is compatible with ROW EXCLUSIVE, required by insert/update/delete, so normal database operations are not blocked by the check constraint validation.

Now the column is “NOT NULL” so the constraint email_not_null_chk should be dropped.

DO NOT ALTER TABLE users ALTER COLUMN email SET NOT NULL, DROP CONSTRAINT IF EXISTS email_not_null_chk;!!!
It’s not unreasonable to assume the drop will occur when schema update is done. However PostgreSQL performs DROP CONSTRAINT before SET NOT NULL therefore the optimization introduced in version 12 will not happen because by the time the execution flow reaches SET NOT NULL the constraint is already gone. A full table scan will start to confirm no records violate the not null column constraint. Suddenly a seemingly harmless milliseconds metadata only schema update blocks every access to the table.

Official manual can’t document every scenario but this can be confirmed in source code:

  1. AlterTable entry point that calls ATController to manage execution flow
  2. after phase 1 (which validated user SQL statement and constructed pending tasks, i.e. wqueue), phase 2 iterates from 0 to 11 to execute tasks in that order, we can see drop is the first thing executed if the alter statement included one.
  3. ATExecCmd executes the prepared command and ATExecDropConstraint drops the constraint that was prepared in ATPrepCmd during phase 1

Now it’s clear the alter statements should be split:

  1. ALTER TABLE users ALTER COLUMN email SET NOT NULL;
  2. ALTER TABLE users DROP CONSTRAINT IF EXISTS email_not_null_chk;

If PostgreSQL logging level is set to debug1, there will be an entry to confirm the optimization actually occurred:
-. existing constraints on column "email" are sufficient to prove that it does not contain nulls

Most importantly: TEST EVERYTHING!

Untitled

References

optimization
set not null lock
validate constraint lock
lock compatability

Leave a Reply