The One and the Many

Changing owners in PostgreSQL

I recently needed to mass change ownership in a Postgres database. There were two users, $old1 and $old2, who owned objects in the database. I needed to make everything in the database owned by a single user. This sounds like it should be simple, and in theory it is. However there were complications.

The approach I settled on was changing ownership of everything to the user $new. This had the additional benefit that I could pick a more suitable name as the old usernames were not ideal.

I needed to run code that did something like this:

REASSIGN OWNED

Postgres has a command to do exactly that: REASSIGN OWNED. In theory, I could simply say REASSIGN OWNED BY $user TO $new for both $old1 and $old2 and I'd be done.

Unfortunately it was not as easy as that.

Problem: "ERROR: tuple concurrently updated"

The first problem was the project's tests. As part of its setup phase, a test creates a clean database. This involves running each database migration. The tests run concurrently, so they ran REASSIGN OWNED concurrently.

Postgres normally acquires appropriate locks for each statement. However, this is not currently the case for system catalogs. As there was no lock, tests raced to change the same system catalog rows, and Postgres complained. Note that while tests create their own databases, they share the same database cluster. System catalogs are cluster wide rather than per database.

Since the problem was that we were not locking, I tried taking a lock manually at the start of my migration: LOCK pg_class. This resolved the errors, but caused the time to run the tests to blow up.

Next I wondered if I could catch and ignore these errors. In theory the race did not matter as it was a race to set identical values. However, this was not feasible because the errors could occur when running any command that touches these system catalog tables and not just inside my migration. For example, creating a database or a table could encounter it. Regardless, it may not be safe to assume ignoring it would leave us in an expected state.

I resolved this problem by updating the base schema used by the tests. This meant the tests would never run the migration. They would start out with ownership in the correct state.

Problem: Deadlocks

Once I had the migration working in the tests, I tested it with other actions occurring in the system at the same time. This revealed another problem:

process 9157 detected deadlock while waiting for AccessExclusiveLock on
relation 1242620572 of database 16401 after 1000.108 ms.

REASSIGN OWNED acquires ACCESS EXCLUSIVE locks on all objects it needs to change ownership on. Since the old users owned a large number of objects, this meant REASSIGN OWNED had to acquire locks on many objects at the same time.

In theory I could have the command retry until it succeeded. However, Postgres's deadlock resolution kills one of the queries causing the lock. I didn't want to risk killing important queries.

I wondered whether I could synchronize access. The query that interfered with the migration was a PL/pgSQL function, so I thought I could update its caller to take an advisory lock before running the function. However, I realized many transactions in the system could trigger this situation. There was no way it would be feasible to add enough advisory locks.

I decided REASSIGN OWNED was unworkable for my purposes. I needed to do what it did, but take as few locks as possible at any one time.

ALTER ... OWNER TO ...

Each type of object (with some exceptions) has a command that you can run to change its ownership. For tables, it's ALTER TABLE $table OWNER TO $new.

I could do something like this:

This would be many more statements than REASSIGN OWNED, but it seemed doable.

In order to write this, I needed to figure out:

  1. What types do I need to change?
  2. How do I list objects of each type?
  3. How do I change ownership of objects of each type?

1. What types do I need to change?

Running DROP ROLE $role causes Postgres to raise an error telling you what objects, including their type, the role owns. For example:

mydb=# DROP ROLE myrole;
ERROR:  role "myrole" cannot be dropped because some objects depend on it
DETAIL:  owner of sequence mytable_id_seq
owner of table mytable
owner of sequence mytable_id_seq

I repeatedly ran this command for each of the old users until I had accounted for every type in my database. This meant I only worried about types in use in my database rather than trying to handle every possible type.

2. How do I list objects of each type?

Postgres provides two ways to look up this kind of information: The Information Schema and the System Catalogs.

The system catalogs are the source of truth. The information schema is a set of views into the system catalogs and is often easier to use. However, the information schema does not provide a view for every object.

I first tried to query for what I needed using the information schema. If I could not get the information that way, I went to the system catalogs and got it there. Eventually I found how to query for each type of object.

3. How do I change ownership of objects of each type?

In general you can look up the ALTER command for a particular type and it has a corresponding ALTER ... OWNER TO ... form. Unfortunately, this is not the case for every type of object. For example, there is no ALTER EXTENSION ... OWNER TO ... form.

For cases without an appropriate ALTER command, I had to UPDATE the system catalogs directly. This is a little hair raising, but the system catalog documentation is quite good.

The result

I ended up with a complicated migration, but it did the job. Core to the migration was avoiding using a transaction if it was not required. By avoiding transactions, each statement takes the minimum number of locks. The only explicit transactions I used were for objects that required touching the system catalogs as I found I needed to touch at least two catalog tables for each of those.

Problem: The lock queue

Unfortunately, testing revealed there was still a problem with locks.

Remember this migration had to run on an active system. The system routinely issues queries such as SELECTs. Normally SELECT queries can run along side each other as they take ACCESS SHARE locks which do not conflict. However, my migration generally had to take an ACCESS EXCLUSIVE lock on each object, and this lock conflicts with every other lock. Once such a lock enters the queue for an object, other new queries must wait. Consider this situation:

This could easily block crucial time sensitive actions in the system.

Limiting how long we wait on a lock and retrying

I wanted to block other queries as little as possible. It didn't matter if my migration took a little longer to run.

Instead of waiting to acquire a lock for an unbounded amount of time, I wanted to time out. You can do this for every query coming from your session using a setting:

SET lock_timeout TO 50

This meant my migration's commands would block other commands for at most 50 milliseconds. If a command waits 50 milliseconds, it aborts with an error: ERROR: canceling statement due to lock timeout.

I updated my migration to retry any command that saw this error. It retried until it succeeded.

Success

In the end the migration ran smoothly. It did have to retry several times after hitting the lock timeout, but that was fine.

This turned into more of a project than I expected when I started. I think it shows a lot of important aspects about using PostgreSQL though. It also shows that things that look simple are often not.

Tags: PostgreSQL, SQL, programming