The One and the Many

PostgreSQL VACUUM taking an access exclusive lock

Recently at work we had an incident that turned out to be caused by a VACUUM command running against one of our databases. I thought this would be an interesting subject for a post because I didn't know that VACUUMs could do this, and it is useful to think about how to deal with this behaviour.

We regularly run VACUUMs against our databases using flexible-freeze. This calls VACUUM FREEZE ANALYZE against tables needing maintenance. During the incident, this command took an ACCESS EXCLUSIVE lock against a table we use during API requests for approximately four minutes. This blocked many requests and additionally caused us to exhaust available Postgres connections because we had queries stuck for the duration.

The fact that a VACUUM other than VACUUM FULL could take an ACCESS EXCLUSIVE lock was a surprise to me. However, the Postgres docs say this can happen when the TRUNCATE option is enabled (which it is by default). This is so that the VACUUM command can return disk space to the OS by truncating the end of relations. This is done in the function lazytruncateheap().

Needing to take a strong lock for this purpose makes sense, but issues can arise if a strong lock like this is held for longer than very short periods, which is what happened here.

A question is why VACUUM would take such disruptive action and not abort when queries were waiting on it. The problem mainly occurs when hot standbys are involved. VACUUM takes the lock on the primary database and the lock is then replicated to standby servers as described in the code. The VACUUM running on the primary is not aware of the queries on the standbys waiting on the lock and continues happily along. This is a known issue with VACUUM and hot standbys, and lead to the addition of the above TRUNCATE option to disable the behaviour which was discussed in this thread resulting in this commit.

Now that we know it can happen, we can figure out how to deal with it in the future.

There are a few ways we can handle it:

  1. We could tell VACUUM to not do this. This can be done by running it with the TRUNCATE option disabled or by disabling vacuum_truncate on a per table basis. However, this means we'd never return disk space to the OS, which could lead to different issues. Potentially disabling it could be combined with separate VACUUM runs with truncating enabled that would be run with more care, such as with additional timeouts, or by periodically rewriting the table if needed.
  2. We could design tables such that fewer rows need to be deleted. The table in question is one we delete from regularly. To achieve this, one option is to drop the table rather than delete from it, such as via expiring old rows using partitioning. Not deleting as many rows means there is a reduced likelihood VACUUM would need to take the action involving the strong lock, though the risk would still be present.
  3. We could spread out DELETEs over time. The table in question had a large delete happen all at one time, leaving many dead rows. If the DELETEs instead happened over a period of time, freeing disk space to the OS may not have been needed.
  4. We could run the VACUUMs with statement_timeout set to a short duration. This means that we'd still allow the maintenance to take place, but we'd abort the queries if they take too long. The problem is that it is usually fine and desirable to let VACUUMs run for some time. It is only an issue if they take a strong lock. It would be ideal if we could specify a timeout only in the case where a strong lock is involved, but that's not currently possible. As is, this is likely not a good option.
  5. We could monitor for and abort any VACUUM queries that take a strong lock. This would let VACUUMs continue to run over time but stop them "immediately" when they take a strong lock. The idea is that in general these operations are fine to run, and even if they take a strong lock, they need to complete their task holding the strong lock quickly to be acceptable. My thinking is to run something like this every few seconds. It remains to be seen whether this is actually a good idea though. My assumption is that the VACUUMs would still be able to return disk space to the OS, even if it may take a few tries. If they are never able to do that, this is a hacky version of option one.
  6. We could use a statement_timeout on all online queries running in our application. This way even if a table gets locked like this, the system won't effectively go offline.
  7. Enabling the old_snapshot_threshold setting has the side effect of disabling this behaviour ("freed space at the end of a relation cannot be released to the operating system"). This is effectively the same as option one, though may be desirable for other reasons.

Option two, three, and six are good practices regardless, so these should be considered as part of the solution.

Initially while writing this post I believed option five was also necessary and the way to go. The other options do not guarantee VACUUM won't lock tables, merely reduce the chance (two and three) or make the application behave better if it happens (six). This means that there is still a risk that a VACUUM would lock tables and impact the application.

However, I now think option one or seven (disabling the truncation behaviour) is a surer and less hacky approach. This does mean that tables may grow in size, but that can be dealt with by rewriting the table if necessary, such as with pg_repack. What swayed me was the fact that the TRUNCATE option exists in the first place, as well as old_snapshot_threshold disabling the behaviour anyway.