The One and the Many

Collations in PostgreSQL

At work, we recently upgraded to PostgreSQL 17. One of the interesting additions in this version is a built-in collation provider. I investigated what this means and thought I would write down what I learned.

Collations

What is a collation in Postgres? Basically, it is something that defines how sorting text works, such as where non-ASCII characters appear in the ordering. It also controls certain other operations on text, such as comparisons and whether functions like lower() understand non-ASCII characters. As is often the case, the Postgres docs cover the topic well.
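For example, the result of a comparison can differ depending on the collation (both of the collations used here are covered later in this post):

-- Under the C collation, 'à' sorts after 'b' because comparison is
-- by code point (U+00E0 > U+0062).
SELECT 'à' < 'b' COLLATE "C";      -- false

-- Under the natural language unicode collation, 'à' sorts between
-- 'a' and 'b'.
SELECT 'à' < 'b' COLLATE unicode;  -- true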

There are many collations and they can come from different providers.

Collation providers

Prior to Postgres 17 there were two collation providers: libc and icu. Both depend on external libraries, typically glibc and libicu. Using collations from these providers means that if you upgrade your glibc or libicu version, behaviour can change, and things like indexes can become corrupt: an index built with the old version may have an ordering that no longer matches what the new version computes.

The potential for silent corruption is reduced in recent versions of Postgres, which emit warnings if the underlying glibc or libicu version changes compared to the version recorded. These warnings are more reliable and comprehensive for the icu provider, as glibc changes may not always be detectable (see the notes in the ALTER COLLATION docs). Postgres also emits warnings in more situations for the icu provider (see this article).
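You can see the versions Postgres has recorded and, once you have dealt with any affected indexes, tell it to record the current ones. A sketch (my_db is a placeholder name):

-- Show the library versions recorded for each collation.
SELECT collname, collprovider, collversion
FROM pg_collation
WHERE collversion IS NOT NULL;

-- After checking or rebuilding affected indexes, clear the warning
-- by recording the current version.
ALTER COLLATION "en-x-icu" REFRESH VERSION;
ALTER DATABASE my_db REFRESH COLLATION VERSION;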

Typical collations

Databases have a collation set at creation time and you can't change this after the fact. This means selecting an appropriate collation at that time is important.
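You can check what your existing databases use (the datlocprovider column requires Postgres 15+):

-- datlocprovider is 'c' (libc), 'i' (icu) or, in Postgres 17+,
-- 'b' (builtin).
SELECT datname, datcollate, datlocprovider FROM pg_database;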

At work, our databases use a common default, the C collation. This is a special collation in that while it ostensibly comes from the libc provider, it's actually implemented internally in Postgres and does not depend on libc. This commit's message discusses this.

The new builtin collation provider

Postgres 17 includes a new collation provider, a built-in one. This provider does not depend on external libraries, which avoids the risk of library upgrades unexpectedly changing behaviour.

It currently provides two collations: ucs_basic (similar to the C collation) and pg_c_utf8 (similar to the C collation except it treats non-ASCII characters as characters for the purposes of operations like lower()). These were added in these commits, which have commit messages worth reading.
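The difference is easy to see with lower():

SELECT lower('À' COLLATE ucs_basic);  -- 'À' (ASCII-only case mapping)
SELECT lower('À' COLLATE pg_c_utf8);  -- 'à' (Unicode-aware case mapping)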

Which collation should you use?

This depends on the context, and there are trade-offs.

Based on my reading, these are my recommendations:

- For the database collation, use ucs_basic (or the C collation if you are on a version before Postgres 17).
- For sorting results shown to humans, apply a natural language collation such as unicode per query.
- For case insensitive comparisons, use pg_c_utf8 with operations like lower().

I expand on these below.

Database collation

Each database has a collation set. This applies by default if a column or query does not have a collation applied.

I think a good choice for a database is the ucs_basic collation (or the similar "libc" C collation if you are on a version before Postgres 17). This is because:

- It does not depend on glibc or libicu, so library upgrades can't silently change its behaviour or corrupt indexes.
- Its behaviour is stable across Postgres major versions, so you don't risk needing to reindex when upgrading.

However, it has the downside that it does not treat non-ASCII characters as characters for common operations, so lower('À') is 'À', for example. As well, it sorts by Unicode code point value, which is often not ideal when viewed by a human. You can address both of these limitations either on a per column basis or on a per query basis (see below).

Another option is pg_c_utf8, available in Postgres 17+. This has the benefit that it does not depend on glibc/libicu and operations like lower('À') return 'à'. However, a drawback is that it is only stable within a major Postgres version. This means you are at risk of needing to reindex indexes on text columns when upgrading, which seems problematic.
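For reference, you set the database default by giving a provider and locale at creation time. A sketch, assuming Postgres 17, where as I understand it the builtin locale 'C' corresponds to ucs_basic behaviour and 'C.UTF-8' to pg_c_utf8 (my_db is a placeholder name):

-- TEMPLATE template0 is typically needed when the locale settings
-- differ from the template's.
CREATE DATABASE my_db
    TEMPLATE template0
    LOCALE_PROVIDER builtin
    BUILTIN_LOCALE 'C';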

Sorting

Assuming your database collation is ucs_basic, ordering by text columns will not produce results that make sense from a natural language perspective. For example, ordering the strings 'a', 'b', 'à' gives 'a', 'b', 'à', but it would be more natural to see 'a', 'à', 'b'. For queries where this matters (such as ordering results to show a human), you can specify a collation that orders in a natural language way rather than by Unicode code point value.

To do this, instead of SELECT * FROM my_table ORDER BY email you can write SELECT * FROM my_table ORDER BY email COLLATE unicode. unicode is a language agnostic collation that provides a natural language ordering. Note that this comes from the icu provider, which also provides language and region specific collations such as en-x-icu or en-CA-x-icu that you might choose to use instead of unicode.

Note that pg_c_utf8 sorts by Unicode code points just like ucs_basic, so it doesn't solve this problem.
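Putting that together, assuming the database collation is ucs_basic:

-- Default ordering, by code point: a, b, à
SELECT name FROM (VALUES ('a'), ('b'), ('à')) AS t(name)
ORDER BY name;

-- Natural language ordering: a, à, b
SELECT name FROM (VALUES ('a'), ('b'), ('à')) AS t(name)
ORDER BY name COLLATE unicode;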

Case insensitive comparisons

You may have queries where you want to compare text values case insensitively. With the ucs_basic collation, lower('À') is 'À', because this collation only considers ASCII characters as characters. To actually lowercase the character, apply a collation that treats non-ASCII characters as characters: lower('À' COLLATE pg_c_utf8) gives 'à'. pg_c_utf8 is a good choice as it is aware of non-ASCII characters and does not depend on libc or icu.

Note for such queries you should also consider using normalize() to apply Unicode normalization. In these examples I omit this aspect for simplicity.
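For example, 'é' can be encoded as a single code point or as 'e' plus a combining accent. The two forms render identically but compare as unequal until normalized:

SELECT U&'\00E9' = U&'e\0301';                                  -- false
SELECT normalize(U&'\00E9', NFC) = normalize(U&'e\0301', NFC);  -- true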

For indexes backing queries you want to be case insensitive, you have a few options:

You can insert the text after lowercasing it. Note this loses the original case, which you may or may not care about:

CREATE TABLE my_table (my_column text);
CREATE INDEX ON my_table (my_column);

-- Lowercase in Postgres.
INSERT INTO my_table VALUES (lower($1 COLLATE pg_c_utf8));
SELECT * FROM my_table WHERE my_column = lower($1 COLLATE pg_c_utf8);

-- Or lowercase $1 in the application.
INSERT INTO my_table VALUES ($1);
SELECT * FROM my_table WHERE my_column = $1;

Alternatively you can insert the text as is and set a collation on the column and create a functional index. This has the benefit of retaining the original case if that is important to you:

CREATE TABLE my_table (my_column text COLLATE pg_c_utf8);
CREATE INDEX ON my_table (lower(my_column));
INSERT INTO my_table VALUES ($1);
SELECT * FROM my_table WHERE lower(my_column) = lower($1);

Conceivably we could tweak this solution by putting the COLLATE in the functional index expression and not setting a collation on the column.
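A sketch of that variant:

CREATE TABLE my_table (my_column text);
CREATE INDEX ON my_table (lower(my_column COLLATE pg_c_utf8));

INSERT INTO my_table VALUES ($1);
SELECT * FROM my_table
WHERE lower(my_column COLLATE pg_c_utf8) = lower($1 COLLATE pg_c_utf8);

Note the query has to repeat the index expression exactly, including the COLLATE, for the index to be usable.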

A downside of using pg_c_utf8 for these is that you may need to rebuild the index when upgrading major Postgres versions since the collation is only stable within major versions. However, there is no collation that avoids that while supporting non-ASCII characters, so we're stuck with that limitation.

Note the first solution has another problem: Values stored in the table may be lowercased with different rules across major versions, which is not ideal.

Another option is to use a non-deterministic collation from the icu provider. See the und-u-ks-level2 example in the Postgres docs. A downside with this is that it depends on icu. As well, being non-deterministic means certain operations won't work (though parts of this are changing in Postgres 18), and there is a performance penalty. This option is easy to use in that you don't need to match a functional index expression when querying.
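A sketch based on that example from the docs:

-- Comparison strength level 2 ('ks-level2') ignores case differences.
CREATE COLLATION case_insensitive (
    provider = icu,
    locale = 'und-u-ks-level2',
    deterministic = false
);

CREATE TABLE my_table (my_column text COLLATE case_insensitive);
CREATE INDEX ON my_table (my_column);

INSERT INTO my_table VALUES ($1);
-- Matches case insensitively with no lower() needed.
SELECT * FROM my_table WHERE my_column = $1;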

Further reading

There are a few articles I found useful when looking into this, beyond the Postgres collation docs: