Collations in PostgreSQL
At work, we recently upgraded to PostgreSQL 17. One of the interesting additions in this version is a built-in collation provider. I investigated what this means and thought I would write down what I learned.
Collations
What is a collation in Postgres? Basically, it is something that defines how sorting text works, such as where non-ASCII characters appear in the ordering. It also controls certain operations on text, such as comparisons and whether operations like lower() understand non-ASCII characters. As is often the case, the Postgres docs cover the topic well.
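To make this concrete, here is a minimal sketch of how the collation in effect changes a simple comparison (this assumes a UTF-8 database; the unicode collation is described further below):
-- 'à' is code point U+00E0, which is greater than 'b' (U+0062).
SELECT 'à' < 'b' COLLATE "C";      -- false: compares by code point
SELECT 'à' < 'b' COLLATE unicode;  -- true: natural language ordering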
There are many collations and they can come from different providers.
Collation providers
Prior to Postgres 17, there were two collation providers: libc and icu. Both of these depend on external libraries, typically glibc and libicu. Using collations from these providers means that if you upgrade your glibc or libicu version, behaviour can change, and things like indexes can become corrupt, since they would have been built using a different version: the ordering in the index might no longer match the calculations done by the new version.
The potential for silent corruption is reduced in recent versions of Postgres, as Postgres emits warnings if the underlying glibc or libicu version changes compared to what it knows about. These warnings are more reliable and comprehensive for the icu provider, as glibc changes may not always be detectable (see the notes in the ALTER COLLATION docs). Postgres also emits warnings in more situations for the icu provider (see this article).
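To sketch what this looks like in practice (the version columns assume Postgres 15 or later; mydb is a placeholder name):
-- Compare the recorded collation versions with what the provider reports now.
SELECT datname, datcollate, datcollversion FROM pg_database;
-- After rebuilding any affected indexes, record the new versions:
ALTER DATABASE mydb REFRESH COLLATION VERSION;
ALTER COLLATION "en-x-icu" REFRESH VERSION;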
Typical collations
Databases have a collation set at creation time and you can't change this after the fact. This means selecting an appropriate collation at that time is important.
At work, our databases use a common default, the C collation. This is a special collation in that while it ostensibly comes from the libc provider, it's actually implemented internally in Postgres and does not depend on libc. This commit's message discusses this.
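For illustration, this is roughly how a database with the C collation can be created (mydb is a placeholder; TEMPLATE template0 is needed when the locale differs from the template database's):
CREATE DATABASE mydb
    TEMPLATE template0
    ENCODING 'UTF8'
    LC_COLLATE 'C'
    LC_CTYPE 'C';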
The new builtin collation provider
Postgres 17 includes a new collation provider, a built-in one. This provider does not depend on external libraries, which avoids the risk of library upgrades unexpectedly changing behaviour.
It currently provides two collations: ucs_basic (similar to the C collation) and pg_c_utf8 (similar to the C collation, except it treats non-ASCII characters as characters for the purposes of operations like lower()). These were added in these commits, which have commit messages worth reading.
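A quick sketch of the difference between the two on Postgres 17:
-- ucs_basic only lowercases ASCII; pg_c_utf8 understands non-ASCII characters.
SELECT lower('À' COLLATE ucs_basic);   -- 'À'
SELECT lower('À' COLLATE pg_c_utf8);   -- 'à'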
Which collation should you use?
This depends on the context, and there are trade-offs.
Based on my reading, these are my recommendations:
- Database level: ucs_basic.
- Sorting, if you want natural language ordering: unicode; otherwise use the database default of ucs_basic.
- Text operations, if you want non-ASCII support: pg_c_utf8; otherwise use the database default of ucs_basic.
I expand on these below.
Database collation
Each database has a collation set. This applies by default when a column or query does not specify a collation.
I think a good choice for a database is the ucs_basic collation (or the similar "libc" C collation if you are on a version before Postgres 17).
This is because:
- It is stable across Postgres versions, meaning you won't have to reindex anything when upgrading Postgres.
- It does not depend on external libraries, meaning you won't have the risk of broken indexes when upgrading glibc or libicu.
- It is fast.
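If you adopt this recommendation on Postgres 17, here is a minimal sketch of creating such a database using the builtin provider's C locale, which should match ucs_basic semantics (mydb is a placeholder name):
CREATE DATABASE mydb
    TEMPLATE template0
    ENCODING 'UTF8'
    LOCALE_PROVIDER builtin
    BUILTIN_LOCALE 'C';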
However, ucs_basic has the downside that it does not treat non-ASCII characters as characters for common operations, so lower('À') is 'À', for example. As well, when sorting, it orders by Unicode code point value, which is often not ideal when viewed by a human. You can address both of these limitations either on a per-column basis or on a per-query basis (see below).
Another option is pg_c_utf8, available in Postgres 17+. This has the benefit that it does not depend on glibc/libicu while still understanding non-ASCII characters, so lower('À') is 'à'. However, a drawback is that it is only stable within a major Postgres version. This means you are at risk of needing to rebuild indexes on text columns when upgrading, which seems problematic.
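For comparison, a sketch of creating a database that defaults to pg_c_utf8 semantics, via the builtin provider's C.UTF-8 locale (again, mydb is a placeholder, and the locale spelling follows the Postgres docs):
CREATE DATABASE mydb
    TEMPLATE template0
    ENCODING 'UTF8'
    LOCALE_PROVIDER builtin
    BUILTIN_LOCALE 'C.UTF-8';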
Sorting
Assuming your database collation is ucs_basic, ordering by text columns will not produce results that make sense from a natural language perspective. For example, if we order the strings 'a', 'b', 'à', we would get the ordering 'a', 'b', 'à', but it would be more natural to see 'a', 'à', 'b'. For queries where this is important (such as ordering results shown to a human), you can specify a collation that orders in a natural language way rather than by Unicode code point value.
To do this, instead of SELECT * FROM my_table ORDER BY email you can write SELECT * FROM my_table ORDER BY email COLLATE unicode. unicode is a language-agnostic collation that provides a natural language ordering. Note that it comes from the icu provider, which also provides language- and region-specific collations such as en-x-icu or en-CA-x-icu that you might choose to use instead of unicode.
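A self-contained sketch of the difference, using a hypothetical names table (and assuming a ucs_basic database default):
CREATE TABLE names (name text);
INSERT INTO names VALUES ('a'), ('b'), ('à');
SELECT name FROM names ORDER BY name;                  -- a, b, à (code point order)
SELECT name FROM names ORDER BY name COLLATE unicode;  -- a, à, b (natural order)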
Note that pg_c_utf8 sorts by Unicode code point just like ucs_basic, so it doesn't solve this problem.
Case insensitive comparisons
You may have queries where you want to compare text values case insensitively. With the ucs_basic collation, lower('À') is 'À'. This is because this collation only considers ASCII characters as characters. To make this actually lowercase the character, apply a collation that treats non-ASCII characters as characters: lower('À' COLLATE pg_c_utf8) gives 'à'. pg_c_utf8 is a good choice as it is aware of non-ASCII characters and does not depend on libc or icu.
Note that for such queries you should also consider using normalize() to apply Unicode normalization. In these examples I omit this aspect for simplicity.
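To sketch why normalization matters: visually identical strings can be built from different code point sequences, and with a deterministic collation they compare as unequal until normalized. U&'...' is Postgres's Unicode escape syntax:
-- U&'a\0300' is 'a' plus a combining grave accent; U&'\00E0' is precomposed 'à'.
SELECT U&'a\0300' = U&'\00E0';                                  -- false
SELECT normalize(U&'a\0300', NFC) = normalize(U&'\00E0', NFC);  -- true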
For indexes backing queries you want to be case insensitive, you have a few options:
You can insert the text after lowercasing it. Note this loses the original case, which you may or may not care about:
CREATE TABLE my_table (my_column text);
CREATE INDEX ON my_table (my_column);
-- Lowercase in Postgres.
INSERT INTO my_table VALUES (lower($1 COLLATE pg_c_utf8));
SELECT * FROM my_table WHERE my_column = lower($1 COLLATE pg_c_utf8);
-- Or lowercase $1 in the application.
INSERT INTO my_table VALUES ($1);
SELECT * FROM my_table WHERE my_column = $1;
Alternatively you can insert the text as is and set a collation on the column and create a functional index. This has the benefit of retaining the original case if that is important to you:
CREATE TABLE my_table (my_column text COLLATE pg_c_utf8);
CREATE INDEX ON my_table (lower(my_column));
INSERT INTO my_table VALUES ($1);
SELECT * FROM my_table WHERE lower(my_column) = lower($1);
Conceivably, we could tweak this solution by putting the COLLATE in the functional index expression and not setting a collation on the column.
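That variant might look like this (a sketch; note the query expression must match the index expression for the index to be usable):
CREATE TABLE my_table (my_column text);
CREATE INDEX ON my_table (lower(my_column COLLATE pg_c_utf8));
SELECT * FROM my_table WHERE lower(my_column COLLATE pg_c_utf8) = lower($1 COLLATE pg_c_utf8);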
A downside of using pg_c_utf8 for these is that you may need to rebuild the index when upgrading major Postgres versions, since the collation is only stable within a major version. However, there is no collation that avoids that while supporting non-ASCII characters, so we're stuck with that limitation.
Note the first solution has another problem: values stored in the table may be lowercased with different rules across major versions, which is not ideal.
Another option is to use a non-deterministic collation from the icu provider. See the und-u-ks-level2 example in the Postgres docs. A downside with this is that it depends on icu. As well, being non-deterministic means certain operations won't work (though parts of this are changing in Postgres 18), and there is a performance penalty. This option is easy to use in that you don't need to match the functional index expression when querying.
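Adapted from the example in the docs, that looks roughly like this (case_insensitive and users are placeholder names):
CREATE COLLATION case_insensitive
    (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
CREATE TABLE users (email text COLLATE case_insensitive);
INSERT INTO users VALUES ('Foo@Example.com');
-- Matches despite the case difference; no lower() or functional index needed.
SELECT * FROM users WHERE email = 'foo@example.com';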
Further reading
There are a few articles I found useful when looking into this, beyond the Postgres collation docs: