PostgreSQL types surprise
I recently encountered a query that performed much slower than I expected. The cause turned out to be that a parameter value was treated as a type matching an index in one spot but not matching a different index in another spot. This was not obvious at first so I thought it would be interesting to discuss it.
I expected the query to take under 1ms to execute, but in some cases it took over 600ms. When first looking at it, I thought that the cause was that the query was using a generic plan rather than a custom plan, and the planner was making a poor choice as a result. This is because the plan looked good when I ran the query normally but looked poor when I used a generic plan.
The query looked like this:
EXPLAIN
WITH a AS (
SELECT MAX(total) AS total
FROM table_a
WHERE account_id = 123 AND sub_account_id = '456'
AND event_date > '2023-10-10 20:00:36.364844+00'::timestamptz - '2 days'
AND event_date < '2023-10-10 20:00:36.364844+00'::timestamptz
),
b AS (
SELECT total
FROM table_b
WHERE account_id = 123 AND sub_account_id = '456'
)
SELECT (SELECT total FROM a), (SELECT total FROM b)
And the plan:
Result (cost=7.06..7.07 rows=1 width=64)
InitPlan 1 (returns $0)
-> Aggregate (cost=3.59..3.60 rows=1 width=32)
-> Index Scan using table_a_pkey on table_a (cost=0.56..3.59 rows=1 width=6)
Index Cond: (((account_id)::integer = 123) AND (sub_account_id = '456'::text) AND (event_date > ('2023-10-10 20:00:36.364844+00'::timestamp with time zone - '2 days'::interval)) AND (event_date < '2023-10-10 20:00:36.364844+00'::timestamp with time zone))
InitPlan 2 (returns $1)
-> Index Scan using table_b_pkey on table_b (cost=0.43..3.45 rows=1 width=5)
Index Cond: (((account_id)::integer = 123) AND (sub_account_id = '456'::citext))
This plan looks good and fully uses indexes on both tables.
When I looked at the generic plan, the indexes weren't fully used (since
this is Postgres 15, I had to prepare the statement to see the generic plan
rather than use Postgres 16's EXPLAIN (GENERIC_PLAN)
):
PREPARE stmt AS
WITH a AS (
SELECT MAX(total) AS total
FROM table_a
WHERE account_id = $1 AND sub_account_id = $2
AND event_date > $3::timestamptz - '2 days'
AND event_date < $3::timestamptz
),
b AS (
SELECT total
FROM table_b
WHERE account_id = $1 AND sub_account_id = $2
)
SELECT (SELECT total FROM a), (SELECT total FROM b)
SET plan_cache_mode = force_generic_plan;
EXPLAIN execute stmt(123, '456', '2023-10-10 20:00:36.364844+00');
And the plan:
Result (cost=46807.63..46807.64 rows=1 width=64)
InitPlan 1 (returns $0)
-> Aggregate (cost=3.59..3.60 rows=1 width=32)
-> Index Scan using table_a_pkey on table_a (cost=0.56..3.58 rows=1 width=6)
Index Cond: (((account_id)::integer = $1) AND (sub_account_id = $2) AND (event_date > ($3 - '2 days'::interval)) AND (event_date < $3) AND (event_date = $3))
InitPlan 2 (returns $1)
-> Bitmap Heap Scan on table_b (cost=1111.34..46804.02 rows=1 width=5)
Recheck Cond: ((account_id)::integer = $1)
Filter: ((sub_account_id)::text = $2)
-> Bitmap Index Scan on table_b_pkey (cost=0.00..1111.34 rows=80121 width=0)
Index Cond: ((account_id)::integer = $1)
This plan's estimate is much worse, and it only uses the account_id
component of the multi-column (account_id, sub_account_id)
index on
table_b
(but still fully uses the (account_id, sub_account_id,
event_date)
index on table_a
).
This was the plan that the query was using, and explains why the query was taking much longer in some cases: The index was not being fully used and so we had to filter many tuples from the heap.
Seeing this lead me to believe the generic plan was the problem. However, I realized that Postgres does not use the generic plan if the estimate is much worse than the custom plan (as described here), so this being the cause didn't make sense.
I realized that the types could be the issue. Sure enough, the column type
for sub_account_id
was different on the two tables (and as a result also
different in the two indexes). It was text
on table_a
but citext
on
table_b
. Postgres inferred from context that $2
's type was text
which
was fine for table_a
, but the second use of $2
with table_b
made this
choice a problem since the (account_id, sub_account_id)
index could not
be fully used due to sub_account_id
being a different type.
To fix this, we can change the second instance of $2
to $2::citext
.
(Another solution would be to pass sub_account_id
as a separate
parameter, e.g. by adding a $4
to the query. This would let Postgres pick
citext
from context. I'm not sure if there's a reason to prefer one or
the other).
In hindsight the cause is obvious, but it took me a while to realize what was going on.
One thing I need to keep in mind when looking at EXPLAIN
output in the
future is that how you provide the parameter values can affect the plan, as
shown here. If I had looked at the parameterized version of the query at
first rather than inlining parameters in the query text and running it with
psql
, it would have been clearer what was going on.
To do that, one way is to run the EXPLAIN
query in a similar way as the
application does. (You can still retrieve the EXPLAIN
output that way).
For example, in my case I could call the EXPLAIN
query via Go code. This
also would show any surprises stemming from how the Postgres driver
provides parameters, which can likely impact behaviour too.