The One and the Many

A PostgreSQL type surprise

I recently encountered a query that performed much more slowly than I expected. The cause turned out to be that a parameter was resolved to a type that matched the index in one part of the query but not the index in another part. This was not obvious at first, so I thought it would be interesting to discuss.

I expected the query to take under 1 ms to execute, but in some cases it took over 600 ms. When I first looked at it, I thought the cause was that the query was using a generic plan rather than a custom plan, and that the planner was making a poor choice as a result: the plan looked good when I ran the query normally but poor when I forced a generic plan.

The query looked like this:

EXPLAIN
WITH a AS (
    SELECT MAX(total) AS total
    FROM table_a
    WHERE account_id = 123 AND sub_account_id = '456'
    AND event_date > '2023-10-10 20:00:36.364844+00'::timestamptz - '2 days'
    AND event_date < '2023-10-10 20:00:36.364844+00'::timestamptz
),
b AS (
    SELECT total
    FROM table_b
    WHERE account_id = 123 AND sub_account_id = '456'
)
SELECT (SELECT total FROM a), (SELECT total FROM b)

And the plan:

Result  (cost=7.06..7.07 rows=1 width=64)
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=3.59..3.60 rows=1 width=32)
          ->  Index Scan using table_a_pkey on table_a  (cost=0.56..3.59 rows=1 width=6)
                Index Cond: (((account_id)::integer = 123) AND (sub_account_id = '456'::text) AND (event_date > ('2023-10-10 20:00:36.364844+00'::timestamp with time zone - '2 days'::interval)) AND (event_date < '2023-10-10 20:00:36.364844+00'::timestamp with time zone))
  InitPlan 2 (returns $1)
    ->  Index Scan using table_b_pkey on table_b  (cost=0.43..3.45 rows=1 width=5)
          Index Cond: (((account_id)::integer = 123) AND (sub_account_id = '456'::citext))

This plan looks good and fully uses indexes on both tables.

When I looked at the generic plan, the indexes weren't fully used (since this is Postgres 15, I had to prepare the statement to see the generic plan rather than use Postgres 16's EXPLAIN (GENERIC_PLAN)):

PREPARE stmt AS
WITH a AS (
    SELECT MAX(total) AS total
    FROM table_a
    WHERE account_id = $1 AND sub_account_id = $2
    AND event_date > $3::timestamptz - '2 days'
    AND event_date < $3::timestamptz
),
b AS (
    SELECT total
    FROM table_b
    WHERE account_id = $1 AND sub_account_id = $2
)
SELECT (SELECT total FROM a), (SELECT total FROM b);

SET plan_cache_mode = force_generic_plan;

EXPLAIN EXECUTE stmt(123, '456', '2023-10-10 20:00:36.364844+00');
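
As an aside, on Postgres 16 or later the prepared-statement dance isn't needed: the EXPLAIN (GENERIC_PLAN) option mentioned earlier accepts a parameterized statement directly. A sketch with the same query (table and column names as above):

-- Postgres 16+ only; no PREPARE or plan_cache_mode setting needed.
EXPLAIN (GENERIC_PLAN)
WITH a AS (
    SELECT MAX(total) AS total
    FROM table_a
    WHERE account_id = $1 AND sub_account_id = $2
    AND event_date > $3::timestamptz - '2 days'
    AND event_date < $3::timestamptz
),
b AS (
    SELECT total
    FROM table_b
    WHERE account_id = $1 AND sub_account_id = $2
)
SELECT (SELECT total FROM a), (SELECT total FROM b);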

And the plan from executing the prepared statement:

Result  (cost=46807.63..46807.64 rows=1 width=64)
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=3.59..3.60 rows=1 width=32)
          ->  Index Scan using table_a_pkey on table_a  (cost=0.56..3.58 rows=1 width=6)
                Index Cond: (((account_id)::integer = $1) AND (sub_account_id = $2) AND (event_date > ($3 - '2 days'::interval)) AND (event_date < $3))
  InitPlan 2 (returns $1)
    ->  Bitmap Heap Scan on table_b  (cost=1111.34..46804.02 rows=1 width=5)
          Recheck Cond: ((account_id)::integer = $1)
          Filter: ((sub_account_id)::text = $2)
          ->  Bitmap Index Scan on table_b_pkey  (cost=0.00..1111.34 rows=80121 width=0)
                Index Cond: ((account_id)::integer = $1)

This plan's estimate is much worse, and it only uses the account_id component of the multi-column (account_id, sub_account_id) index on table_b (but still fully uses the (account_id, sub_account_id, event_date) index on table_a).

This was the plan the query was actually using, which explains why it was taking much longer in some cases: the index was not being fully used, so many tuples had to be fetched from the heap and then filtered out.

Seeing this led me to believe the generic plan was the problem. However, Postgres does not choose the generic plan if its estimate is much worse than the custom plan's (as described here), so that being the cause didn't make sense.

I realized that the types could be the issue. Sure enough, the column type of sub_account_id differed between the two tables (and, as a result, between the two indexes): it was text on table_a but citext on table_b. Postgres inferred from context that $2's type was text, which was fine for table_a, but the second use of $2 against table_b made that choice a problem: because sub_account_id there is a different type, the (account_id, sub_account_id) index could not be fully used.
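
One quick way to confirm a mismatch like this is to compare the declared column types directly. A sketch against information_schema (table and column names as in the query above; in psql, \d table_a and \d table_b show the same information along with the index definitions):

-- citext is an extension type, so it shows up as data_type = 'USER-DEFINED'
-- with udt_name = 'citext', while text shows data_type = 'text'.
SELECT table_name, data_type, udt_name
FROM information_schema.columns
WHERE column_name = 'sub_account_id'
  AND table_name IN ('table_a', 'table_b');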

To fix this, we can change the second instance of $2 to $2::citext, as shown below. (Another solution would be to pass sub_account_id as a separate parameter, e.g. by adding a $4 to the query; that would let Postgres pick citext from context for it. I'm not sure there's a reason to prefer one over the other.)
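
Concretely, only the b CTE changes. A sketch of the fixed parameterized statement (stmt2 is just an illustrative name):

PREPARE stmt2 AS
WITH a AS (
    SELECT MAX(total) AS total
    FROM table_a
    WHERE account_id = $1 AND sub_account_id = $2
    AND event_date > $3::timestamptz - '2 days'
    AND event_date < $3::timestamptz
),
b AS (
    SELECT total
    FROM table_b
    -- The explicit cast lets the (account_id, sub_account_id) index on
    -- table_b match on both columns; table_a still sees $2 as text.
    WHERE account_id = $1 AND sub_account_id = $2::citext
)
SELECT (SELECT total FROM a), (SELECT total FROM b);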

In hindsight the cause is obvious; the very first plan already shows the literal being resolved as '456'::text against table_a but '456'::citext against table_b. Still, it took me a while to realize what was going on.

One thing I need to keep in mind when looking at EXPLAIN output in the future is that how you provide the parameter values can affect the plan, as shown here. If I had started with the parameterized version of the query, rather than inlining the parameters into the query text and running it with psql, it would have been clearer what was going on.

One way to do that is to run the EXPLAIN query the same way the application runs its queries (you can still retrieve the EXPLAIN output that way). For example, in my case I could issue the EXPLAIN query from Go code. This would also surface any surprises stemming from how the Postgres driver provides parameters, which can affect behaviour too.
