The One and the Many

More PostgreSQL generic plan fun

On a recent project, I was checking whether data existed for networks in a large table. When running my program, I found that one of my queries was quick at first and then became slow. It turned out that Postgres had begun using a generic plan that performed much more poorly than the custom plan.

I was running queries like this:

SELECT EXISTS (
    SELECT
    FROM big_table
    WHERE ip_address <<= $1
)

I was executing them using pgx as my Postgres driver which automatically prepares and caches statements.

The first few executions were quick, as were executions in psql.

At first I didn't understand what was going on, especially when I saw the same query executed quickly in psql. However I realized that it could be a generic plan issue given past experience. Examining the custom and generic plans shows a huge difference:

The custom plan:

Result  (cost=5.84..5.85 rows=1 width=1)
  InitPlan 1 (returns $0)
    ->  Index Only Scan using big_table__ip_address_time_partial on big_table  (cost=0.57..5.84 rows=1 width=0)
          Index Cond: ((ip_address >= 'dead:beef::/62'::inet) AND (ip_address <= 'dead:beef:ffff:ffff:ffff:ffff'::inet))
          Filter: ((ip_address)::inet <<= 'dead:beef::/62'::inet)

The generic plan:

Result  (cost=5.66..5.67 rows=1 width=1)
  InitPlan 1 (returns $0)
    ->  Index Only Scan using big_table__ip_address_time_partial on big_table  (cost=0.57..8762195.67 rows=1721443 width=0)
          Filter: ((ip_address)::inet <<= $1)

(I checked the plans using the same technique I describe in this post).

Postgres uses a heuristic whereby after five executions of a prepared statement it will begin using a generic plan if it looks less costly than the custom one. This is described here.

I wasn't yet sure that Postgres was actually switching to the custom plan though. To check, I told Postgres to never use a generic plan:

SET plan_cache_mode = force_custom_plan

This made the query consistently execute quickly, which means the generic plan was the problem.

One thing that is still confusing to me is why Postgres decided it was a good idea to use the generic plan when the cost is clearly far worse. I believe it is because the cost of the topmost node is cheaper in the generic plan, which is odd given it has a child node that is much more expensive. I had thought parent nodes include the cost of their child nodes, but clearly that is not the case here.

Update: I've realized I was confused in the above paragraph. The reason Postgres would use the generic plan is that it has a cheaper estimate (based on the top most node). While the plan in practice performs worse, the estimate is what Postgres uses.

Regarding the child node having a higher estimate than the parent: That is explained by this (from the EXPLAIN docs):

Also, if you limit the number of rows to return with a LIMIT clause, the planner makes an appropriate interpolation between the endpoint costs to estimate which plan is really the cheapest.

The estimate of the Index Only Scan assumes we would fetch all of the rows, but we do not because of the EXISTS in the query.