Coding Comrade

Efficient way to select the first row from each group in PostgreSQL

November 29, 2023

Recently, I encountered a situation where I had to optimize an SQL query due to PostgreSQL’s excessive CPU utilization. The query itself seemed straightforward, but it was consistently taking more than 30 seconds to complete. Considering it was a frequently used query, it was causing frequent CPU spikes, impacting the performance of other queries.

Here’s the original sample query:

SELECT
    *
FROM
    (
        SELECT
            tbl1.<columns>,
            tbl2.<columns>
        FROM
            tbl1
            JOIN tbl2 ON tbl1.id = tbl2.id
        WHERE
            tbl1.other_id IN (
                SELECT
                    min(tbl1.other_id)
                FROM
                    tbl1
                    JOIN tbl2 ON tbl1.id = tbl2.id
                WHERE
                    <some conditions>
                GROUP BY
                    tbl1.other_id
            )
        ORDER BY
            tbl1.other_id
    ) AS innerQuery
LIMIT 300;

Accompanied by its execution plan:

 Limit  (cost=11686.42..12241.94 rows=300 width=371) (actual time=31094.513..31094.518 rows=1 loops=1)
   ->  Nested Loop  (cost=11686.42..7597359.01 rows=4118716 width=371) (actual time=31094.512..31094.517 rows=1 loops=1)
         ->  Merge Join  (cost=11686.13..6340599.52 rows=4118716 width=243) (actual time=31094.478..31094.483 rows=1 loops=1)
               Merge Cond: (tbl1.other_id = "ANY_subquery".min)
               ->  Index Scan using <some idx> on tbl1  (cost=0.56..6308317.36 rows=8237433 width=243) (actual time=0.016..30363.563 rows=7959843 loops=1)
               ->  Sort  (cost=11685.57..11686.07 rows=200 width=8) (actual time=0.628..0.629 rows=1 loops=1)
                     Sort Key: "ANY_subquery".min
                     Sort Method: quicksort  Memory: 25kB
                     ->  HashAggregate  (cost=11675.93..11677.93 rows=200 width=8) (actual time=0.616..0.619 rows=1 loops=1)
                           Group Key: "ANY_subquery".min
                           ->  Subquery Scan on "ANY_subquery"  (cost=11613.16..11674.72 rows=482 width=8) (actual time=0.613..0.613 rows=1 loops=1)
                                 ->  Finalize GroupAggregate  (cost=11613.16..11669.90 rows=482 width=16) (actual time=0.612..0.612 rows=1 loops=1)
                                       Group Key: tbl1.other_id
                                       ->  Gather Merge  (cost=11613.16..11663.07 rows=402 width=16) (actual time=0.610..0.730 rows=1 loops=1)
                                             Workers Planned: 2
                                             Workers Launched: 0
                                             ->  Partial GroupAggregate  (cost=10613.13..10616.65 rows=201 width=16) (actual time=0.104..0.104 rows=1 loops=1)
                                                   Group Key: tbl1.other_id
                                                   ->  Sort  (cost=10613.13..10613.63 rows=201 width=16) (actual time=0.102..0.102 rows=1 loops=1)
                                                         Sort Key: tbl1.other_id
                                                         Sort Method: quicksort  Memory: 25kB
                                                         ->  Nested Loop Left Join  (cost=1.53..10605.44 rows=201 width=16) (actual time=0.066..0.093 rows=1 loops=1)
                                                               Filter: ((<condition1>) OR (<condition2>))
                                                               ->  Nested Loop  (cost=0.97..9964.02 rows=248 width=24) (actual time=0.054..0.081 rows=1 loops=1)
                                                                     ->  Parallel Index Scan using <some idx> on tbl1  (cost=0.68..9158.50 rows=1125 width=32) (actual time=0.040..0.067 rows=1 loops=1)
                                                                           Index Cond: (<condition>)
                                                                           Filter: (<condition>)
                                                                           Rows Removed by Filter: 2
                                                                     ->  Index Scan using <some idx> on tbl2  (cost=0.29..0.72 rows=1 width=8) (actual time=0.012..0.012 rows=1 loops=1)
                                                                           Index Cond: (<condition>)
                                                                           Filter: (<condition>)
         ->  Index Scan using <some idx> on tbl2  (cost=0.29..0.31 rows=1 width=136) (actual time=0.025..0.025 rows=1 loops=1)
               Index Cond: (<condition>)
 Planning time: 1.531 ms
 Execution time: 31094.852 ms

As can be seen, the query essentially retrieves the first row from a group of rows, yet it takes a staggering 31 seconds!

PostgreSQL offers a unique statement called SELECT DISTINCT ON specifically designed to achieve this goal in a much more efficient manner. It eliminates rows that match on all the specified expressions, ensuring that only the first distinct row for each group is returned.

Here’s the optimized query using SELECT DISTINCT ON:

select
    *
from
    (
        select
            distinct on (tbl1.other_id) tbl1.<columns>, tbl2.<columns>
        from
            tbl1
            join tbl2 on tbl1.id = tbl2.id
        where
            <some conditions>
        order by
            tbl1.other_id,
            tbl1.id
    ) as q
order by
    id
limit
    300;

This query’s execution plan is as follows:

 Limit  (cost=11082.00..11082.75 rows=300 width=371) (actual time=2.102..2.104 rows=8 loops=1)
   ->  Sort  (cost=11082.00..11083.13 rows=454 width=371) (actual time=2.101..2.102 rows=8 loops=1)
         Sort Key: q.id
         Sort Method: quicksort  Memory: 29kB
         ->  Subquery Scan on q  (cost=11055.15..11061.96 rows=454 width=371) (actual time=2.044..2.063 rows=8 loops=1)
               ->  Unique  (cost=11055.15..11057.42 rows=454 width=379) (actual time=2.042..2.053 rows=8 loops=1)
                     ->  Sort  (cost=11055.15..11056.28 rows=454 width=379) (actual time=2.041..2.042 rows=8 loops=1)
                           Sort Key: tbl1.other_id, tbl1.id
                           Sort Method: quicksort  Memory: 29kB
                           ->  Gather  (cost=1001.53..11035.11 rows=454 width=379) (actual time=1.505..2.116 rows=8 loops=1)
                                 Workers Planned: 2
                                 Workers Launched: 0
                                 ->  Nested Loop Left Join  (cost=1.53..9989.71 rows=189 width=379) (actual time=0.235..0.747 rows=8 loops=1)
                                       Filter: (<some conditions>)
                                       Rows Removed by Filter: 3
                                       ->  Nested Loop  (cost=0.97..9387.10 rows=233 width=379) (actual time=0.090..0.514 rows=11 loops=1)
                                             ->  Parallel Index Scan using <idx> on tbl1  (cost=0.68..8600.52 rows=1059 width=251) (actual time=0.057..0.351 rows=19 loops=1)
                                                   Index Cond: (<condition>)
                                                   Filter: (<condition>)
                                                   Rows Removed by Filter: 14
                                             ->  Index Scan using <idx> on tbl2  (cost=0.29..0.74 rows=1 width=136) (actual time=0.008..0.008 rows=1 loops=19)
                                                   Index Cond: (<condition>)
                                                   Filter: (<condition>)
                                                   Rows Removed by Filter: 0
 Planning time: 4.296 ms
 Execution time: 2.547 ms

Remarkably, the execution time drops to just 2 milliseconds, representing a staggering 99.9999% improvement!

Key Takeaways:

  • SELECT DISTINCT ON is a PostgreSQL-specific statement for efficiently selecting the first row from each group.
  • Ordering rows is crucial when using SELECT DISTINCT ON; otherwise, the order becomes unpredictable.
  • Replacing the original query with the optimized SELECT DISTINCT ON approach significantly reduces execution time and improves overall query performance.

I hope this blog post has shed light on the efficient way to select the first row from each group in PostgreSQL using SELECT DISTINCT ON. Remember to always optimize queries to ensure optimal database performance.

Happy coding!


Eugene Yamenko

Hi, I'm Eugene, a Software Engineer in Sydney, Australia.