PostgreSQL query optimization techniques for better database performance

In my previous article, I discussed why Postgres performance tuning matters, the warning signs that tell you your Postgres database is struggling, and 8 parameters you can adjust to improve its performance and see results immediately. Today, we’ll take a closer look at Postgres query optimization techniques to fine-tune your queries for database efficiency.

One article is not enough to master what it takes to optimize Postgres databases. In working with clients facing all kinds of challenges across industries, including Nextdoor, Clubhouse, and Coinbase, I’ve seen firsthand the complexity that comes with Postgres performance tuning.

To optimize PostgreSQL queries, first understand how they work

Before we get into the nitty gritty of optimization techniques, it’s worth doing a refresher (even if you’re an experienced data practitioner) on how Postgres queries run. Understanding the fundamentals helps shed light on why and how you can then apply targeted Postgres query optimization techniques.

When you submit a query to Postgres, it first undergoes parsing, which checks your SQL for syntax errors and converts it into a parse tree. This is an internal representation that Postgres can work with–if this sounds a bit too abstract, bear with me, we’re almost there.

Next, the query enters the planning and optimization phase. Here, Postgres generates various execution plans to retrieve what the SQL command is requesting. The optimizer evaluates these plans and chooses the one it estimates will minimize execution time and resource usage. How does it decide what’s efficient? This is where query optimization techniques come in, as factors like available indexes, join strategies, and data distribution affect execution efficiency.

After the optimal plan has been selected, the executor accesses the data (from disk or memory), applies any needed filters, and sorts and aggregates the results as laid out in the SQL query.

It’s probably clearer now why a well-designed query can significantly reduce execution time.

Techniques to identify slow Postgres queries for optimization

One of the first things we do when reviewing any Postgres database is to look for inefficient queries.

Tools to identify inefficient Postgres queries

We can leverage Postgres’ built-in tools for monitoring and performance tuning:

  • pg_stat_statements: tracks execution statistics for all SQL statements executed by the server, making it a powerful tool for identifying slow queries (see the example queries after this list).
  • pg_stat_activity: provides real-time insights into currently running queries, allowing you to identify long-running or stuck queries that may be impacting performance.
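
As a quick illustration, here are the kinds of queries you might run against these views. This is a sketch that assumes the pg_stat_statements extension is installed and Postgres 13 or later (where the timing columns are named total_exec_time and mean_exec_time; older versions call them total_time and mean_time):

-- Top 10 statements by average execution time
SELECT query, calls, mean_exec_time, total_exec_time, rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Currently running queries, longest first
SELECT pid, now() - query_start AS runtime, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST;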

There are also other tools like pgBadger, which parses Postgres logs and generates detailed performance reports, highlighting slow queries and even suggesting how you can optimize them.

You may also be familiar with companies like Datadog and New Relic, which offer Application Performance Management (APM) products for monitoring. These go further in that they track query performance across your entire stack, providing a holistic view of how your database interacts with your application.

Performance thresholds and alerts

APM tools can set performance thresholds, which let you know via alerts whenever query performance degrades. The intent is to catch problems before they cause larger, and more costly, downstream impacts.

What’s less obvious is where those performance thresholds should be set. Fine-tuning a Postgres database, or any database, is incredibly use case specific. Our customers have applications with very different performance expectations; what’s acceptable for them might not be acceptable for you.

Here’s an example that illustrates this more clearly. For an ecommerce platform, database performance is particularly crucial during peak periods on special holidays (e.g. Christmas). Slow queries during these times would adversely impact the user experience and thus sales. Your data engineer might set a performance threshold for read queries like product searches at 200 milliseconds, and for write queries like placing an order at 1 second. Anything slower than those thresholds triggers an alert for investigation.

But you wouldn’t need the same thresholds for an internal analytics system. The thresholds can be set much higher, say 2 seconds, since the use case is your colleagues running reports on an ad hoc basis.

How to set Postgres performance thresholds

In general, we recommend starting with baseline thresholds based on general best practices and adjusting them over time as you gain more insight into how the application performs under different conditions and pressures. For instance, set a query execution time threshold of 500 milliseconds, and if you find that most queries complete well within 300 milliseconds, tighten the threshold to catch issues sooner.
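
One way to put such a baseline into practice inside Postgres itself is log_min_duration_statement, which logs every statement slower than the given threshold. Here it is set to the 500 millisecond starting point (adjust to your own baseline):

-- Log any statement that takes longer than 500 ms
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();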

Understanding what end users and stakeholders expect is also key to refining thresholds. Ultimately, these thresholds are not arbitrarily set based on what’s fastest or technologically possible: they’re set in alignment with business goals. To return to our internal analytics example, users might be ok with slower performance for complex end-of-month reports, but expect faster response times for their day-to-day analytic development work.

Top 4 techniques for PostgreSQL query optimization

I’ll now turn to four techniques that aren’t usually discussed in sufficient detail, at least not beyond the basics. For each, I’ll focus on how it offers long-term, sustainable improvements that make your Postgres database more efficient.

Indexes

As always, let’s start with indexes. They’re one of the most powerful tools for improving query performance. By creating indexes on columns that are frequently used in WHERE clauses, JOIN conditions, and ORDER BY clauses, you can significantly reduce the amount of data that needs to be scanned.

Postgres supports different types of indexes, including B-tree, hash, and GIN indexes, which are each best suited for different types of queries. For example, you’ll want to use a B-tree index for columns where you need to perform equality and range queries:

CREATE INDEX idx_orders_customer_id ON orders(customer_id);

With this index, queries filtering by customer_id will execute much faster. As mentioned in our previous article, indexes are not a free lunch: they can slow down write operations (INSERT, UPDATE, DELETE) due to the overhead of maintaining the index. So, it's important to strike a balance and only create indexes that provide significant performance benefits for your most frequent and time-sensitive queries.

Before adding an index, use a tool like pg_stat_statements to gather insights on query performance and identify which columns are frequently involved in filtering or sorting. Indexes need to be reviewed and maintained over time too: remove any that are rarely used or no longer helpful, since unused indexes still add write overhead.
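
One way to spot candidates for removal is the pg_stat_user_indexes view, which tracks how often each index has been scanned since statistics were last reset:

-- Indexes that are rarely or never used (low idx_scan counts)
SELECT schemaname, relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC
LIMIT 20;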

Writing optimized queries

Writing more efficient SQL commands requires understanding how Postgres executes them (my primer on this earlier in the article comes in handy here!). You want to avoid common mistakes like unnecessary subqueries and inefficient JOINs. For example, use a JOIN instead of a subquery here:

Using JOINs instead of subqueries

Before optimizing:

SELECT name, total
FROM customers
WHERE id IN (
    SELECT customer_id
    FROM orders
    WHERE total > 100
);

After optimizing:

SELECT customers.name, orders.total
FROM customers
JOIN orders ON customers.id = orders.customer_id
WHERE orders.total > 100;

Why does such a simple change impact performance? By replacing the subquery with a JOIN, you give the optimizer a clearer view of the query structure, which often leads to more efficient execution plans.
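
Don’t take the rewrite on faith, though: run both versions through EXPLAIN ANALYZE and compare the plans and actual timings. For example:

-- Show the chosen plan, actual row counts, timings, and buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT customers.name, orders.total
FROM customers
JOIN orders ON customers.id = orders.customer_id
WHERE orders.total > 100;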

Quite often, you won’t be writing new queries from scratch, but instead wading through complex SQL with queries nested inside each other. Where do you start refactoring for better performance?

Common Table Expressions

One of my go-to techniques is to break down complex queries into simpler parts. This not only improves readability but makes it easier for Postgres to optimize the query. Common Table Expressions (CTEs) come in handy here:

Before:

SELECT customer_id, total
FROM orders
WHERE order_date > NOW() - INTERVAL '1 month'
AND total > (SELECT AVG(total) FROM orders);

After:

WITH RecentOrders AS (
  SELECT customer_id, total
  FROM orders
  WHERE order_date > NOW() - INTERVAL '1 month'
),
AverageTotal AS (
  SELECT AVG(total) AS avg_total FROM orders
)
SELECT r.customer_id, r.total
FROM RecentOrders r, AverageTotal a
WHERE r.total > a.avg_total;
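
One planner detail worth knowing: since Postgres 12, a CTE that is referenced only once is usually inlined into the main query rather than computed separately, so the refactored version above should plan just as efficiently as the original. If you want a CTE’s result computed once and reused (the pre-12 behavior), you can request it explicitly. A minimal sketch using the same RecentOrders CTE:

-- Force the CTE to be materialized instead of inlined (Postgres 12+)
WITH RecentOrders AS MATERIALIZED (
  SELECT customer_id, total
  FROM orders
  WHERE order_date > NOW() - INTERVAL '1 month'
)
SELECT customer_id, total
FROM RecentOrders
WHERE total > 100; -- illustrative filter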

Window functions

Window functions can also provide efficient ways to perform calculations across rows related to the current row, often eliminating the need for subqueries:

Before:

SELECT o1.customer_id, o1.order_date, o1.total,
       (SELECT SUM(o2.total)
        FROM orders o2
        WHERE o2.customer_id = o1.customer_id
          AND o2.order_date <= o1.order_date) AS running_total
FROM orders o1
ORDER BY o1.customer_id, o1.order_date;

After:

SELECT customer_id, order_date, total,
       SUM(total) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM orders
ORDER BY customer_id, order_date;

Because window functions operate over the set of rows related to the current row in a single pass, they avoid re-running a correlated subquery for every row, which reduces the computational overhead for queries run on large datasets.

Schema design, or denormalization

Effectively using Postgres optimization techniques like CTEs and window functions often requires thinking about schema design too.

Denormalization involves merging tables to reduce the complexity of JOIN operations and can speed up read operations. In a normalized schema, you might have separate tables for customers and orders, and so any SQL query that needs data from both tables would use a JOIN operation:

SELECT c.customer_id, c.customer_name, o.order_id, o.order_date, o.total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;

With denormalization, you merge the data into a single table to avoid the JOIN operation and the query is now simplified:

SELECT customer_id, customer_name, order_id, order_date, total
FROM customer_orders;
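
How would such a customer_orders table come to exist? A minimal sketch, assuming the customers and orders tables from the earlier examples, is to materialize the join once:

-- Build the denormalized table from the normalized source tables
CREATE TABLE customer_orders AS
SELECT c.customer_id, c.customer_name, o.order_id, o.order_date, o.total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;

In practice you’d keep it up to date with triggers, batch jobs, or a replication process, which is exactly the propagation overhead discussed next.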

Like with indexes, denormalization comes with trade-offs. Here, you trade data integrity and storage efficiency for better read performance. The biggest risk is data anomalies: any updates to upstream data sources must be accurately propagated across multiple rows in the denormalized table.

So how does schema design relate to CTEs and window functions? If you frequently perform complex aggregations or calculations, denormalizing your schema might make sense. This could involve creating summary tables that store precomputed results (a sketch follows below). But achieving the best performance may not be possible without also refactoring complex queries. CTEs and window functions can provide additional performance gains for complex aggregations or calculations that are not easily addressed through schema design alone.
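
To make the summary-table idea concrete, here is a minimal sketch, again assuming the orders table from the earlier examples:

-- Precomputed per-customer aggregates, rebuilt or refreshed on a schedule
CREATE TABLE customer_order_summary AS
SELECT customer_id,
       COUNT(*) AS order_count,
       SUM(total) AS lifetime_total
FROM orders
GROUP BY customer_id;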

By using CTEs and window functions together with a well-designed schema, you can achieve more efficient and scalable query performance. It’s helpful to see query refactoring and schema design as complementary strategies that amplify each other’s ability to optimize data performance.

Caching

Caching frequently accessed data can drastically improve performance. Two types of caching I frequently turn to for improving Postgres performance are materialized views and query caching, and they operate very differently.

Materialized views

Materialized views are a form of physical data caching–Postgres stores the query results physically in the database, where they can be queried like a regular table. The advantage is that you don’t need to recompute the query every time it’s run. This is useful for situations where you have complex (expensive) queries that do not need to reflect real-time data.

Before:

SELECT o.order_id, o.order_date, o.total, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date > NOW() - INTERVAL '1 month'
ORDER BY o.order_date DESC;

After:

CREATE MATERIALIZED VIEW recent_orders AS
SELECT o.order_id, o.order_date, o.total, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date > NOW() - INTERVAL '1 month';

You can now just query the materialized view directly:

SELECT * FROM recent_orders
ORDER BY order_date DESC;

And refresh it periodically with this nifty command:

REFRESH MATERIALIZED VIEW recent_orders;
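
One caveat: a plain REFRESH locks the materialized view against reads while it rebuilds. If the view needs to stay queryable during the refresh, you can refresh it concurrently, provided the view has a unique index (here assuming order_id uniquely identifies rows in the view):

-- CONCURRENTLY requires a unique index on the materialized view
CREATE UNIQUE INDEX idx_recent_orders_order_id ON recent_orders(order_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY recent_orders;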

Query caching

With query caching, you store the results of frequently executed queries in memory so that subsequent requests don’t have to hit the database at all. You can implement this at the application level (perhaps backed by something like Redis) or by using tools like pgpool-II, which has an in-memory query cache for storing results. This is a better fit than materialized views if you’re dealing with data that changes frequently.

Continuous PostgreSQL query monitoring and improvement

In our next article, we’ll wrap up our Postgres series by turning to advanced Postgres performance tuning and monitoring techniques. Postgres query optimization is an ongoing process that requires a keen understanding of your application’s workload, data, and hardware setup. Optimization isn’t done for its own sake; fundamentally, it comes down to what the business needs, and that will change over time.

By trying out some of the tools and techniques laid out here, you’ll gain more first-hand experience and become more confident with interpreting performance data and recognizing red flags. Knowing what performance thresholds to set is highly domain-specific and depends on a deep understanding of the specific application’s requirements, user expectations, and typical usage patterns. For example, frequent alerts for query execution times just above your threshold might indicate the need for better indexing or query optimization.

The tools are helpful, but ultimately what you need is a database performance expert with decades of experience–someone who can pattern-match your database’s performance and recognize potential problems. This is something we’ve refined over time through experience working across a range of Postgres optimization challenges.

Our team of experts brings decades of experience with modernizing Postgres and other databases. We’ve scaled applications that served millions of users at Nextdoor, Clubhouse, and Coinbase.

If you’re facing poor Postgres performance today and want to have a more scalable and efficient database, there are a few ways to learn more:

  • Book a call to learn how we can best help level up your Postgres application today
  • Engage our team in an extensive review of your Postgres setup and receive a customized optimization strategy
