3 PostgreSQL Mistakes That Will Cause Outages
Effective Postgres management often involves navigating complex issues that can impact database performance and stability. By examining real-world scenarios I encountered at Coinbase and Nextdoor, this post presents a technical exploration of these challenges, offering guidance to teams seeking to enhance their database operations.
Query Plan Degradation Post-Upgrade at Coinbase
When I was a senior engineering leader at Coinbase, my team was responsible for a large number of business-critical services backed by Postgres databases, all of which needed to be upgraded on a tight schedule. The vast majority of these database upgrades went smoothly, with little or no service impact.
However, for one particularly important service, the database upgrade caused the performance of common queries to drop to such a level that the service became saturated. Queries were so slow that requests were backing up and the service was lagging behind transaction activity on the blockchain.
Diagnosing and Fixing the Slowdown
Investigation with EXPLAIN ANALYZE revealed that the query planner was no longer using indexes, opting instead for less efficient sequential scans. The upgrade had left the planner without up-to-date statistics, so we ran ANALYZE to rebuild them; with fresh statistics, the planner switched back to index scans instead of sequential scans. With query performance back to pre-upgrade levels, we quickly cleared the task backlog and the service returned to normal.
Key tools and commands for addressing this issue (see the example after this list) include:
- EXPLAIN ANALYZE: Essential for identifying whether PostgreSQL is performing a sequential scan instead of using an index. It provides a detailed breakdown of the query execution plan.
- ANALYZE: This command updates the database statistics, allowing the query planner to make better execution decisions. It's a critical step in restoring query performance post-upgrade.
- ANALYZE VERBOSE: Provides progress feedback as each table is processed, including the number of rows sampled, which is useful when you need to tune statistics targets.
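As a minimal sketch of the diagnose-and-fix loop (the table, column, and index names here are hypothetical, not the actual Coinbase schema):

```sql
-- Before the fix: the plan shows a sequential scan even though an index exists.
EXPLAIN ANALYZE
SELECT * FROM transactions WHERE account_id = 42;
-- Seq Scan on transactions ... Filter: (account_id = 42)

-- Rebuild planner statistics for the affected table;
-- VERBOSE reports progress and the number of rows sampled.
ANALYZE VERBOSE transactions;

-- After the fix: with fresh statistics, the planner chooses the index.
EXPLAIN ANALYZE
SELECT * FROM transactions WHERE account_id = 42;
-- Index Scan using transactions_account_id_idx on transactions ...
```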
Avoiding Upgrade Management: Serverless Postgres
Something I wish we had when I was in this situation was a modern serverless Postgres vendor such as neon.tech or Timescale.
Having the database hosting provider fully manage engine upgrades on your behalf removes a massive operational headache. It's not an option in all environments, but is definitely worth exploring.
For those looking to consider alternate vendors or automate database upgrades while minimizing the risk of such issues, StepChange's Postgres Performance & Cost Review service offers a solution that includes automatic statistics updates as part of the upgrade & migration process.
Handling ALTER TABLE Operations in Production: Lessons from Nextdoor
During my time as an engineering leader at Nextdoor, my team was responsible for the production deployment process. During one deployment, we encountered significant downtime during a routine Django migration that changed a column type in a production database. This operation, while trivial in a development setting, caused extensive locking on a table with millions of rows, effectively halting the entire site.
The Core Issue
Directly altering the type of a column in a large production table can lead to extensive locking: for most type changes, Postgres takes an ACCESS EXCLUSIVE lock and rewrites every row in the table, blocking all reads and writes for the duration. On a table with millions of rows, this is highly resource-intensive and time-consuming.
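For example, a Django migration that changes a column type boils down to DDL like this (table and column names are illustrative):

```sql
-- Takes an ACCESS EXCLUSIVE lock and, for most type changes, rewrites
-- the whole table, blocking all reads and writes until it finishes.
ALTER TABLE posts
    ALTER COLUMN view_count TYPE bigint;
```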
Recommended Approach
Based on this experience, the engineering team at Nextdoor adopted several best practices for schema modifications:
- Avoid Direct Type Changes: Rather than altering a column's type in place, the preferred approach is to add a new column with the desired type, as sketched after this list. This method is less invasive and avoids extensive table locks.
- Data Migration Scripts: Use scripts to migrate data from the old column to the new one, ensuring data consistency without impacting database performance.
- Ensure Backward Compatibility: Application code should be compatible with both the old and new schema versions to facilitate seamless rollbacks if needed.
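A minimal sketch of the add-and-backfill pattern (again with illustrative names; batch sizes and orchestration will vary by workload):

```sql
-- Step 1: add a nullable column with the new type.
-- Without a default, this is a fast, metadata-only change.
ALTER TABLE posts ADD COLUMN view_count_big bigint;

-- Step 2: backfill in small batches from a migration script, repeating
-- until no rows remain, so no single statement holds locks for long.
UPDATE posts
SET    view_count_big = view_count
WHERE  id IN (
    SELECT id FROM posts WHERE view_count_big IS NULL LIMIT 1000
);

-- Step 3: once the application writes both columns and reads the new one,
-- drop the old column and rename.
ALTER TABLE posts DROP COLUMN view_count;
ALTER TABLE posts RENAME COLUMN view_count_big TO view_count;
```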
To prevent recurrence, Nextdoor required database schema changes to be reviewed by database experts, introducing a checkpoint that, while slowing down the development process, significantly reduced the risk of operational issues.
Tools and Takeaways
For automated early detection of potentially disruptive schema changes, StepChange DB Guard is a tool that can be integrated with your CI/CD pipeline to identify such issues before they hit production. It's also important to be mindful of more nuanced changes, such as adjusting the precision of numeric types or the maximum size of varchar columns, which can have unexpected impacts on database performance.
Lock Contention and pg_locks NOT GRANTED in Postgres
At both Nextdoor and Coinbase, my teams dealt with high lock contention in Postgres databases, indicated by a high rate of ungranted locks in `pg_locks`. This situation leads to performance issues and can cause outages if not properly managed.
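A quick way to see this in practice is to query `pg_locks` for waiting sessions, for example:

```sql
-- Sessions currently waiting on locks, with the query they are stuck on.
SELECT l.pid,
       l.locktype,
       l.mode,
       now() - a.query_start AS waiting_for,
       a.query
FROM   pg_locks l
JOIN   pg_stat_activity a ON a.pid = l.pid
WHERE  NOT l.granted
ORDER  BY waiting_for DESC;
```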
Example Lock Contention Scenario
A simple but somewhat realistic (if silly) scenario which can cause very high lock contention in Postgres is the following:
- Imagine you have a high-traffic website with content stored in a `pages` table.
- You want to track when each page was last accessed, so you add a `last_accessed` timestamp column to the `pages` table.
- On every page view, your web application runs an `UPDATE` query to set the `last_accessed` value to the current time.
- With high enough traffic, you could be trying to update that `last_accessed` timestamp many, many times per second! This will cause a ton of lock contention in the database.
- The easy fix in this scenario is to change the schema so that you run an `INSERT` on a different table like `page_views` with an append-only data model, as sketched below. This avoids `UPDATE`s to existing rows, which contend with each other for locks.
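A minimal sketch of the append-only alternative (schema details are illustrative):

```sql
-- One row per page view instead of updating a single pages row.
-- Concurrent INSERTs do not fight over the same row lock the way
-- repeated UPDATEs of one hot row do.
CREATE TABLE page_views (
    id        bigserial   PRIMARY KEY,
    page_id   bigint      NOT NULL,
    viewed_at timestamptz NOT NULL DEFAULT now()
);

-- Hot path: one cheap INSERT per page view.
INSERT INTO page_views (page_id) VALUES (42);

-- "Last accessed" becomes a read-side query (or a periodic rollup job).
SELECT max(viewed_at) FROM page_views WHERE page_id = 42;
```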
Diagnosing Lock Contention
To identify and analyze lock contention, Postgres can log lock waits via the `log_lock_waits` setting. This helps pinpoint exactly which operations are waiting for locks and how long they are being blocked. For more detailed insights into lock waits and contention, see the locking chapter of the Postgres documentation.
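Enabling the setting is straightforward; note that a lock wait is only logged once it exceeds `deadlock_timeout` (on managed hosting, you would set this through your provider's configuration interface instead):

```sql
-- Log any lock wait that lasts longer than deadlock_timeout (default 1s).
ALTER SYSTEM SET log_lock_waits = on;
SELECT pg_reload_conf();

-- Resulting log lines look roughly like:
-- LOG:  process 12345 still waiting for ShareLock on transaction 67890 after 1000.123 ms
```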
Mitigation Techniques
Several strategies can be implemented to mitigate lock contention:
- Implement Lock Timeout and Retry Logic: Setting a `lock_timeout` prevents operations from waiting indefinitely for locks, and incorporating retry logic in the application can help manage the impact of lock waits (see the example after this list).
- Optimize Operation Ordering: By analyzing dependencies between operations, it's possible to reorder transactions to minimize locking conflicts.
- Split Transactions: Separating transactions that affect different resources can reduce the number of locks each transaction requires, thus lowering contention.
- Ensure Consistent Lock Ordering: Acquiring locks in a consistent order across transactions helps avoid deadlocks.
- Reduce Concurrency and Connection Pool Size: Lowering the level of concurrency and the size of the connection pool can help alleviate contention.
- Decrease Write Latency: Techniques such as removing unnecessary indexes, optimizing transactions, using heap-only tuple (HOT) updates, tuning the vacuum process, disabling `synchronous_commit` (accepting the durability trade-off), and upgrading database hardware can all contribute to reduced write latency and lock contention.
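As a minimal sketch of the lock-timeout-and-retry idea (the table is hypothetical; the retry loop itself would live in your application code):

```sql
-- Fail fast instead of queueing behind a long-held lock.
SET lock_timeout = '2s';

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;

-- If the lock cannot be acquired within 2 seconds, the statement fails with:
--   ERROR: canceling statement due to lock timeout   (SQLSTATE 55P03)
-- The application can catch this error and retry with backoff.
```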
Great Postgres Tools And Services Are Available
If you are facing similar issues with your Postgres performance or cost, it can be very challenging to solve them without the relevant expertise or tools. Being able to integrate automatic tools to detect & prevent database outages into your CI/CD pipeline is a huge confidence boost for your engineering team. It's also great to have external experts on-call to assist with risky upgrades and migrations - as well as help to resolve any cost or performance issues.
StepChange offers services like Postgres Performance & Cost Review, Data Migration & Modernization as well as Oracle to Postgres Migration. Additionally, we provide a DB Guard product which can integrate with your CI/CD pipeline to find issues before they reach production and cause an outage.
Don't hesitate to contact us if you need any advice or assistance!
Thanks to Casey Duncan, Harry Tormey and Joe Drumgoole for reviewing this post and providing feedback.