Beyond the APM: Gaining Postgres insights for Rails Developers
What's missing from APMs?
Sales are dropping on your ecommerce shop. Customers are getting sluggish response times when they're viewing products for sale.
As a backend engineer supporting your Ruby on Rails and PostgreSQL ecommerce platform, you spring into action using your available tools. Your primary tool is Application Performance Monitoring (APM). You search through the endpoints to find the corresponding API endpoints for product inventory, to confirm user reports of slow activity.
As you expand the segments of request traces, you begin to identify slow SQL queries as the culprits. You've now found some slow queries to work with. While you've found where they are, you're not sure what’s causing them to be slow or what else was happening in PostgreSQL at the same time.
Your APM is very useful, however what you need are insights beyond the APM. You’d like to break down the SQL query execution to see which parts are slow. You'd like to see what else is happening within PostgreSQL. Unfortunately, you're in a time crunch now and company leadership wants updates!
In this article, I'll show you some additional open source tooling you can add to get insights beyond what your APM provides, so that you're better equipped to handle these issues in the future.
Limitations of APM analysis
As the backend Rails engineer using PostgreSQL, you’re concerned about the performance of your HTTP web endpoints and your background jobs.
APMs helpfully present the performance of HTTP and non-web request handling from the perspective of application concepts, like Active Record classes, Action Controller actions, and Active Job background job execution.
However, APMs are not without limitations. To save costs on high-activity applications and high cardinality data, APMs may sample data less frequently or have limited data retention, which means you may lack visibility into 100% of the activity from your users, or have access to the data for a limited amount of time.
Performance problems may only occur for a minority of users, however those users may be your most valuable ones! While your P95 performance may be acceptable, it's critical to get representative traces from those higher percentiles that demonstrate the performance problem. You may be missing those traces. In that case, you missing the information needed to identify the source of the problem.
Another scenario is that you may be progressively improving the performance of your system, moving up to higher percentiles. You want to solve performance problems for 95% of users, 98% of users and so on. Having full fidelity information is critical to achieve that.
For your slow products inventory API, imagine that you've found a slow trace and you're ready to dig in. You're wondering which parts made it slow, and why. What sorts of things would you look for?
Analyzing segments of performance information
APM tools usually represent request processing as stacked layers, attributing each layer to an aspect of the request processing.
As you dig into the SQL query layers, you're wondering how to get more information form the query execution.
Let's recap how we got here. Active Record, the ORM, generated SQL queries from your Rails app that were sent to PostgreSQL. We see their total time from the APM, a client side perspective where Rails is the client application, sent to Postgres, plus the response time. After that, additional processing happens like generating a JSON response.
SQL queries within APM tools are usually presented without their specific parameters. In Postgres, these are called "normalized" queries (or they could be prepared statements), which means the constant values like parameter values have been replaced with numerical placeholders. This means we often can't directly execute the parameterized version to get more execution information. We'll cover that more in a bit.
Next, the APM is not focused on Postgres execution typically, so we can't see what else is happening within PostgreSQL currently, or what queries had high latency in the recent past. There could have been many unrelated contributors to high latency that reduced available system resources, so it can be helpful to get a macro level perspective from the database instance as part of our performance investigation.
The challenges here are that neither individual database query execution, nor system wide or "macro" query performance analysis information is available to help your analysis.
While you're comfortable working with Active Record, these kinds of database performance investigations are difficult to debug. The tools and investigation process is different. Why is that?
Object-relational impedance mismatch
The Active Record ORM is an abstraction layer in Ruby and Active Record over relational structure and SQL.
When working at the application code level, you’re thinking about Ruby objects and business logic like service classes, Active Job background jobs, controllers or serializers.
With databases, performance analysis focuses on different concepts. We’re looking at the execution plans of SQL queries, data characteristics, indexes (See: The Path of a Query) on tables. In general we're concerned with how data is stored and fetched as needed by the application, and how that translates into accessing table row data and index data in pages on the filesystem, behind the scenes. Database pages are the storage unit. They're copied into memory buffers. We're not going into depth with these topics, but just mentioning here to illustrate that they're different concepts to work with in performance analysis. The good news is that PostgreSQL provides excellent visibility into these components once you know where to look. We'll show you how to leverage some of this information in this post.
When working with these different paradigms, you're navigating the object-relational impedance mismatch. Ruby supports imperative and functional programming styles, while SQL is a declarative language.
Since our APM tool is not providing us rich execution context within PostgreSQL, how can we solve that?
Solving sub-problems
To solve this performance challenge, let's break the problems into smaller sub-problems. We want to keep application context available, but we also need database execution information.
Here are some of the sub-problems we're working on:
- Gathering query execution from 100% of the queries in PostgreSQL, beyond limits of APM traces
- For a given SQL query, seeing how Postgres executes it and gathering a breakdown
- From all of the queries on the whole system, finding the most resource-intensive ones
- From our application-generated SQL queries, enriching them with context from the Rails application
Let's look at specific tactics for those challenges.
Gaining visibility into Postgres execution
Queries in PostgreSQL are planned and executed. We’ll use the module pg_stat_statements
(abbreviated here as: PGSS) which tracks 100% of successful query executions for our database up to a configured limit, including statistics about the execution like execution time and the total number of rows returned.
By sorting queries by mean execution time, we can identify the slowest ones. We can also look at queries by their "calls" figure to focus on queries called frequently.
Another dimension is to look at "rows", which is a cumulative figure, so we'll divide that by the number of calls. When a query returns a high number of rows, it may improve performance to add a new restriction at the application layer so that it fetches less rows, vs. a fix within PostgreSQL.
PGSS isn’t enabled by default, so first you’ll need to do that. PGSS adds a small bit of overhead, but the benefits are broadly considered significant and worthwhile to outweigh the minor resource overhead.
While PGSS captures 100% of successfully executed queries, for failed queries we'd need to get those from our PostgreSQL log. We can configure a max number of groups to capture, for example 5000.
Each query is placed into a group, where a query group is the normalized form of the query mentioned earlier. PGSS captures more than 30 fields of data about the group.
With this comprehensive coverage, we can now look across all of our captured queries and conduct a macro-level evaluation of queries on our instance.
However, let’s continue to focus on the specific product inventory query for now. Each query in PGSS is assigned a unique identifier called the queryid
. Using the queryid
, we can link a query to specific executions with parameters from our Postgres log. We may want to configure the capture of slow queries taking more than 1 second. Although this is scattered information, PostgreSQL provides the queryid
as a global identifier across PGSS, manual and automatic query execution plans, and logged queries in the Postgres log, given the appropriate configuration. When all of these are working together, we can gain deep insights into our PostgreSQL queries both at a macro level and at the level of individual queries.
Let's shift gears now to a different sub-problem. What about caveats to what's captured in PGSS? By default, one thing it lacks is application context.
To make PGSS even more evaluable, how can we enrich the query groups with application context?
Integrating application context with SQL queries
When reviewing PGSS query groups, we might recognize table names that correspond to Active Record models like products
or cart_items
, however when we're debugging specific problems, we want to be sure that certain parts of the application generated specific queries.
Ideally we'd like our SQL queries to know this application context:
- Application name
- Ruby class names for API controllers and action names
- Ruby class name for background jobs
- Source code line numbers
We do have application context in the APM. This is a good starting point for investigations driven from the client application. However, what if we flip things around and are driving an investigation from resource-intensive SQL queries in Postgres? How do we navigate backwards from SQL into the application?
Fortunately, we can enrich all (successful) of our SQL queries generated by the application, with the application context above.
From Rails 7.1 onward, we’ll leverage the Query Log Tags feature to do that, while earlier versions of Rails used the Marginalia gem.
Since Rails 7.1 Query Log Tags comments are based on a standard called SQLCommenter which has implementations in other languages besides Ruby as well.
By using standards-based SQLCommenter comments or “tags”, this information can be parsed by tools like PgAnalyze or GCP Query Insights, visualizing application context alongside SQL queries. With SQLCommenter, we’ve solved another one of our sub-problems by adding application context to our SQL queries.
Be aware that these tools are pretty good out of the box, but it’s likely you’ll need to customize them to cover all application queries. Also, since we didn't have these annotations set up before, we can't use them now for our investigation. However, this is a system improvement we want to add for next time.
Let’s switch to another sub-problem. How do we break down the execution of specific SQL queries, getting deep insights Postgres-level execution? To do that, we’ll need another tool.
Breaking down query execution
To analyze SQL query execution, we’ll collect that information from Postgres. We’ll connect to Postgres using rails console
, rails dbconsole
or psql
, where we have access to EXPLAIN
and EXPLAIN ANALYZE
either from Ruby or as SQL.
From there, we'll manually capture a query execution plan from Postgres for our SQL query. We can see both the query plan Postgres expects to do, or we can fully execute the query and see the execution information added in by adding the ANALYZE
keyword.
Query execution plans are rich in information. They show which tables and fields were accessed, whether indexes were used, and lots of details about the costly components.
Let’s return to our slow product inventory SQL query. We want to see how the products table was accessed. Without indexes, PostgreSQL would need to read all table rows in order to filter them down, resulting in a sequential scan. We'd see the sequential scan in the plan. Sequential scans are acceptable for smaller sized data, but as row counts grow, they are no longer optimal and add too much latency.
We're wondering if our product inventory SQL query used a sequential scan, index scan, or something else.
Interpreting and taking action on query execution plans is a whole separate post on its own but we’ll briefly cover some basics here. Imagine the product team has let us know about a recent product change to sort products differently.
When customers review or rate a product, a last_reviewed_at
timestamp is set, and this recent product change leverages this field.
From the EXPLAIN
analysis of the product inventory SQL query, you determined that the ordering of products had a high cost by studying the cost of the ORDER BY
clause. You've realized there's a connection here between the product requirement change and this query execution.
Since you knew that indexes can support ORDER BY
operations, you tried a possible fix by adding an index that covers the last_reviewed_at
field in a pre-prod environment.
With that index in place, you collected another query plan. The query runs much faster now! The index reduced the cost for the ORDER BY
plan component. Alright!
Since this is a Ruby on Rails application, it’s time to get your fix into production. Generate a new Active Record migration file that has an add_index
and use if_not_exists
to create the index in all environments where it doesn't already exist. Remember to use the CONCURRENTLY
keyword since this index is being added to a busy table. Imagine that you've shipped this index, and are back to monitoring the performance of the products inventory API in your APM once again. Everything is looking good.
With our sub-problems and our original query performance problem fixed up, let’s recap what we’ve done.
Wrapping Up
In this post, you saw some ways to take your performance analysis beyond the APM for Ruby on Rails and the PostgreSQL database, leveraging additional open source observability tools, wiring them together, and using them as part of your comprehensive analysis.
Along the way, you're improving the end to end visibility of query performance both from the perspective of your application, and from the database.
To get there, you added the pg_stat_statements
(PGSS) module, application context SQL query annotations based on the SQLCommenter standard, and the Query Log Tags feature in Ruby on Rails. You saw the basics of using EXPLAIN
and EXPLAIN ANALYZE
in Postgres to break down query execution to help understand which parts were costly.
You explored a common slow database query performance issue, due to high latency from excessive filtering, using a fictional ecommerce Rails and Postgres application. You saw where APMs were helpful, and what their limitations were.
By equipping application developers with deep database insights, and comprehensive analysis tools from the application to the database, they’ll be able to more quickly and confidently diagnose and fix performance issues for critical areas of your software platform.
If you're facing performance issues with your Postgres + Ruby on Rails application and want to build a more scalable, efficient solution, here are a few ways you can learn more:
- Book a call to learn how we can best help level up your Rub on Rails application today
- Engage our team in an extensive review of your Rails setup and receive a customized optimization strategy