Scaling Rails & Postgres to Millions of Users at Microsoft: Lessons & Takeaways

Do you have a Rails app built on PostgreSQL and need to scale it to millions of users? In this interview, I speak with Andrew Atkinson, one of StepChange's expert consultants, who brings deep expertise in optimizing Rails applications for performance and scalability.

Andrew's expertise is backed by real-world experience. During his tenure at Microsoft, he played a key role in scaling the infrastructure for Flip (formerly Flipgrid), a video discussion and sharing app built for classrooms and beyond that Microsoft acquired in 2018, to support tens of millions of users.

Andrew is the author of the bestselling book High Performance PostgreSQL for Rails, published by Pragmatic Programmers in 2024. He is a frequent speaker at industry conferences, including PGDay Chicago and RailsConf, and has appeared on several technology podcasts. 

In this interview, Andrew shares valuable tips and lessons on scaling your application to handle millions of users. We cover key topics like fine-tuning ActiveRecord queries, maximizing database efficiency, and integrating Sidekiq for high-throughput background job processing. 

If you need help optimizing your Ruby application’s performance, our team at StepChange includes some of the world’s best experts. We specialize in Rails performance and scalability, ensuring your application can handle growth and deliver exceptional results. Contact us today to start solving your performance challenges.

Full Transcript

Harry Tormey:

Hey everyone, my name is Harry Tormey. I'm the CTO of stepchange.work. We are a boutique consultancy with the world's best experts solving application and database performance issues. Today we're going to be talking to Andrew Atkinson, who is a Postgres expert who works with StepChange on some of our commercial engagements. And today Andrew is going to be talking to us about the world of Rails, Postgres performance and

all of those sort of good things. So Andrew, over to you. Why don't you introduce yourself?

Andrew Atkinson:

Thanks for having me on, Harry, excited to be here. Yeah, so I'm Andrew Atkinson. I've been a long-time software engineer, more than 15 years. And the last five or six or so, I've been really focused on Postgres as a specific relational database, but really from the perspective of usage with applications. Having worked at different companies and on different teams, I felt like there was

kind of a specialization area that was often needed as companies reached a certain level of scale, where, you know, it would be interesting to offer that on teams I'm part of, or now I'm offering it as a consultant as well. And it's been a pretty fulfilling process to go through and work on building these very specialized skills. And then, you know, meeting with different companies through this consultancy practice that I'm...

performing now for different companies. But yeah, I guess, you know, my general background as a technologist is having been an application developer, mostly web applications. Some of the bigger companies have been Microsoft and Groupon, and then a lot of startup companies, a lot of kind of venture-backed, internet-scale companies that are in a growth mode, and then going through that journey a couple of times at different places.

Harry Tormey:

That's a really awesome background, like I mean, and it's great that you've worked on Rails performance issues at Microsoft, as well as kind of early stage startups. So tell me a little bit about how you first got into database performance issues. How did you cut your teeth, Andrew?

Andrew Atkinson:

Yeah, I've since learned the term "accidental DBA" from Charity Majors. And I was on a small team where we were running into... we were kind of looking at performance holistically as we were scaling up. That was, you know, things outside of Postgres and the application as well, but it could be other databases or, yep.

Harry Tormey:

And just as sorry to interrupt you, but what was the application stack that team was working on, Andrew?

Andrew Atkinson:

Yep. Yeah. That was a Ruby on Rails monolithic application running on Postgres 10, and we were deployed on AWS with Elastic Beanstalk and RDS. And where I really cut my teeth was there was this opportunity. There was both a need and an opportunity to really help the team in this specific way as we were all kind of scrambling with past

load that we had experienced, and then knowing that future load was coming, but in a quiet period. That's not usually the case, you know, usually you don't have that sort of up and down pattern, but in this case the reason was because it was an educational app used in K through 12 schools. So there was big load during the spring semester, or, you know, the spring time period before school was out, and then almost nothing at all in the summer time. And that's when I was primarily performing this work,

but we knew that we were performing it because in fall we expected a huge, you know, crushing load of traffic again. So our pieces that didn't scale well were gonna not scale well again. And so we...

Harry Tormey:

Seasonality is such a huge thing with educational software. And could you tell us a little bit about the read and write characteristics or other interesting things about this educational application, just so our viewers understand and have a mental model of what the database's needs would have been?

Andrew Atkinson:

Yeah, definitely. And I did actually do some product work too. We were about five or six total engineers on that team, so we didn't really have divisions like an infrastructure team and a product team and that sort of thing. Later on, as the team size grew, there were those kinds of divisions, but I was mostly working as essentially an application DBA in that summertime period,

and I wasn't working super efficiently because I was learning a lot, and I had this opportunity to learn and apply what I was learning there. But going back to the way the application worked, the reason I was mentioning all that is because later on I actually did more product development, but I kind of went back and forth a bit, which is actually how a lot of my career has gone: I've kind of gone back and forth a bit between application development and then more of an infrastructure focus. And the application was...

a video sharing app. So sort of like TikTok or something like that, but imagine one that's designed more for kids. So it's going to have a really fun and silly UI, you know, a lot of stickers and effects and things like that, which I guess TikTok does too. But it's going to be used in a real specific way that's more meant to be a somewhat private interaction between a teacher or educator and their students. And it would augment the classroom learning. So it would be

creating a video about a topic, asking for video responses from the students, creating some engagement, and being able to do that using the internet and using mobile apps. And so the patterns then were, of course, writing in a lot of video metadata and comment-type data about the videos. And then a lot of it I would actually classify more as kind of a social media style app, where

a lot of the problematic patterns that we ran into were social media interactions, like likes and views and comments and that kind of thing. And making sure that interaction, which was coming in really fast all the time, was performing well, so it was efficient with millions of interactions. And

Andrew Atkinson:

So yeah, a lot of the way that sort of shook out, which I can get into more, is with relational databases. The challenge there is relational databases provide a lot of strong guarantees about your data and about the transactions that you're sending it. And so there's a balancing act there where the more guarantees that are provided can impact your concurrency capabilities from your client application, or how you use the database. So we were looking at things like

better understanding the impacts of multi-version concurrency control when we just update a column on a row. So like this particular video, its relational database table row has a view count. We want to just update the view count. Turns out that's pretty hard to do when you're doing that millions of times in a short time period. And so, yeah, just learning about why that is in Postgres, what our alternatives are,

how we can optimize in place, those sorts of things were a lot of what I was focused on at that time period.
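To make that trade-off concrete for readers, here is a minimal sketch of the two approaches under discussion. The model, table, and job names are hypothetical; the "naive" version updates the hot row on every view, while the alternative records cheap insert-only events and rolls them up outside the request.

```ruby
# Naive approach: every page view issues an UPDATE against the same row.
# Under heavy concurrency, transactions contend for that row, and each UPDATE
# creates a new row version that autovacuum has to clean up later (MVCC).
class VideosController < ApplicationController
  def show
    @video = Video.find(params[:id])
    @video.increment!(:view_count) # one row UPDATE per view
  end
end

# One alternative: write cheap, append-only view events, then fold them into
# the counter on a schedule, so the hot row is updated far less often.
class RollUpViewCountsJob < ApplicationJob
  def perform
    cutoff = Time.current
    events = VideoViewEvent.where("created_at < ?", cutoff) # hypothetical events table

    events.group(:video_id).count.each do |video_id, count|
      Video.where(id: video_id).update_all(["view_count = view_count + ?", count])
    end

    events.delete_all # events already folded into the counters
  end
end
```

The point of the sketch is only the shape of the change: turning millions of contended UPDATEs into inserts plus a periodic batched update.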

Harry Tormey:

So okay, to kind of read this back to you: Rails application, and probably the ORM is a very compelling part of this from an application developer's point of view, right? Like you get to

write code really quickly without thinking too deeply about the database. One of the use cases in an application like Twitter or YouTube is how many people looked at the thing. In order to keep that updated and accurate, you have to do a lot of writes to the database. That may not be apparent to the application developer. Can you talk us through maybe some features from an application developer's point of view that would be

problematic, right? Situations that you found yourself in when you were learning about all of this stuff where maybe you rolled out a feature and you're like, shoot, I didn't put this thing into the schema. This isn't a document-based thing. And maybe we don't have, I don't know if Postgres had their JSON thing back then, but we need to iterate on the schema. Can you walk me through a few of those scenarios that would screw up an application developer? An application developer might not be thinking about that ahead of time when they're building their feature.

Andrew Atkinson:

Yeah, it's a good way to think about the tension, I guess, between kind of the application developer's needs and the persistent store. Like the data store is all about: I'm making guarantees that, you know, if you send me data, I'm going to jump through lots of hoops to make sure that it's gonna be there, even if the database restarts, and, you know, I can give you lots of things like the ability to replicate it to other instances and things like that.

But it's a different set of problems being solved than "I need to build a video feature, and wouldn't it be cool if at a certain timestamp I could leave a comment," or something like that. You know, that's a whole different set of problems that a product developer is looking at. And yeah, with an ORM, which is Active Record in Ruby on Rails, or really any ORM, the idea is it's an abstraction layer on that interaction with the database as we send, you know, read and write queries to the

relational database, and by design the product developer doesn't need to think about or even really know much about how that works. And that actually works really well for a long time. Like Postgres and other relational databases, they have a relational model, but they're general-purpose data stores. And what I've seen, and what I talk about with

some clients that have hired me, is understanding, like you said, thinking about some of the ways that certain features might play out in a non-optimal way in a relational model. And actually that's more of what your question is about, so let me go into that first. So I would say, you know, there's a combination of things to look at. One, first of all, is kind of just the instance that

is serving the traffic, right? Sometimes it kind of boggles my mind how we generally have one primary Postgres instance that's serving all of this traffic, all of these write operations, concurrent write operations from lots of different users as they're generating activity from an application, creating objects, metadata, comments, you know, just doing things basically, that all needs to be persisted. And then simultaneously we're serving

Andrew Atkinson:

the most up-to-date version of whatever's happening out to possibly, to likely many... sorry, what's the word I'm looking for? Well, I wasn't saying we're serving data out, but at an order of magnitude higher typically, right? Like we commonly have something

Harry Tormey:

Are you talking about having multiple read instances?

Andrew Atkinson:

It can be like a 10-to-one ratio of read activity to write activity; that's not uncommon with a lot of web applications. So we need to serve this data back. And this is all, you know, typically one server instance, though there might be read replicas and other things to split traffic out to.

Harry Tormey:

So, just to kind of pause you there, from a product developer point of view, because especially with early stage startups, you know, after the thing has been built, it's easy to go back in retrospect and say, we've done all of these experiments and we now know that the product is going to look like this. But when you're actually starting to try and figure out the sausage recipe, you're iterating and doing all sorts of crazy stuff and it's total chaos, and it's hard to have a sort of defined thing, especially for consumer applications, which I've worked on,

Andrew Atkinson:

Yep.

Harry Tormey:

to optimize. So to give people a mental model about when they're starting to enter into the danger zone, would you say typically it's things that are very read and write heavy, right? Say maybe an interactive chat app that would cause performance issues easily, versus things that are maybe more read-heavy, that's kind of less so. Is it kind of like read and write, or can you get into trouble with both? What would be, from a feature point of view, canonical things, like posting on a forum or just looking at static content

that's being rendered but maybe has to query the database because it's dynamic? Are there any patterns in product development that you could use to give people a mental model of: you're starting to put out this new feature, this might be a little bit tricky, you might want to take a step back and kind of talk with your CTO or some more senior engineers about whether this is going to be an issue, and should we do some sort of more cautionary rollout?

Andrew Atkinson:

Yeah, it's a good question. And I guess I've been so focused on typically starting from Postgres and then kind of working back towards the application that I don't tend to think about it as much these days. But I do think, I mean, in terms of categories of applications, real-time applications, there certainly are examples where folks do use Postgres for, you know, capturing kind of like a very high ingestion rate stream of changes, like

you might find with a real-time application, like a very heavy write pattern. But I think doing that, again with Postgres being a general store, doing that without pretty significant knowledge of the ramifications of that in Postgres is not going to work very well. It's possible to do, but I think you need to really understand essentially the operational model for heavy inserts and then, you know, maintenance and a bunch of other things. So

a lot of my career has been on, I like the term information systems, but, you know, you could call them... or maybe it's because more recently I've worked more on multi-tenant, kind of business-to-business apps. They don't really have these huge spiky patterns that social media style apps or maybe real-time types of apps might have, or, you know, flash sales or things like that.

When you are dealing with those, I guess those are some of the categories where, with the relational data model, you're going to likely run into problems faster. And I think a lot of it is the concurrency and the contention for shared resources. And, you know, as someone who has put a lot of energy into how you can better leverage Postgres to handle those situations,

where my brain goes is just like, just run two different instances, you know, like segregate your traffic. There's sort of, again, as a general store, I think the fun part about being kind of solutions-focused with Postgres is there's just so many things you can do with running different instances and configuring them differently. I can talk about that more, but the way that a lot of client companies I help

Andrew Atkinson:

are using Postgres, or a lot of teams I've been part of, they don't have that kind of mindset or that sophistication level, certainly not to start out that way. So it's often about, okay, well, where are things at now? We have these application patterns, heavy writes and reads to the same table, thousands of concurrent clients accessing it. We're running into...

Andrew Atkinson:

You know, and then meanwhile, developers are creating new features and modifying the table in place, like with schema evolutions that are kind of rolling out and being deployed and we're a modern DevOps organization, so we run multiple releases per day. Those are usually the circumstances where I tend to get involved, I think, and trying to juggle those things and just kind of accept where things are at at the moment.

Harry Tormey:

Mm -hmm.

Andrew Atkinson:

Helping clients see maybe what things could look like in the future, but really it's usually kind of addressing the current state, you know, one instance serving a lot of different scenarios.

Harry Tormey:

So let's take

a step back and talk about maybe one of these scenarios, because I think it's very applicable to both the B2B use case that you talk about as well as maybe the more consumer one that suddenly is being featured on Hacker News or is on CNN and is getting tons of traffic coming in, which is the sudden spikes of traffic. And maybe talk through: where does a Rails monolith on Postgres start to break, and what are the things that you can do to sort of break

out into maybe read and write instances and get the most bang for your buck on RDS? Talk me through that sort of scenario, when you've gotten brought in to deal with that.

Andrew Atkinson:

Sure, yeah, some of the categories definitely are connections from the application to the database. And then looking at the operations, like the write operations, and a certain path for optimizing those, which could be moving, if the tables are very large, to partitioned tables. And that's because you can have less contention, because you can have different buckets. The way in Postgres with

partitioned tables, you just have more tables. You just divide up a big table into a set of smaller tables. And then on the read side, again looking at the operations, the read operations, there's a pretty well-known path there of creating a read replica. And in Postgres you can actually do a couple of different types of replication techniques: you can do physical replication or logical replication. With physical, you just get a full copy of your server instance running in a read-only mode.
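Picking up the write-side point about partitioned tables: Rails does not generate partitioned tables for you, but you can issue the DDL from a migration. A minimal sketch, assuming Postgres 11 or newer for declarative partitioning with indexes on the parent; the table and column names are hypothetical.

```ruby
class CreatePartitionedViewEvents < ActiveRecord::Migration[7.1]
  def up
    # The parent table holds no rows itself; each monthly partition is a
    # smaller table, so writes and maintenance contend over smaller buckets.
    execute <<~SQL
      CREATE TABLE view_events (
        video_id    bigint      NOT NULL,
        occurred_at timestamptz NOT NULL
      ) PARTITION BY RANGE (occurred_at);

      CREATE TABLE view_events_2024_09 PARTITION OF view_events
        FOR VALUES FROM ('2024-09-01') TO ('2024-10-01');

      CREATE TABLE view_events_2024_10 PARTITION OF view_events
        FOR VALUES FROM ('2024-10-01') TO ('2024-11-01');

      -- Creating the index on the parent propagates it to each partition.
      CREATE INDEX ON view_events (video_id, occurred_at);
    SQL
  end

  def down
    execute "DROP TABLE view_events CASCADE"
  end
end
```

New partitions have to be created ahead of time (by hand, a scheduled job, or a gem), which is part of the operational model Andrew alludes to.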

Harry Tormey:

Mm -hmm.

Andrew Atkinson:

So as long as your application framework or another library that you use allows you to do this, you can split any of the read queries to that separate instance. And then you can separate the load that would have been on that primary instance out to its own instance. So that could be, you know, if you've got a 500 gigabyte Postgres instance, like its memory that's available, you could have a second one

that's solely for the purpose of reads. And there's some caveats to that, of course, with replication. It takes a bit of time to send a lot of bytes of data over the network to another instance. But there's even some modern capabilities from different cloud providers where they've nearly removed replication lag from the equation with separating compute and storage. So I've worked on a team that used

those kinds of technologies, where we could more easily separate our read traffic from our write traffic. And then I think there's a lot of interesting performance... So, as far as, I guess, let me go back a bit to your original question about some of the points where you run into issues. The connections one I mentioned: typically the application will have

database connections to Postgres, and those are fairly costly. They have a process forking that needs some instance resources, some memory allocated to that process. Because of that, there's some latency incurred whenever a query needs to run and a process is forked. And to get around that, we can keep kind of a set of connections open and in an idle state that can be reused.
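For reference, Active Record already manages that idle pool per process; the pool size is set in the database configuration, and you can inspect it at runtime. A small sketch; the numbers shown are illustrative.

```ruby
# Each Rails process keeps its own pool of reusable Postgres connections.
# The pool size typically matches the thread count (e.g. Puma threads plus
# any background threads that touch the database).
stats = ActiveRecord::Base.connection_pool.stat
# => { size: 5, connections: 5, busy: 2, idle: 3, dead: 0, waiting: 0, checkout_timeout: 5.0 }

# Total connections hitting Postgres is roughly:
#   (app servers * processes per server * pool size) + background workers
# which is why per-process pools that look small can still exhaust
# Postgres's max_connections once you scale out horizontally.
puts "busy=#{stats[:busy]} idle=#{stats[:idle]} waiting=#{stats[:waiting]}"
```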

Harry Tormey:

I'm going to have to jump in here and ask a question: any thoughts on PgBouncer, or things that people should be thinking about when they're managing these expensive connections, and maybe stuff about limitations on that that people should be mindful and aware of?

Andrew Atkinson:

Yeah, for sure. So typically, at least with Postgres, as you're scaling up you don't really need connection pooling software or connection management software outside of what's provided in the framework, because it does its own kind of connection pooling and reuse. But as you're nudging me towards: commonly, again, because these connections are costly, and if you have a social media type of app,

it's likely that you could get into the thousands of concurrent connections that are actively performing work. PgBouncer is one of a set of softwares that are called connection poolers, and they act as kind of a proxy layer between the application and the database, and they can more efficiently reuse connections by kind of pumping more transactions through them, possibly. But there's a bunch of caveats. Like with PgBouncer, there's several

pooling modes that it makes available, and there's a tension between getting more transactions processed on the same set of connections but taking away some of the features that you might need on your client application side. So things like prepared statements or advisory locks, some of the things that you might otherwise be able to use, you can't use with certain pooling modes with PgBouncer. And

that's another piece of software you need to run as well. So that's a bit of work to set that up. And for that reason, some of the cloud providers either offer a variation of Postgres with a connection pooler built in, or they offer a connection pooler as a service that you can add. That's of course going to be configured already for you, without network connectivity issues to your Postgres instance. And that's going to allow you to then reach greater numbers of concurrent connections without

running into issues that you might run into if you connect directly to Postgres.

Harry Tormey:

That's a really good answer, Andrew. So before we resume the conversation: what are the top analytics or telemetry tools that people should configure, especially if they're thinking about bringing in an expert like you to help deal with the issues that they're having with their application and database performance?

Andrew Atkinson:

Yeah, for sure. There's reactive, and, you know, it's a cliche, but the crawl-before-you-walk sort of concept. I think initially it's a reactive process: you're writing your application code and building your features that are attracting usage. You know, usually clients that come to me, they are successful, which is great. Their business is growing and they have users, and, you know, a lot of new startups,

that's what they're fighting against initially: just, no one knows about our product, where are our customers, how do we sell to them efficiently and all that stuff. That's your first challenge to solve. But if you've reached that successful point where you have usage, then from a reactive point, the very first thing to try to get in place, in my recommendation, or my opinion, is pg_stat_statements, which is a well-known... it's

delivered as an extension, but it's part of core Postgres, so it's called a module. But the idea is that you have all these queries running, and it provides you visibility into what are our most resource-intensive queries, or queries accessing and writing the most bytes of data, returning the most rows, that sort of thing. So it's really critical to say, where should we be spending our time?

And what you can do then is: you enable it, it starts to capture your query statistics, and then you can use SQL to query your own statistics. And you can say, let's look at what our most problematic queries are. That's definitely a reactive approach, though. But what you're doing is you're trying to identify where to spend your optimization efforts. And then once you've identified that, another thing I recommend is...

what's cool is there's a standard around this now, called SQL Commenter, with support in a variety of frameworks and languages. They've looked at Python and Django; there's also support in Ruby and Rails, there's support in Go and some other languages and frameworks. And what that does is, as your queries arrive to Postgres, before they're sent from the application, you can kind of annotate or decorate your SQL query text with

Andrew Atkinson:

information about the application. And that's going to be super helpful when you start from a reactive approach and you look at queries to know where your application code is that generated those queries.
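For readers who want to try both ideas, here is a small sketch. It assumes Rails 7.0 or newer, where the SQL Commenter style annotations Andrew describes ship as "query log tags," and Postgres 13 or newer, where pg_stat_statements exposes total_exec_time and mean_exec_time.

```ruby
# config/application.rb -- annotate outgoing SQL with application context so
# queries found in pg_stat_statements can be traced back to controllers/jobs.
config.active_record.query_log_tags_enabled = true
config.active_record.query_log_tags = [:application, :controller, :action, :job]

# Then, from a console or maintenance script, rank queries by total time spent.
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT calls,
         round(total_exec_time::numeric, 1) AS total_ms,
         round(mean_exec_time::numeric, 1)  AS mean_ms,
         rows,
         left(query, 120)                   AS query
  FROM pg_stat_statements
  ORDER BY total_exec_time DESC
  LIMIT 10;
SQL
rows.each { |r| puts r.values.join(" | ") }
```

The extension has to be added to shared_preload_libraries and created with CREATE EXTENSION pg_stat_statements before the view returns data.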

Harry Tormey:

Super, super important, because if you're not a database person and you're used to basically dealing with a user's object that has God knows how many joins or whatever in it, or other objects that are causing problematic query patterns, being able to connect what you're seeing in the telemetry to the application business logic is super critical. So please do continue, that's an awesome point.

Andrew Atkinson:

Yep. Yeah. And really, before query optimization... you know, over now seven or so months of consulting, I have really gained a bigger appreciation for how challenging and, in a sense, costly query optimization can be when you actually are faced with: here's the query, we can't change it; here's the data, we can't change that, of course.

It's not easy to do, even as someone who really focuses on this a lot. And that's not to say that I can't do something quickly, but where I'm going with this is: sometimes you can make more impactful changes by going back out to the client application and saying, what can we do here? Right? And so a lot of times, for example, if you have a very costly query,

or a very costly operation that's reading some data, maybe it's doing a count of rows or it's just returning a huge result set or something. Before we even go down the path of what is arguably costly query optimization, can we make a trade-off of read versus write and say, let's go and keep a counter cache of the counts of this particular query? It's very inexpensive to store this data.

It's inexpensive to compute it offline outside of a user's request. And then what we can do is when we query that, it's super fast, of course. We're just reading one column from like a table row.
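A minimal sketch of the Rails counter cache idea Andrew is describing; the model and column names are hypothetical. The cached column is maintained at write time, so reading the count becomes a single-column read instead of a COUNT over a large table.

```ruby
class AddCommentsCountToVideos < ActiveRecord::Migration[7.1]
  def change
    add_column :videos, :comments_count, :integer, default: 0, null: false
  end
end

class Video < ApplicationRecord
  has_many :comments
end

class Comment < ApplicationRecord
  # Increments/decrements videos.comments_count as comments are created/destroyed.
  belongs_to :video, counter_cache: true
end

# Backfill existing rows once, e.g. from a data migration or a console session:
Video.find_each { |video| Video.reset_counters(video.id, :comments) }

video = Video.first
video.comments.size   # counter-cache aware: reads comments_count, no COUNT(*) query
video.comments.count  # always issues SELECT COUNT(*) against the comments table
```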

Harry Tormey:

And just from an application point of view, do you really need to know that 1,523,602

people were actually watching the video, or does it need to be roughly accurate, is always kind of a question. Can you get away with not actually touching the database? I suppose a question that I have, because I'm not as familiar as you are with Rails, is how hard is it to change the configuration of your Rails app to take advantage of a different database setup? Like you're going from, we have one Postgres server, it lives here, to, hey, we've done some analysis, and this

Andrew Atkinson:

for sure.

Harry Tormey:

is going to be read replicas and this is where you go to write. How hard is it to configure and what do you do to do that?

Andrew Atkinson:

Yeah, well, the good thing is, and part of why I think Rails has continued to stay relevant and successful despite being created about 20 years ago, is it is in use at high-scale organizations like GitHub and Shopify. And we've benefited as Rails developers from lots of big things that frankly aren't really needed at smaller scale, but are needed at bigger scale. So for example, a tiny one, based on what we were just talking about, is

with your Active Record code: there's a couple of different Active Record methods to get the count of a set of things. So you might put a SQL query together, although with Active Record, where you read a particular table, you join to some other tables, you do some filtering on columns within those tables, attributes in your object model, and then you do a count of those things. There's this cool thing: if you're willing to change,

or use from the beginning, this size method instead of count, what that does is it lets you later on choose to use a counter cache feature that's built into Rails, where you just slap another column on your table that's a cache column, and the size method is counter-cache aware. So if you just add that column and you populate it, then you can take advantage of it with no changes to your application code, and you can

save yourself what would be a costly read query by making a very low cost read query, for example. A bigger one, a bigger change would be what ActiveRecord provides for both segregating your read and write traffic to different instances. As of ActiveRecord 6, which came out in around 2019 or 2020, they support multiple databases and then that's been leveraged in a few different ways. So we can set up writing and reading roles.

and we can configure the application models to use one of those roles, like writing and reading, depending on what it's trying to do. But even beyond that, there's automatic role switching, where if you know that you're accessing a read-only endpoint, like you want to read data and you're hitting a GET HTTP endpoint, then you can automatically have

Andrew Atkinson:

the application code know to go to a reader instance if it's been configured. And there's some configuration options in there, like allowing some replication lag and that sort of thing. And that can spare you from having to make a lot of application changes, but then getting that benefit of read and write segregation. And then the last thing, which is very similar to that, is again with multiple databases, and these are logical databases that could be on the same

physical server instance or separate instances: ActiveRecord added sharding support, and that's going to be application-level sharding. But let's say you have very heavy write traffic and you want to run a couple of Postgres instances to segregate those writes. Then from the application level, you can set up configurations to map application models to those different instances, and you can call them different shards.

It's a generic mechanism, so you can kind of do different things. But one pattern there I've seen where that's really useful is, again, with the B2B business-type apps. They might have the exact same schema design. So let's say it's like a time-tracking software or something. You know, it's got a table to store the hours that you've worked, and whether you've created an invoice and whether you've received a payment and whatnot. There might be the same schema design

for customer one and customer two, but customer two might have much greater data volume and query volume. You can use the horizontal sharding feature in ActiveRecord then to give each of those customers their own Postgres instance. Maybe they have small instances, but they can then be scaled independently too. So you could say customer two, maybe they need a lot more CPU, a lot more IOPS and that sort of thing. So that's a capability you have in the framework, which is pretty cool.
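A compressed sketch of the Active Record 6+ features being described: roles, automatic role switching for GET requests, and horizontal shards. The database and shard names (primary, primary_replica, tenant_two) are hypothetical and would need matching entries in database.yml.

```ruby
# app/models/application_record.rb
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  # A default shard plus one hypothetical high-volume tenant, each with
  # writing and reading roles.
  connects_to shards: {
    default:    { writing: :primary,    reading: :primary_replica },
    tenant_two: { writing: :tenant_two, reading: :tenant_two_replica }
  }
end

# config/environments/production.rb -- automatic role switching:
# GET/HEAD requests read from the replica unless a write happened very recently.
config.active_record.database_selector = { delay: 2.seconds }
config.active_record.database_resolver = ActiveRecord::Middleware::DatabaseSelector::Resolver
config.active_record.database_resolver_context =
  ActiveRecord::Middleware::DatabaseSelector::Resolver::Session

# Pinning a block of code to a role or shard explicitly:
ActiveRecord::Base.connected_to(role: :reading)     { Report.order(:created_at).last }
ActiveRecord::Base.connected_to(shard: :tenant_two) { TimeEntry.where(customer_id: 2).count }
```

The delay option is the "allowing some replication lag" knob Andrew mentions: writes within that window keep the request on the primary.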

Harry Tormey:

That is really cool.

So Andrew, let's say I'm an application developer working in a Ruby shop and I've gone in and I've looked at and identified the top seven queries that are causing performance issues. And I want to go take those queries and try and figure out what in the application business logic is causing these problems and then fix them. Can you maybe walk me through some ideas or suggestions for that person that's trying to make that journey?

Assuming, of course, that they've done some of the things that you've talked about: setting up pg_stat_statements, telemetry, tagging, all that kind of stuff.

Andrew Atkinson:

Yep. Yeah, I think the, you know, going back to what we said before, sometimes there is more of a way to solve the problem without technology, but more with conversations with the product management or the product stakeholders. And, you know, for example, there might be a screen that is showing a bunch of things, but there's no pagination. And we can go and say, like, can we introduce pagination here?

we're just gonna show 10 results. The user's gonna have to click a little bit more if they need to get more data, but that's gonna allow us to deliver a much more reliable and scalable solution as data volumes grow and query volumes grow. And that sort of mindset, there's lots of ways to try to apply it. Kind of scrutinizing what is being accessed and making sure there aren't any unnecessary tables being joined,

there isn't any unnecessary sorting and ordering, there aren't any unnecessary fields being accessed. Like one thing I did recently was working on optimizing a query where the table had something like 75 fields, and it was only being accessed in this one particular place, but it was called at a high volume, where the data from that table was serialized into a JSON response. And

Harry Tormey:

Mm -hmm.

Andrew Atkinson:

When I looked at it, fewer than half of the fields were actually needed. And so we cut it down, and we were able to first confirm with SQL that an equivalent query that wasn't using select star, but was using an explicit list of fields, would be much faster. And then I was like, well, why? And the reason was because one of the fields, beyond the ones that were ultimately accessed, was a JSON field with a big amount of data in it. And

Postgres is flexible in that it lets you store up to a gigabyte of data in a text field. And so what can happen is, if you have a select star pattern in your application, then one day another developer might come along and add a gigabyte of text data in a table, and now your query that was otherwise performing fine is quite slow, because you're just moving a lot of bytes of data over the network. Yeah.
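A small illustration of that point with Active Record; the model and column names are hypothetical. Preferring an explicit column list, or pluck, over loading every column keeps a large JSON or text column out of the query entirely.

```ruby
# Loads every column, including a potentially huge JSON/text column,
# just to build a small response:
events  = Event.where(account_id: account.id).order(created_at: :desc).limit(50)
payload = events.map { |e| { id: e.id, name: e.name, created_at: e.created_at } }

# Same result, but the generated SQL selects only the three columns used,
# so far fewer bytes cross the network and fill Postgres buffers:
payload = Event.where(account_id: account.id)
               .order(created_at: :desc)
               .limit(50)
               .select(:id, :name, :created_at)
               .map { |e| { id: e.id, name: e.name, created_at: e.created_at } }

# Or skip model instantiation entirely and get raw arrays back:
payload = Event.where(account_id: account.id)
               .order(created_at: :desc)
               .limit(50)
               .pluck(:id, :name, :created_at)
```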

Harry Tormey:

and serializing them into a JSON object. I mean, that's a really cool example of something very simple, just looking at what you, what are you actually using? Because again, if you're a startup and maybe you've kind of rammed all of this stuff into an object that actually after all of the ABE tests have been resolved and all of the kind of user feedback sessions, it's actually just these couple of things that we need. And we forgot to get rid of that, but hey, loads of people are using this feature and there's this just

and

Harry Tormey:

thoughts there.

Andrew Atkinson:

Yeah, for sure. Indexes for me remain one of the most fascinating things about working with relational databases. They really are just meant to improve your read operations. They give you this more efficient-to-access data structure for your read operations, compared with just reading all the rows on a table and trying to do a sort with kind of unordered data. It is accessed sequentially, but an index is

giving you a precise way to look up the data that you need, which is what indexes do. The trade-off there, though, again, is it's a challenge to make a targeted index. So a lot of times what teams do is a best-effort attempt at creating indexes. This is my perspective that I've seen, I guess, over the years. People try to make sensible decisions, like, for this query

there's a few rules of thumb for index design. You know, if you're accessing a table, but you're accessing a portion of it, so you've got a where clause in your query, it's sensible to have an index where that same column that's in the where clause is defined as a leading column in the index, for example. So sometimes with those kinds of guidelines, people might just start adding indexes. It's not through bad intentions, but I think

What I try to get clients to be aware of is to really have good visibility into whether their indexes are being used. And often what happens is they don't necessarily have that visibility. And I think a lot of folks on product teams as well, again, it goes back to what we've been talking about a lot in this show that product and application developers are working with an abstraction layer.

they're trying to make their queries faster by adding indexes that they think will be helpful. And they might actually even be. But the other reason that makes this difficult is, with indexes, Postgres is making a runtime decision as to whether or not to use an index. And it's often comparing what's being requested in the SQL query. Well, it's not "often," this is what it's doing. And it's

Andrew Atkinson:

What I meant by often was: it's often choosing between alternative plans and seeking out the lowest cost plan. And depending on your instance resources and your data distribution, so values even within a particular column, for example, can be significant. On one instance, like let's say your staging instance, your non-production instance, you might not have indexes picked, or you might have indexes picked, and then you might have different behavior in your production environment.

So that kind of non-deterministic behavior, I try to keep that in mind, and try to use that as a source of empathy in a way, in that it is hard. Like you need to have good visibility to know that your indexes are used. And really where you care most is production. You want your queries to be efficient in your production environment. So with all that lead-in context in mind, what I try to do is

get folks to know... or commonly what I see is there are a lot of unused indexes. So again, people were making their best attempt at optimizing their queries, or even adding indexes they thought would be used based on some features that were coming. Or applications evolved over time, and queries that did use indexes before are no longer using those same indexes, or those indexes have been made redundant by newer variations.

Harry Tormey:

Mm -hmm.

Andrew Atkinson:

All of these sorts of reasons contribute to... it's not unheard of for me to see that 50% of the total indexes are just not used at all. And the impact of that with Postgres, the problem with that, is then we're unnecessarily slowing down all of our write operations anytime those fields within that index are being touched, because they need to be maintained in that index. And then we're never using those indexes; they're never being read.

Harry Tormey:

So just, yeah, go on, continue.

Harry Tormey:

So just to pause you, because I think there's an important maybe basic, maybe too basic point, but can you explain what an index is and why it makes looking up information faster? And the implications of if you have an index and you're not using it, why this slows down the database? Just so people have a bit of a clearer mental model and we're explaining from first principles.

Andrew Atkinson:

Yep. Yeah. So the simple way that I think of indexes, because we're not going to get into the B+ tree design and stuff like that, is we're basically lifting out, let's just say for a single-column index... We're filtering on the users by their email address. We have a users table with tens of millions or more records in it. We want to look up a single email address.

The users table has a lot of other fields in it. It could even have 75 fields of data. So one option is we could read all of the fields of all of the rows and then we could essentially in memory work through that list and find it. And the index, primarily one of the benefits of why it's faster to access data is it's really just gonna have the email address in it in this example. It's not gonna have all these other fields.

So right away we have much less data that we're gonna access. And again, with the relational database, we're reading data off of a disk of an operating system somewhere. So at the end of the day, disk access speeds are one of our fundamental limiting factors. And then, yep. And then the other reason is because the...

Harry Tormey:

and paging into memory.

Andrew Atkinson:

the heap access, or the table access, isn't this sequential, or ordered, representation of the data we're trying to access. With indexes, it's an ordered representation designed for access. Thus, anytime we access data, again, we're always trying to do that as efficiently as possible. We want to try to

essentially make a judgment call as to what's the most distinct piece of data we're trying to access. Commonly with index design, that's what we're trying to do: say, amongst all of our table rows, what are the conditions for the smallest proportion of the data? Because then, if we can put that in the index, we can just laser-focus in on exactly what we need. The trade-off, though, of course, is the index is maintained

at write time. So once that original user was written in, we're paying that cost of having their email address put into the index. So...

Harry Tormey:

Well, yes, it's computing and also space, right? You have to kind of create the index that presumably takes some amount of operations and then store that index somewhere.

Andrew Atkinson:

Yep. So when we create indexes, we're really just creating a definition of an index. And when we do that initially, Postgres is going through and gathering up all that data for us and creating this separate data structure in its own separate file, that again could be accessed and put into memory, and then it's accessed ideally from memory. But we are creating that separate thing and there's a cost to that. So there is that write-time cost. And then there's that space cost.

And then there's also this other subtle one, where having those indexes in our memory, which is also a finite resource, means that something else can't be in there. What I'm talking about here is the buffer cache in Postgres. We have a finite set of memory for our regularly accessed or frequently accessed data. We don't want to have content in there that is not useful content, that's not based on our read patterns.

And so, yep.

Harry Tormey:

So if you have lots and lots of unused indexes, maybe this is a function of the ORM or product development that didn't go anywhere, would it be fair to say this can create a scenario where it slows down the entire database instance holistically? Is that kind of how it manifests from an application point of view?

Andrew Atkinson:

Well, it is tricky, because I think that's too broad of a stroke to paint. I would say that it's more like, you know, table-by-table kinds of operations, I guess. If you have a particular table that you're accessing frequently, that's also being inserted into heavily, or updated or deleted from heavily, that's where unused indexes are gonna add unnecessary IO and latency to those operations.

And the other way, that's a little less immediately a problem but can be a big problem, is, it's not unheard of, again, if you have, you know, 20, 30, 40, or even 50% of your indexes unused and you have a large database, let's say you've got a hundreds-of-gigabytes or terabyte-sized database, that means you also likely have hundreds of gigabytes of index content. And that's going to slow down your

backup and restore time, and your bills, yeah.

Harry Tormey:

And bills. Yeah, because can you briefly maybe segue a little bit in terms of for people who are application developers or maybe not looking at the bottom line, like how can that get really expensive with, say, AWS and the way that they do billing? Is there anything that you want to call out there?

Andrew Atkinson:

Yeah, for sure. I guess I've learned about this as I've worked more on infrastructure teams and stuff, but typically, the older way of doing it anyways, was you would provision your instance, and you choose your storage upfront and you start paying for that. I think that's fair to say. So you might say, well, we need a terabyte of space, and our database keeps growing and growing and growing, and oops, now we need two terabytes of space. So now we're going to start paying for that. There are some other cloud offerings that give you

price breaks up and down, and that actually can be really good too, as a reason to remove unused indexes. Let's say you can take off 300 gigabytes of space from what you actually need. I know, for example, I presented on this in the past, but with Aurora, you pay in these 10-gigabyte increments, so you can actually cut your space consumption bill down by cutting space out. So I think for folks that work solely on

the DB, those are some of the opportunities. There's also addressing bloat and rebuilding indexes and that sort of thing. But I think if you're more on the application side or product side, I kind of think about it more as like our infrastructure cost efficiency, you know, and we're spending money on our application servers, on our database servers, and there's a certain cost for that. Are we getting as much as we can from them?

You know, and if there's opportunities to perform some maintenance tasks, it might take a couple hours and then you get greater cost efficiency. We're getting more IO operations for less money or the same for less money. That to me is a win as far as your kind of your operating expenses.

Harry Tormey:

And just to kind of ask a follow-up question: before people go and just say, I'm going to kill all of these unused indexes, are there any best practices about retiring an unused index to avoid any disruption to users? Should you be doing this at off-peak hours? What are some potential foot guns there? Could you talk a little bit about that?

Andrew Atkinson:

Yeah, one of the main foot guns there, which I've hit myself, is if you drop an index on a table that's actively being written to. A lot of times with Postgres, we're working with locks and lock contention, whether it's the table being locked or row locks. So a drop index operation is going to take a certain lock that could then block other table-level operations, and you can then cause kind of like partial unavailability,

you know, or outages, not for the whole application, but for a portion of it. So Postgres helpfully gives us this option, since I think 14, of dropping an index concurrently. So we always want to do that for anything that's actually in use and busy: use this concurrently keyword. And conveniently, the concurrently keyword, I wrote about this in my book too,

I think there was even a title or a subtitle like "get in the habit of using concurrently as a default," you know, just anywhere you can use it. There's several different commands that support it. And kind of as a mindset, always think about, you know, if you have a running production application, that there's going to be contention, whether for tables or rows. And so how do we change things simultaneously with other things while minimizing the impacts

of locks and contention. So yeah, I'd say using concurrently is one. And then, yeah, typically, I'm a big fan of the pganalyze tool, which is a commercial SaaS tool. But the approach that it takes is it scans each day for index usage data that Postgres gives us. So we can just do that ourselves. Anyone can access their Postgres.

I think it's the pg_stat_user_indexes catalog. But you can see: are my indexes being scanned by my queries to help support read operations? And ones that aren't scanned at all are usually what we call unused indexes. What's cool about that, and what pganalyze does, though you can roll this by hand yourself, is if you just repeat that, then you kind of gain more confidence over time. You know, if across our entire application that's using this

relational database, whether it's user-level kind of API traffic, or background jobs, or cron jobs, or things like that, if 30 days pass, 60 days pass, etc., and there's never been one scan of an index, we can be pretty confident that it's not gonna ever be used. And we can always add it back if we need it. So I think kind of getting into that mindset of a hygiene practice of saying, you know, periodically we're gonna look for

unused indexes, we're going to remove those so that we know that we're efficient in terms of our write operations and our space consumption. I think it's a good practice.
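For reference, a sketch of the kind of check being described against pg_stat_user_indexes, plus dropping a confirmed-unused index with CONCURRENTLY from a migration. The index and table names are hypothetical, and the scan counters reset if statistics are reset or the instance is rebuilt, so confirm the observation window is long enough.

```ruby
# Find indexes that have never been scanned since statistics were last reset,
# largest first (those cost the most in write overhead and space).
unused = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT schemaname,
         relname                                      AS table_name,
         indexrelname                                 AS index_name,
         idx_scan,
         pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_stat_user_indexes
  WHERE idx_scan = 0
  ORDER BY pg_relation_size(indexrelid) DESC;
SQL
unused.each { |row| puts row.values.join(" | ") }

# Dropping one without blocking writes: CONCURRENTLY can't run inside a
# transaction, so the migration disables Rails' DDL transaction wrapper.
class DropUnusedIndexOnComments < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def change
    remove_index :comments, name: "index_comments_on_legacy_flag", algorithm: :concurrently
  end
end
```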

Harry Tormey:

That's really good, you know, some best practices and tips. I suppose we've come to a point in the interview where I want to ask you a little bit more about Postgres internals, specifically the query planner and how Postgres executes queries. Because I always found that, even if I'm not a DBA, or even if I'm not a back-end person, just understanding how a system works helps me reason about how I build

my applications to leverage that system. So can you give us a little bit of an overview of how the query planner works in Postgres, and maybe some things to be mindful of from that perspective, for someone who's an application developer who's maybe not going to be in the weeds the way you've been?

Andrew Atkinson:

Yeah, for sure. I think the way you describe that is a great idea and a great suggestion. I think if you have an understanding of what Postgres is doing when your query is received by the planner, and how it's gonna kind of break it down, then it can actually inform how you write your interaction types of code, whether it's writing or reading data. And so, yeah, I think there's four phases,

but to sort of oversimplify it: your SQL query is generated, let's say it's generated by your ORM, it's sent to Postgres, and then it's planned, unless it's a prepared statement, but we'll leave that off the table for a moment. The planner, also called the optimizer in Postgres land anyways, what it's gonna do is:

you're writing declarative SQL, you're saying, I need to access these things, this is what I want my result to look like. So it's gonna be this set of fields from one or more tables. And then, because your data is probably normalized, you're gonna access multiple tables and multiple fields. You might have primary key, foreign key joins, and that sort of thing. So Postgres is gonna break all that down, parse that query, and say, what are the things I need to access?

And then it's gonna say, how can I do that at the lowest cost? So it makes a cost -based runtime decision. So I might say that, let's say you have a users table and you have a notifications table, just making these up. But it might say that for notifications, we wanna access the last notified at column and we wanna access the user ID column. I wanna find this user's notifications.

So then Postgres might say, well, on that table, then do I have any indexes that could support this query? Because I know I can access those at a lower cost. If I don't, I'm going to do a sequential scan on the table. So I'm going to read the whole table. And the planner creates this tree representation of what it's going to do, which is the query plan. The cool thing in Postgres is we get a lot of really rich detail that we can look at as to what any query would look like.

Andrew Atkinson:

how it's gonna be planned using the explain keyword.

Harry Tormey:

Is this in the logs or where does this detail show up, Andrew?

Andrew Atkinson:

This would be if you have the source SQL query, this would be using the explain keyword in front of the query, which you could do manually.
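Concretely, you can run it straight from psql, or from an Active Record relation, which has had an explain method for a long time. A small sketch; the model is hypothetical, and passing options through to EXPLAIN is a newer Rails 7.1+ capability.

```ruby
# Prints the plan Postgres would use for the SQL this relation generates.
puts User.where(email: "someone@example.com").explain

# On Rails 7.1+, options are forwarded to EXPLAIN, so you can also see what
# actually happened, including buffer (page) access counts:
puts User.where(email: "someone@example.com").explain(:analyze, :buffers)

# Equivalent raw SQL, which works anywhere:
#   EXPLAIN (ANALYZE, BUFFERS)
#   SELECT "users".* FROM "users" WHERE "users"."email" = 'someone@example.com';
```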

Harry Tormey:

so it shows you like the reason why it's executing this query. It gives you more context if you use that explain keyword.

Andrew Atkinson:

Yep, yeah, it shows you the execution plan based on everything it knows about at that moment that it plans to do. It could still take a different path, but what we can do is we can also see both prior to the actual query executing using explain, or we can then add the analyze keyword. So it's explain, analyze, then your SQL query. That's gonna show us afterwards what it actually did.

And what's cool about that too is that it intermixes the, or it still retains the estimated plan and the actual plan. And you can see then sometimes interesting data, like what Postgres thought the number of rows were in a table versus what the actual number of rows are. Sometimes that can be off quite a bit. And then that indicates.

Harry Tormey:

Mm -hmm.

Harry Tormey:

Why would that be off, Andrew? What would cause that to be off?

Andrew Atkinson:

Yeah, it's a good question. That is a category of poor-performing queries that's gonna be more related to, sort of, your database operational health. The first thing I would look at there is the statistics collection that Postgres does for the data within the table, which is a critical part of its query planning. It's really critical that that's as up to date and as fresh as possible,

and refreshed really regularly. And what can happen sometimes is that slows down and degrades.

That can happen if there's a very heavy amount of concurrent modifications to row data and content within the table; then the process that normally captures those statistics will just simply slow down. And so it could be that what you need to do in that case is give it more resources. But what you wanna do to kind of immediately rectify that problem is use the analyze command, for example, on a table, which will refresh the statistics Postgres captures about the row data.

Because, again, the row data... and this is something I talked about in the book too, this is something that I think is pretty far removed from a typical application developer's kind of brain space or mind space. Not only are we working at an abstraction layer, ultimately generating SQL queries to send to the database, the database itself is also in charge of accessing data, and it's relying on statistics about the data.

Harry Tormey:

So if its data source is falling behind what the actual state of the world is, its planning is going to be deficient because it's acting on faulty data. That's kind of a signal that allows you to...

Andrew Atkinson:

Exactly. Yeah, just like just like you or I would if we were acting on faulty data and making a selection We're at two doors. We can go down door one or door two you know, and and if our goal was to You know find some kind I'm really going off on a weird example but if we wanted to if one of those doors had the target that we were looking for but we were You know, we had misestimated information about those the contents of those rooms then we would make the wrong plan choice and

A lot of times that is the circumstances folks start from is why is the query planner making or just my query slow? Then they might determine, this plan selection was odd. It didn't use my index. I thought it would. And then sometimes the actual root issue could be bad planner statistics. Postgres also gives us some options to. Postgres can't keep track of every single row and every single column. Well. Like.

Let me rephrase that. Postgres keeps track of samples of rows in a table, and it keeps statistics about things like the distribution of data within a column. It's trying to make a trade-off, just like we would when making any sort of decision. You can't know everything all the time, right? That's how I would maybe oversimplify it.

So it's trying to collect enough information to be relevant to its decision making. And sometimes, as the operators, we can feed it a little more information. We can say, for this table, I actually want you to collect more samples. Or for a particular set of columns, I want to express a relationship that it might not know about between two columns.
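
A hedged sketch of those two options, using a hypothetical orders table with status, city, and zip_code columns: raising the per-column sample size, and creating extended statistics so the planner knows two columns are correlated.

```sql
-- Collect a larger sample for a skewed column (the default statistics target is 100).
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;

-- Tell the planner about a relationship between two columns it would
-- otherwise treat as independent (extended statistics, PostgreSQL 10+).
CREATE STATISTICS orders_city_zip_stats (dependencies)
  ON city, zip_code FROM orders;

-- New or updated statistics objects take effect after the next ANALYZE.
ANALYZE orders;
```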

Harry Tormey:
The way you're describing the query planner has a lot of value for someone who's tuning configuration variables around Postgres. Has your knowledge of the query planner and how it interacts with Postgres ever caused you to advise application-level changes? If you're an application developer, maybe not someone who's going in there playing with configurations, does this knowledge allow you to write better SQL or figure out what's going on with the ORM? Could you talk a little bit about whether that's the case?

Andrew Atkinson:
Yeah, the most beneficial, maybe coarse-grained or 80-20, way I would answer that question is: understand what is being accessed from your table data. If you're doing a read query, make sure there's an index supporting that read query. The way you do that is you look at an EXPLAIN plan and see that an index is not being used, and then once you add the index, the EXPLAIN plan will show that the index was used. You access the same data, but much more efficiently, through an ordered data structure.

And something we haven't really gotten into, but Postgres also shows you its own storage unit, something called buffers. Your row data lives in data pages, and those pages can be pulled into memory as buffers. I'm probably getting into too much detail here, but the biggest thing is basically learning to read the EXPLAIN plans for the queries you're generating.

Then on the application side, it's about getting into a mindset of: okay, we're writing code here that's filtering on rows, and we have kind of unbounded access to this table. We don't have any restrictions like a date range restriction or a row count limit. Or I have this query that joins five different tables, but there's also some denormalization in effect, like maybe we have a tenant identifier column on all of them, so maybe we don't actually need to join five tables, we just need to join two.

You can see the costs of each of those things I just described in the query plan as you make variations, whether it's unbounded access versus restricted access, or the costs of those joins and how Postgres is choosing to merge that data together. And I think that informs having that sort of mindset on the application side of trying to write things as efficiently as possible from the outset.
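
As a rough sketch of that mindset, assuming a hypothetical events table (roughly the SQL an ORM such as ActiveRecord might generate), a bounded query backed by a matching index can be compared against the unbounded version using EXPLAIN with the ANALYZE and BUFFERS options:

```sql
-- Unbounded: no date range or row limit, so far more data is scanned
-- than the page actually needs.
SELECT * FROM events WHERE account_id = 42 ORDER BY created_at DESC;

-- Bounded: a date range and a row limit, supported by a matching index.
CREATE INDEX idx_events_account_created_at ON events (account_id, created_at);

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events
WHERE account_id = 42
  AND created_at > now() - interval '7 days'
ORDER BY created_at DESC
LIMIT 50;
```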

Harry Tormey:
These are really great debugging tips, because you've given us the mental model of: if you've really iterated on this model and you're not using a bunch of indexes, you should probably get rid of them. But you've also given us another valuable tool, which is a mental model of how Postgres is using the SQL generated from the ORM and whether it's actually behaving the way you expect it to. Is it using an index? If not, why is that? Is the row count off? Do you need to change something about the configuration? These are excellent pieces of advice.

So Andrew, you have a shiny new Rails book out, and I want to be mindful of your time because we're over an hour here and this has been awesome stuff. Pitch the book and tell us what's in it and how you approached it, because it sounds like a mine of information that would be really useful for Ruby devs dealing with Postgres.

Andrew Atkinson:
Thanks, yeah, the book is called High Performance PostgreSQL for Rails. Actually, I have one right next to me, right here, so I can pull it up. Yeah, it's available now.

Harry Tormey:
Ooh, that looks good. And who published it?

Andrew Atkinson:
The Pragmatic Programmers. I was really excited to work with them because I've personally purchased a lot of PragProg books throughout my career, and I think of them as one of the publishers of the top books in the Ruby world in particular. There's another PragProg book, SQL Antipatterns, that I'm familiar with and that was somewhat of an inspiration to work with them as well. Yeah, the way I describe the book is that it's actually about 75% a Postgres book, maybe 25% Ruby on Rails.

And I've had some folks tell me they're finding it useful even though they don't work with Rails and never have, but they build web applications with other similar MVC monolithic frameworks, like PHP and Laravel or Node and Prisma or something like that. The real purpose of the book is basically to go into more breadth and depth on some of the things we've been talking about today: I have an application that's working with Postgres,

and I'm learning more about how it works and what I can exploit, what the painful areas are, how to keep things healthy, and maybe some re-architecture too as I move into more advanced features like table partitioning. That's all in the book. It's 15 chapters, written in a way you could read it from front to back, with a narrative style.

At the end, I'm also proud of a field guide section that we added in pretty late based on feedback from early readers. It's the opposite approach, where you can just start from problems: if you have certain problems you've identified, you can look in the field guide and see if there are any hits on them, and then it gives you references back into the book to go and read about a certain topic.

Harry Tormey:
This book sounds absolutely amazing, Andrew, and I just want to say thank you so much for your time. I really appreciated talking to you, and I think anyone watching this will have a few handy tips to take back to their Rails, or other Postgres-with-an-ORM, application setup. So thank you so much. I'm going to stop the interview here.
