Boost the speed of your system
Notice your application slows to a crawl as you get more data in your system? You’re not alone. While we’ve all heard the warnings about premature optimization, at some point, you’re going to need to spend some time working out how to improve the performance of your system.
We recently ran into such a situation at Sky Ledge. A simple query to retrieve time series data took 30 seconds to run in our staging environment. These were simple queries with indexes on the columns of interest. We’d expect the requests to complete in <1 second, not 30 seconds.
We pride ourselves on a good user experience. Waiting 30 seconds for a graph to render makes for very poor UX, so it was time for sleuthing and database optimization.
Sky Ledge is a platform that helps users quickly create control room experiences to gain insights into their operations.
One of the foundational parts of Sky Ledge is our Asset API, which makes it very easy to track and visualize data from objects in the real world. Assets are entities within Sky Ledge that represent a real-world object (eg a vehicle). Metrics are data streams associated with an asset. Examples of metrics are speed, elevation, temperature etc. Each metric tracks a single value against the asset with an associated timestamp.
Our metrics database looks as follows (we use AWS RDS, running Postgres):
create table asset_metric
asset_id. uuid not null,
value double precision
We’ve also got an index on the
create index on asset_metric (timestamp, asset_id, name);
The problem happened when I was creating a demo for a client in our staging environment. As part of the demo, I displayed temperature data from an asset within the Sky Ledge platform.
To retrieve the temperature data, we make the following query to the database:
select * from asset_metric
asset_id = 'abc123' and metric_name = 'speed'
order by timestamp desc
This query was consistently taking 30+ seconds to complete. It’s a simple query, and I knew we had an index on this table. The cause wasn’t immediately apparent.
Connecting to the Database
The first step was to connect to the database. We take security very seriously, so both our staging and production databases are locked down within a private VPC on AWS, with no access to the public internet.
I set up an SSH jump box to connect to the staging database. See here for more details on how to do so (I wouldn’t recommend this approach for a production DB).
What’s our data look like?
Next, I wanted to explore whether the data distribution might be contributing to the problem. We had other demos on staging that were performing without issues, so the slowdown was limited to specific assets.
There are 7,000,000 rows in the
asset_metrics database. While not tiny, I’d expect better performance than what we’re seeing. The 7 million rows are spread between 106 different assets. There are also 710 asset / metric pairs (speed, temperature, elevation etc.).
Unique assets: 106
Asset / metric pairs: 710
I next ran a couple of queries to understand the data distribution. 84% of the data is associated with just ten assets in our staging environment. The dataset is very highly skewed.
Testing Different Asset / Metric Pairs
Instead of testing the query on a single asset / metric pair, I compiled a list of 9 asset / metric pairs. This would give a good indication of performance across the board. These are:
| Asset | Metric | Rows |
| 1 | speed | 169000 |
| 2 | speed | 100000 |
| 3 | speed | 75000 |
| 4 | speed | 20000 |
| 5 | speed | 10000 |
| 6 | speed | 5000 |
| 7 | speed | 2000 |
| 8 | temperature | 700 |
| 9 | speed | 400 |
We have a good spread from metrics that update very frequently to those that update very infrequently. This will help inform if the issue happens across the board or only in certain circumstances.
How I measured query performance
As I’m connecting to the database via the jump box, network latencies would affect round trip query times. Instead, I used
explain analyse to measure the time it took for the DB to execute the query. This allowed me to measure performance directly in the database without worrying about the network’s variability.
The Current Query
I ran the queries on each of the nine asset / metric pairs to get a baseline.
The results surprised me. In 8 of the 9 cases, the query planner used a backwards index scan. I had assumed it would be a sequential scan due to the time taken. Intuition is great, but it pays to confirm your intuition. Instead of wasting my time going down a rabbit hole of why the index wasn’t used, I now knew that the issue was the index’s performance.
The only time it used a sequential scan was for the temperature metric. There are orders of magnitude more speed than temperature metrics, so that’s likely the reason.
It’s already using an index scan, and Index Scans are Fast™so why the slow results for all but the most frequent metrics?
Un-reversing the Index
The first observation is that the database was performing a backwards index scan. Backward scans can be slower than a standard index scan for multicolumn indexes (which is what we have).
Given that we almost always query the latest data (with
order by timestamp desc ), it seems logical to order our index with timestamp descending. So, I created a new index:
create index on asset_metric(timestamp desc, asset_id, metric_name)
and reran the queries.
No change for the temperature metric (still a sequential scan) but progress everywhere else! We’ve substantially improved the query performance for the other slow asset/metric pairs by changing to a forward index scan…
…but it’s still terrible. Four seconds is better than 25 seconds but still way too slow for our needs. We’re all about building great experiences for our customers, and having to wait seconds for a simple graph to load is not that.
Simplifying the Index
Maybe the index is too complicated and causing issues for the query planner? I tested that assumption by creating a new index, using only the metric name and asset id:
create index on asset_metric (metric_name, asset_id)
I figured this would make it easy to find the relevant data for each asset / metric pair.
And sure enough, the results looked great! We’d gone down from 4 seconds to less than ten milliseconds, a 400x improvement!
Except for the one asset / metric pair that still took 11 seconds to return the query. I ran the query multiple times to make sure it wasn’t a weird quirk, but it consistently took 10+ seconds.
The query planner was still using the original index for the more frequent asset/metric pairs (I had removed the un-reversed index by this point so that I was only testing one variable at a time).
This wasn’t the case for the less frequent pairs. Here, it used a Bitmap Index Scan and combined both the original index and the new index.
Is there a way to eliminate that annoying edge case and get excellent performance across the board?
Improved Index — Changing Column Orders
The previous index gave a vital clue — it was beneficial to search by the asset id and metric name for less frequent pairs.
Does column ordering matter with multicolumn indexes? It does.
The canonical example is that of a phonebook. Much like a phonebook is sorted first by the last name, then by the first name. Multicolumn indexes are sorted by the first column, second column, third column, etc.
Like you can’t look up somebody by their first name alone in a phonebook, neither can you use a multicolumn index to look for a row by its second or third columns.
We’re always looking for metric data by considering the asset id and metric name; It seems logical that these should be the first columns in the index. I created the following index:
create index on asset_metric (metric_name, asset_id, timestamp desc)
Here, we both change the order of the columns to ensure the index is sorted by
asset_id. We also change
timestamp to be descending, as per our results previously. Let’s see how this performs:
Nice! Excellent performance across the board. All queries ran consistently in less than two milliseconds.
We managed to improve our queries by 15,000x in the worst case! There’s nothing revolutionary about the steps we took above. It’s a typical example of the type of DB optimizations you have to make in the real world. Modern databases can handle incredible amounts of data.
A basic understanding of how databases work can make a world of difference. Instead of choking at 7 million rows, our table should now be able to scale to 100s of millions of rows for this use case. It took me no more than 2 hours to find the bottleneck and fix it. All it took was a simple tweak to our index.