Introduction
In this post, I will share the insights I gained and the steps I followed to optimize the performance of a PostgreSQL query on a large table by creating a partial index.
I was working on improving the speed of a query while keeping my database lightweight and efficient, and I found partial indexing to be an effective solution.
Use case
Let’s assume we are running a messaging app where
users
can send messages to each others. On some occasions, our AI detection will flag some flags areabusive
and our support team need to review these messages. Support team complains the list of abusive message they need to review takes a lot of time to load.
Methodology
My approach to resolve this issue is the same it would be like fixing any performance bug:
- Confirm there is a issue - measure
- re-produce the problem in a consistent way on my local machine
- analyse, experiment, and implement
- deploy
- Check the benefit - measure.
The greater difficulty in our journey will lie in step 2. Fixing a performance issue without measuring is a mistake: often, your intuition will be wrong, and you will optimise the wrong thing.
So lets go !
Confirm
Confirming there was a problem was pretty easy. By navigating to the page using the query, I could easily verify that a loader is displayed for 20 seconds.
And a quick look at our logs and I could already see that most of the time was spend in sql.
At this point, I decided to add an extra log to production, to narrow down to the exact problematic query. This way, we can verify later that whatever optimization we came up with actually had some effect.
Finding a way to make the bug appear in local
The simplest and most effective approach would have been to copy the production data to my disk and work from that. However, the messages
table is approximately 1TB, with several indexes exceeding 100GB each.
Using the production data wasn’t feasible, I had to create some sample data.
Setting up my local database
In this section I will setup a local postgres and generate some random data using a sql query. I need to generate enough data to have the issue arise.
- setup a test db
docker run --platform linux/arm64 \
--name postgres17 \
-e POSTGRES_USER=postgres \
-e POSTGRES_DB=partial_blog \
-e POSTGRES_HOST_AUTH_METHOD=trust \
-p 5417:5432 \
-d postgres:17
- create the messages table and some indexes
DROP TABLE IF EXISTS messages ;
CREATE TABLE messages (
id SERIAL PRIMARY KEY,
created_at TIMESTAMP NOT NULL,
updated_at TIMESTAMP NOT NULL,
author_id INTEGER NOT NULL,
content TEXT NOT NULL,
is_abusive BOOLEAN NOT NULL, -- Whether the message is abusive
is_spam BOOLEAN NOT NULL,
is_reviewed BOOLEAN NOT NULL, -- Whether the message has been reviewed by support team
is_archived BOOLEAN NOT NULL
);
CREATE INDEX idx_messages_created_at ON messages (created_at);
CREATE INDEX idx_messages_author_created_at ON messages (author_id, created_at);
CREATE INDEX idx_messages_abusive_reviewed ON messages (is_abusive, is_spam, is_reviewed, is_archived);
Now that my postgres db is up an running, it’s time to populate my messages
table with some data. To do that we will use a sql script, that will generate 10 millions messages.
I decided to generate messages with the following spread :
dataset configuration #1
is_abusive
: 1 in 10,000
is_spam
: 1 in 100,000
is_archived
: 20%
is_reviewed
: 50% of those needing review
sql
TRUNCATE messages; WITH existing_max_id AS ( SELECT coalesce(max(id), 0) AS max_id FROM messages ), random_data AS ( SELECT (existing_max_id.max_id + row_number() OVER ()) AS id, now() - INTERVAL '2 months' + (gs * INTERVAL '40 seconds') AS created_at, (random() < 0.0001) AS is_abusive, (random() < 0.00001) AS is_spam, (random() < 0.2) AS is_archived, 'Generated content: ' || substring(md5(random()::text) || md5(random()::text) FROM 1 FOR (10 + (random() * 50)::int)) AS content -- Random content size FROM generate_series(1, 10000000) AS gs, existing_max_id) INSERT INTO messages (id, created_at, updated_at, author_id, content, is_abusive, is_spam, is_reviewed, is_archived) SELECT id, created_at, created_at AS updated_at, id % 100 + 1 AS author_id, -- Example: author_id based on id (modify as needed) content, is_abusive, is_spam, CASE WHEN (is_abusive OR is_spam) THEN (random() < 0.5) -- 50% of abusive have been reviewed ELSE FALSE -- Always false if not abusive END AS is_reviewed, is_archived FROM random_data;
On my recent laptop it takes about a minute to populate.
- Populate with some data: 1% of abusive messages
My first approach was pretty naive:
WITH existing_max_id AS (
SELECT COALESCE(MAX(id), 0) AS max_id FROM messages
),
random_data AS (
SELECT
(existing_max_id.max_id + row_number() OVER ()) AS id,
NOW() - INTERVAL '2 months' + (gs * INTERVAL '40 seconds') AS created_at,
(random() < 0.01) AS abusive, -- 1% chance of being abusive
CASE
WHEN (random() < 0.01) THEN (random() < 0.5) -- 50% of abusive have been reviewed
ELSE FALSE -- Always false if not abusive
END AS reviewed,
'Generated content: ' || substring(md5(random()::text) || md5(random()::text) FROM 1 FOR (10 + (random() * 50)::int)) AS content -- Random content size
FROM generate_series(1, 10000000) AS gs,
existing_max_id
)
INSERT INTO messages (
id,
created_at,
updated_at,
author_id,
content,
abusive,
reviewed
)
SELECT
id,
created_at,
created_at AS updated_at,
id % 100 + 1 AS author_id, -- Example: author_id based on id (modify as needed)
content,
abusive,
reviewed
FROM random_data;
When we to try to retrieve the abusive which hasn’t been reviewed yet, we notice the query is pretty fast: 30ms. Let’s find out what is going on here, by explaining the query plan. It appears that postgres only had to examine only 189k rows.
That’s because my created data has “too many” errors row, postgres will find the 100 rows very quickly, and doesn’t need to examine too much data.
dataset configuration #2
So I decided to make the data a lot more sparse, and went for 1 abusive message every 1 million (instead of 1 every 100).
And now, running the query takes a lot more time: around 250ms. And remember, my test database is around 2GB when my production database is 1TB.
- Query plan when there is 1 abusive message every 1 million and 20 millions messages ⇒ 400ms
Limit (cost=371054.44..371054.45 rows=1 width=80) (actual time=368.576..370.384 rows=5 loops=1)
Output: id, created_at, updated_at, author_id, content, abusive, reviewed
Buffers: shared hit=192 read=286529
-> Sort (cost=371054.44..371054.45 rows=1 width=80) (actual time=363.242..365.050 rows=5 loops=1)
Output: id, created_at, updated_at, author_id, content, abusive, reviewed
Sort Key: messages.created_at DESC
Sort Method: quicksort Memory: 26kB
Buffers: shared hit=192 read=286529
-> Gather (cost=1000.00..371054.43 rows=1 width=80) (actual time=111.675..365.023 rows=8 loops=1)
Output: id, created_at, updated_at, author_id, content, abusive, reviewed
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=192 read=286529
-> Parallel Seq Scan on public.messages (cost=0.00..370054.33 rows=1 width=80) (actual time=135.916..342.357 rows=3 loops=3)
Output: id, created_at, updated_at, author_id, content, abusive, reviewed
Filter: (messages.abusive AND (NOT messages.reviewed))
Rows Removed by Filter: 6666664
Buffers: shared hit=192 read=286529
Worker 0: actual time=3.828..332.174 rows=4 loops=1
JIT:
Functions: 2
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.143 ms (Deform 0.099 ms), Inlining 0.000 ms, Optimization 0.165 ms, Emission 2.848 ms, Total 3.157 ms
Buffers: shared hit=65 read=102685
Worker 1: actual time=292.653..332.188 rows=1 loops=1
JIT:
Functions: 2
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.134 ms (Deform 0.079 ms), Inlining 0.000 ms, Optimization 0.172 ms, Emission 2.849 ms, Total 3.154 ms
Buffers: shared hit=64 read=102368
Planning Time: 0.177 ms
JIT:
Functions: 7
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.820 ms (Deform 0.448 ms), Inlining 0.000 ms, Optimization 0.701 ms, Emission 10.675 ms, Total 12.196 ms
Execution Time: 371.023 ms
We reach our goal: generating a dataset that will let the performance issue arise.
Experiment - analyse
In the previous section, we noticed that data scarcity has a significant impact on performance. But what causes this? What’s going on?
In both scenarios, the query planner is not using any index, leading it to examine each row sequentially until it identifies 100 rows that meet our filter criteria. Once it finds those 100 rows, it stops and returns the results.
In the first scenario, where the sample data contains a high number of abusive = TRUE
rows, PostgreSQL can still locate the relevant rows relatively quickly, albeit inefficiently.
On the opposite, in the second scenario, where there is only one matching row for every million, PostgreSQL must scan nearly the entire database to retrieve just five rows, which takes considerably more time.
Before moving on, I want to copy the database we obtained earlier, especially since we’ll be experimenting with creating indexes and other modifications. This approach allows me to establish a clear reference for my tests: one database reflecting the existing issues and another focused on performance optimisation.
Experiment 1: add every booleans in an index
A first idea would be to create missing indexes. In my experience, many performance issues occurring in small to medium sized team are missing indexes. So the first thing I would look for is a column missing an index.
In our case, every column appearing in the filter clause has it’s own index. But what if we had an index covering all the columns ? Let’s try that !
Result: it didn’t help: boolean column cardinality is very low, and index are not optimised for low cardinality. Query planner will not choose this index
TODO: Check the result. Check the disk usage.
Experiment 2: partial index
A partial index is a specialised index that will only index some rows matching a filter.
CREATE INDEX messages_needing_review_idx
ON messages (id, created_at)
WHERE is_abusive = TRUE or is_spam = TRUE;
This index focuses only on rows which are is_abusive = TRUE
or is_spam = TRUE
The filtering is performed at “write time” when new rows are inserted / updated / deleted.
Now , the query is executed in 5ms
which is 30x faster than the initial query.
It look like we have a serious candidate for our performance issue.
-> Index Scan using messages_needing_review_idx on messages (cost=0.28..1798.15 rows=802 width=83)
Implementation in rails
Now that we have found an index definition that helped us improving our query, let’s add it to our production database. It’s pretty easy, rails migration DSL can already handle partial indexes :
add_index :messages, [ :id, :created_at ],
where: "is_abusive = TRUE or is_spam = TRUE",
name: "messages_needing_review_idx",
algorithm: :concurrently
The all magic happens in the where
clause. The index will be written to only when the filter condition is true.
Measure & consolidate
Running the query in production confirmed our issue was fixed. Using the measure we implemented in the first section was a good confirmation your partial index helped.
Partial indexes can be delicate; if the filter condition changes, the query planner might not be able to utilize them effectively. Consequently, this could lead to the same issue resurfacing in the future. I want to ensure that future developers, using this query will consistently hit the index.
I added a spec checking if the query planner keeps using the partial index in the future. Adding a postgres view may be another step torward the coupling of the index with the query.
it "uses the partial index" do
expect(query.explain).to include('Index Scan Backward using messages_needing_review_idx on messages')
end
Conclusion
In summary, creating a partial index on my messages
table significantly improved query performance.
Keep in mind that adding an index will add write overhead (otherwise we would add indexes for every single query).