Tuning Postgres: Vacuuming & Analyze

Puneet Khanna
5 min read · Jun 7, 2021

This post can help you improve your database performance significantly in production. Some problems are not obvious; they are common across platforms and tech stacks, but they are not triaged or unearthed until you see a production-down situation or users report varying/slow responses.

The performance of an insert/update/delete or a select is often said to be directly proportional to the size of the dataset/table, but that is an incomplete statement: it also depends on the frequency of updates you performed on the table in the last few minutes or hours, even if the row count remains the same.

As you may have witnessed (which is why you are reading this), no amount of tuning or re-architecture on the backend layer is enough unless the database settings are tuned to handle a high volume of updates to the data. In other words, if your database, or even a “large” cache, doesn't scale, your application doesn't scale. Today we will look at a setting that is the first step in tuning the database.

What happens when the system has too many updates to the database, say in the case of Paytm/GooglePay?

Whenever an update operation happens on a row, the new data (version) is written to disk (not replacing the old row), and the database is told to refer to the new location on disk, without immediately deleting the old row version. Something similar happens in Elasticsearch and Cassandra; see https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
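A quick way to see this in Postgres is to watch the physical location (ctid) of a row change after an update; the table and column names here are hypothetical, purely for illustration:

-- ctid is the physical location (page, offset) of the current row version
SELECT ctid, xmin, balance FROM accounts WHERE id = 1;
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
-- the same logical row now shows a new ctid: a new version was written elsewhere,
-- and the old version stays on disk as a dead tuple until vacuum cleans it up
SELECT ctid, xmin, balance FROM accounts WHERE id = 1;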

If you want to relate this to your JVM and have not had the opportunity to look at Cassandra in practice, specifically if you are a backend developer and not a database expert, think of it as a String update.

String reference = "this is an example";

when we say

reference = reference + " delta";

the new object "this is an example delta" is created on the heap; the old object still exists, and GC will reclaim its space later on.

In the DB world, the old location on disk that still holds the bytes is called a dead tuple. Its space is not reclaimed by the OS immediately; this is done to avoid wasting resources on cleaning up data right after every update, and it gives better performance for new writes to the database. The activity that cleans up dead tuples is called vacuuming.
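To make this concrete, you can peek at the dead-tuple counters and then reclaim them by hand; VACUUM VERBOSE prints how many dead row versions it removed (the table name orders is just an assumption for illustration):

-- how many dead row versions has this table accumulated?
SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = 'orders';
-- clean them up manually and report what was removed
VACUUM VERBOSE orders;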

Why is it important to clean up the dead tuples? Is it all about disk space?

One could say, “Disk is cheap; we can even have a job to clean these dead tuples every day at midnight when there are no users, because I don't want CPU wasted on this cleanup work during the day in festive seasons when the user load is high.”

But the problem is bigger than disk space, and it is easy to miss if we don't know how the query planner works.

From the official document:

PostgreSQL devises a query plan for each query it receives. Choosing the right plan to match the query structure and the properties of the data is absolutely critical for good performance, so the system includes a complex planner that tries to choose good plans. You can use the EXPLAIN command to see what query plan the planner creates for any query. The query planner needs to estimate the number of rows retrieved by a query in order to make good choices of query plans.

When you run EXPLAIN ANALYZE {query},

the output might include a line like the one below if the query planner could not find the most optimized path to execute the query, for any of a number of reasons.

The line below simply means that the path the query planner chose was estimated to fetch 70 rows, but at runtime it actually found 7000:

(cost=0.00..483.00 rows=70 width=244) (actual time=0.16..xxxx rows=7000 loops=1)
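For reference, a minimal way to produce output like this yourself, with a hypothetical table and filter:

EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'PENDING';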

Why does the query planner estimate so badly?

There are multiple reasons, but we will cover the one related to vacuuming.

We will look at the statistics that the Postgres server uses to estimate the query path at runtime.

If we look at https://www.postgresql.org/docs/9.5/catalog-pg-class.html to understand pg_class, and at the official planner-statistics document https://www.postgresql.org/docs/9.5/planner-stats.html, we can see that the query planner uses metadata about the table, relations between tables, indexes, and the uniqueness/cardinality of columns to estimate the query path. The following example is pulled from the official document for presentation:

SELECT relname, relkind, reltuples, relpages
FROM pg_class
WHERE relname LIKE 'tenk1%';

       relname        | relkind | reltuples | relpages
----------------------+---------+-----------+----------
 tenk1                | r       |     10000 |      358
 tenk1_hundred        | i       |     10000 |       30
 tenk1_thous_tenthous | i       |     10000 |       30
 tenk1_unique1        | i       |     10000 |       30
 tenk1_unique2        | i       |     10000 |       30
(5 rows)

Here we can see that tenk1 contains 10000 rows, as do its indexes, but the indexes are (unsurprisingly) much smaller than the table.

For efficiency reasons, reltuples and relpages are not updated on-the-fly, and so they usually contain somewhat out-of-date values. They are updated by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX. A VACUUM or ANALYZE operation that does not scan the entire table (which is commonly the case) will incrementally update the reltuples count on the basis of the part of the table it did scan, resulting in an approximate value. In any case, the planner will scale the values it finds in pg_class to match the current physical table size, thus obtaining a closer approximation.

It is important to have accurate, up-to-date statistics. The query planner uses statistical information about the contents of tables in order to generate good plans for queries. These statistics are generated by the ANALYZE command, but with a lagging autovacuum process these updates can fall behind, and outdated information gets handed to the query planner, resulting in very bad estimates.

The auto-vacuum daemon will automatically issue ANALYZE commands whenever the content of a table has changed sufficiently.
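If you suspect the statistics are stale, you can refresh them manually while you work on autovacuum tuning (the table name orders is again just an assumption):

-- refresh planner statistics for one table
ANALYZE orders;
-- or reclaim dead tuples and refresh statistics in one pass
VACUUM (ANALYZE) orders;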

How do you know whether vacuuming in your case is lagging behind? Run the following query; if dead tuples are more than about 20% of live tuples, consider fine-tuning vacuuming, for example by increasing the number of autovacuum workers, for better query performance.

SELECT relname AS ObjectName,
       pg_stat_get_live_tuples(c.oid) AS LiveTuples,
       pg_stat_get_dead_tuples(c.oid) AS DeadTuples
FROM pg_class c;
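A variant of this, sketched here against the pg_stat_user_tables statistics view, computes the dead-to-live ratio directly so that lagging tables float to the top:

SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables
ORDER BY dead_pct DESC NULLS LAST;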

The activity that refreshes these statistics is called ANALYZE.

The autovacuum process has to be fine-tuned to keep up with the churn users generate on the tables. Autovacuum = Vacuuming + Analyze.
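As a sketch of what that fine-tuning can look like, autovacuum can be made more aggressive per table via storage parameters, or given more workers globally; the values below are illustrative assumptions, not recommendations:

-- per table: trigger vacuum/analyze after a smaller fraction of the rows has churned
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.05,
  autovacuum_analyze_scale_factor = 0.02
);
-- globally, in postgresql.conf (autovacuum_max_workers needs a restart):
-- autovacuum_max_workers = 6
-- autovacuum_naptime = 30s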

But wait: even if vacuuming is able to keep up with the churn, there are other reasons it can be delayed endlessly. I will cover those in another story.
