Our typical PostgreSQL shard consists of master and two replics. We monitor that master has as much as needed number of replics (we fire WARN event in monitoring if there is only one alive replica and CRIT if there are no alive replics). And we monitor replication lag, replay_location of the replica. All this is done with a couple of easy queries to pg_stat_replication.

This method has two great disadvantages:

  • Most of the data from pg_stat_replication could be taken only by users with SUPERUSER option. Giving such option to monitoring user is not really good idea.
  • We have different threasholds for replication lag because 10 MB of replication lag on cluster with 1 MB/s of writing load and on cluster with 100 MB/s are not the same.

To solve both problems we have written bgworker, sources for which could be taken here.

The princile of operation is really simple — bgworker once in a while (which could be configured with an accuracy of 1 ms) writes in some table (repl_mon by default, but it can be configured) next things:

pgtest02g/postgres M # \dS+ repl_mon
                                 Table "public.repl_mon"
  Column  |           Type           | Modifiers | Storage  | Stats target | Description
----------+--------------------------+-----------+----------+--------------+-------------
 ts       | timestamp with time zone |           | plain    |              |
 location | text                     |           | extended |              |
 replics  | integer                  |           | plain    |              |

pgtest02g/postgres M # select * from repl_mon ;
              ts               |  location  | replics
-------------------------------+------------+---------
 2015-06-14 15:35:51.632041+03 | 0/1E04E568 |       2
(1 row)

Time: 0.664 ms
pgtest02g/postgres M #

Query for getting data could be seen here.

Number of alive replics could be taken directly from this table on master. And on replics values of fields ts and location could be compared with current time and pg_last_xlog_replay_location():

pgtest02d/postgres R # SELECT (current_timestamp - ts) AS lag_time, greatest(0,
pg_xlog_location_diff(location::pg_lsn, pg_last_xlog_replay_location()))
AS lag_bytes FROM repl_mon ;
    lag_time     | lag_bytes
-----------------+-----------
 00:00:00.516017 |         0
(1 row)

Time: 0.724 ms
pgtest02d/postgres R #

Important thing here is that it does not require superuser rights.

For this thing to work you need to execute make and sudo make install in the source directory. And then add repl_mon to shared_preload_libraries and restart PostgreSQL.

I hope, someone will find it useful.

P.S. A special thank is to say to Michael Paquier, who supports pg_plugins — a set of simple templates for PostgreSQL extensions. Most of the code I copied from there.


Comments

comments powered by Disqus