Once upon a time as a result of two human errors while deploying new code on our databases we did DROP SCHEMA data CASCADE; on all shards of one of our clusters with more than 3 TB of data. It added us gray hair, allowed us to check our PITR skills in production and made us to treat backups differently.

That story had happy end. The incident occured in the end of the working day when workload was already descreasing and by morning we restored everything from backups to the needed point of time. We have always been doing backups and have always been monitoring the fact they are done. But we threw checking of the ability to restore from them when we migrated to barman because of high cost.

Recovery of one shard took more time than others because we could not restore from last backup and we had to restore from second last (we do backups every night). For that reason after fuckup we decided to get back checking of backups consistency. As a result there are a couple of scripts which could be seen here. One of them (check_backup_consistency.py) sequentially deploys last backup of each cluster, starts PostgreSQL with recovery_target = 'immediate' and waits for reaching consistent state.

The second one (check_xlogs.sh) checks that backup server contains all needed WALs (from the first WAL of first backup to the last archived WAL). Generally, archiver guarantees the sequence in archiving WALs and if you configure archive_command the right way you should not have problems with that. But we had situations when free space on partition with pg_xlog ended and we changed archive_command to move WALs locally. The first deploy would return archive_command back but locally copied WALs could be forgotten.

We run these checks with cron and monitoring scripts look at status-files created in /tmp. We start doing backups at 2 a.m. and the last one ends around 6 a.m. (thanks to incremental backups in barman 1.4). And in the middle of the day (around 2-3 p.m.) we already know if our backups are consistent and if we can do DROP SCHEMA again :)

Perhaps, someone would find this scripts useful. Feel free to ask questions.


Comments

comments powered by Disqus