#30880 closed task (fixed)

document backup/restore procedures

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Component: Internal Services/Tor Sysadmin Team
Severity: Normal


Backup system design and restore procedures are currently not well documented in our wiki. Try a few restores and document the heck out of this. The ops report card recommends services be documented with a template like this:

  1. Overview: Overview of the service: what is it, why do we have it, who are the primary contacts, how to report bugs, links to design docs and other relevant information.
  2. Build: How to build the software that makes the service. Where to download it from, where the source code repository is, steps for building and making a package or other distribution mechanisms. If it is software that you modify in any way (open source project you contribute to or a local project) include instructions for how a new developer gets started. Ideally the end result is a package that can be copied to other machines for installation.
  3. Deploy: How to deploy the software. How to build a server from scratch: RAM/disk requirements, OS version and configuration, what packages to install, and so on. If this is automated with a configuration management tool like cfengine/puppet/chef (and it should be), then say so.
  4. Common Tasks: Step-by-step instructions for common things like provisioning (add/change/delete), common problems and their solutions, and so on.
  5. Pager Playbook: A list of every alert your monitoring system may generate for this service and a step-by-step "what to do when..." for each of them.
  6. DR: Disaster Recovery Plans and procedure. If a service machine died how would you fail-over to the hot/cold spare?
  7. SLA: Service Level Agreement. The (social or real) contract you make with your customers. Typically things like Uptime Goal (how many 9s), RPO (Recovery Point Objective) and RTO (Recovery Time Objective).

While we don't use that template anywhere yet (and it somehow conflicts with the documentation best practices), we can probably find a middle ground of some sort...

Change History (9)

comment:1 Changed 15 months ago by anarcat

i dumped some stuff in https://help.torproject.org/tsa/howto/backup/ after doing some tests.

notable stuff missing:

  • postgres backup setup
  • postgres restore procedure
  • disaster recovery procedure (baremetal recovery)
  • anything else?

comment:2 Changed 14 months ago by anarcat

Status: changed from assigned to accepted

i documented postgresql and mysql backup design and restore procedures.

one thing that's missing is bare-metal recovery. i also wonder how to recover from a failure on the directory server. and finally, the host retirement procedure doesn't include backups, so we should document that as well.

i looked a bit at the host retirement procedure, and it seems it's not well supported in Bacula. i found shell scripts like this or this or this that manually look at the database and prune old stuff.

it seems that manually deleting the volumes is basically how things work: the prune/purge/delete commands in the bconsole documentation all say they only operate on the Catalog, not the actual volume/pool, which is manually operated.

but i'd sure like to have a second opinion here.

anyways, in short, I'd like to add the following documentation before this ticket can be closed:

  • disaster recovery
  • directory server failure
  • host retirement procedure
Last edited 14 months ago by anarcat

comment:3 Changed 14 months ago by anarcat

This SQL query will create a series of commands that can be fed into bconsole to "purge" old volumes:

    SELECT 'prune yes volume=' || volumename
      FROM media
     WHERE lastwritten < NOW() - INTERVAL '30 days'
       AND poolid IN (SELECT poolid
                        FROM pool
                       WHERE name LIKE '%arlgirdense.torproject.org')
     ORDER BY lastwritten;

Yet this also lists volumes that are already pruned, so I'm not sure it's the right thing. Maybe we should use delete instead of prune? In that case, according to Langille's blog, we'd also need AND voljobs=0 to avoid deleting volumes that still have jobs referencing them.
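
For illustration, here is a minimal Python sketch of the same selection logic, run against a few made-up rows instead of the real catalog. The volume names and dates are hypothetical, and the "delete" command simply reuses the word order of the prune command from the query above; it has not been checked against bconsole.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Volume(NamedTuple):
    # Mirrors the media columns used in the query above.
    volumename: str
    lastwritten: datetime
    voljobs: int


def bconsole_commands(volumes, now, action="prune", max_age_days=30):
    """Return one command line per old volume, oldest first."""
    cutoff = now - timedelta(days=max_age_days)
    old = [v for v in volumes if v.lastwritten < cutoff]
    if action == "delete":
        # Equivalent of adding "AND voljobs=0" to the WHERE clause:
        # skip volumes that still have jobs referencing them.
        old = [v for v in old if v.voljobs == 0]
    old.sort(key=lambda v: v.lastwritten)
    return [f"{action} yes volume={v.volumename}" for v in old]


# Hypothetical sample data, not taken from the real catalog.
now = datetime(2020, 1, 31)
sample = [
    Volume("vol-0001", datetime(2019, 11, 1), 0),
    Volume("vol-0002", datetime(2019, 12, 1), 3),
    Volume("vol-0003", datetime(2020, 1, 25), 0),
]

for line in bconsole_commands(sample, now, action="delete"):
    print(line)
```

With action="prune" both old volumes are listed, whether or not jobs still reference them; with action="delete" only the one with no remaining jobs is.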

Our database is pretty big:

/dev/vdc           197G    131G   57G  70% /var/lib/postgresql

... so it might actually be worth removing old cruft. But our priority should be to remove the actual *files* from older backups, and maybe that's as simple as removing them by hand from the storage daemon.

comment:4 Changed 14 months ago by anarcat

directory server failure is now partly documented, but untested.

comment:5 Changed 14 months ago by anarcat

it seems that just removing the data from the host is okay for now. we still need to figure out how to tell bacula a client is dead, so that its volumes and pools and jobs and so on are purged from the database.

comment:6 Changed 14 months ago by anarcat

there's an opportunity to test the director recovery procedure in #29974

i'd suggest we don't need to purge the psql database for now, that's kind of out of scope of daily operations.

comment:7 Changed 13 months ago by weasel

jobs expire with time. volumes become empty with time.

/etc/bacula/scripts/volumes-delete-old, run out of cron, deletes empty volumes and empty pools that do not belong to any client currently known to bacula/puppet.
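
The script itself is not shown in this ticket, so the following is only a rough Python sketch of one reading of that description: a pool is a candidate for deletion when all of its volumes are empty and its name does not end with the name of a known client. The pool names, the host names and the "name ends with the client's hostname" convention (borrowed from the LIKE pattern in comment:3) are assumptions.

```python
def orphaned_empty_pools(pools, known_clients):
    """Return names of empty pools that match no known client.

    pools maps a pool name to the list of job counts of its volumes.
    """
    orphans = []
    for name, job_counts in sorted(pools.items()):
        # A pool with no volumes at all also counts as empty here.
        is_empty = all(count == 0 for count in job_counts)
        # Suffix match, like "name LIKE '%host'" in SQL.
        has_owner = any(name.endswith(client) for client in known_clients)
        if is_empty and not has_owner:
            orphans.append(name)
    return orphans


# Hypothetical pools and clients.
pools = {
    "poolfull-alpha.example.org": [0, 0],  # empty, client still known
    "poolfull-beta.example.org": [0],      # empty, client retired
    "poolfull-gamma.example.org": [2, 0],  # client retired, jobs remain
}
known_clients = {"alpha.example.org"}

print(orphaned_empty_pools(pools, known_clients))
```

In this reading, a retired host's pool is left alone until its jobs have expired, which matches the idea that nothing needs to be cleaned up by hand.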

comment:8 Changed 13 months ago by anarcat

> jobs expire with time. volumes become empty with time.
>
> /etc/bacula/scripts/volumes-delete-old, run out of cron, deletes empty volumes and empty pools that do not belong to any client currently known to bacula/puppet.

so if I understand you correctly, there's actually nothing to do here? pools, volumes, jobs and all the stuff in the database naturally expire with time, and we don't need to clean up anything when we remove hosts, for example?

comment:9 Changed 12 months ago by anarcat

Resolution: fixed
Status: changed from accepted to closed

let's call https://help.torproject.org/tsa/howto/backup/ complete for now. i've successfully restored the director in #31786 and i believe this completes the minimum backup documentation.
