Opened 3 months ago

Last modified 13 days ago

#30880 accepted task

document backup/restore procedures

Reported by: anarcat Owned by: anarcat
Priority: Medium Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Backup system design and restore procedures are currently not well documented in our wiki. Try a few restores and document the heck out of this. The ops report card recommends services be documented with a template like this:

  1. Overview: Overview of the service: what is it, why do we have it, who are the primary contacts, how to report bugs, links to design docs and other relevant information.
  2. Build: How to build the software that makes the service. Where to download it from, where the source code repository is, steps for building and making a package or other distribution mechanisms. If it is software that you modify in any way (open source project you contribute to or a local project) include instructions for how a new developer gets started. Ideally the end result is a package that can be copied to other machines for installation.
  3. Deploy: How to deploy the software. How to build a server from scratch: RAM/disk requirements, OS version and configuration, what packages to install, and so on. If this is automated with a configuration management tool like cfengine/puppet/chef (and it should be), then say so.
  4. Common Tasks: Step-by-step instructions for common things like provisioning (add/change/delete), common problems and their solutions, and so on.
  5. Pager Playbook: A list of every alert your monitoring system may generate for this service and a step-by-step "what to do when..." for each of them.
  6. DR: Disaster Recovery Plans and procedure. If a service machine died how would you fail-over to the hot/cold spare?
  7. SLA: Service Level Agreement. The (social or real) contract you make with your customers. Typically things like Uptime Goal (how many 9s), RPO (Recovery Point Objective) and RTO (Recovery Time Objective).

While we don't use that template anywhere yet (and it somewhat conflicts with the documentation best practices), we can probably find a middle ground of some sort.

Change History (6)

comment:1 Changed 7 weeks ago by anarcat

i dumped some stuff in https://help.torproject.org/tsa/howto/backup/ after doing some tests.

notable stuff missing:

  • postgres backup setup
  • postgres restore procedure
  • disaster recovery procedure (baremetal recovery)
  • anything else?

comment:2 Changed 5 weeks ago by anarcat

Status: assigned → accepted

i documented postgresql and mysql backup design and restore procedures.

one thing that's missing is bare-metal recovery. i also wonder how to recover from a failure on the directory server. and finally, the host retirement procedure doesn't include backups, so we should document that as well.

i looked a bit at the host retirement procedure, and it seems it's not well supported in Bacula. i found shell scripts like this or this or this that manually look at the database and prune old stuff.

it seems that manually deleting the volumes is basically how things work: the prune/purge/delete commands in the bconsole documentation all say they only operate on the Catalog, not the actual volume/pool, which is manually operated.

but i'd sure like to have a second opinion here.

anyways, in short, I'd like to add the following documentation before this ticket can be closed:

  • disaster recovery
  • directory server failure
  • host retirement procedure

Last edited 5 weeks ago by anarcat

comment:3 Changed 5 weeks ago by anarcat

This SQL query will create a series of commands that can be fed into bconsole to "purge" old volumes:

SELECT 'prune yes volume=' || volumename
  FROM media
 WHERE lastwritten < NOW() - INTERVAL '30 days'
   AND poolid IN (SELECT poolid FROM pool WHERE name LIKE '%arlgirdense.torproject.org')
 ORDER BY lastwritten;

Yet this also lists volumes that are already pruned, so I'm not sure it's the right approach. Maybe we should use delete instead of prune? In that case, according to Langille's blog, we'd also need AND voljobs=0 to avoid deleting volumes that still have jobs referencing them.
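
For illustration, here is a minimal Python sketch of the same selection logic with the extra voljobs=0 condition applied. It assumes the rows have already been fetched from the media table as (volumename, lastwritten, voljobs) tuples. The function name and its parameters are hypothetical, and the command prefix is simply the one used in the query above. It only builds the command strings and doesn't talk to bconsole or the database.

```python
from datetime import datetime, timedelta


def build_bconsole_commands(rows, now, max_age_days=30, prefix="prune yes volume="):
    """Return one bconsole command per old, unreferenced volume.

    rows: iterable of (volumename, lastwritten, voljobs) tuples, assumed
    to come from the catalog's media table (hypothetical input shape).
    """
    cutoff = now - timedelta(days=max_age_days)
    selected = [
        (lastwritten, name)
        for name, lastwritten, voljobs in rows
        # keep only volumes older than the cutoff with no jobs referencing them
        if lastwritten < cutoff and voljobs == 0
    ]
    # oldest first, like ORDER BY lastwritten
    return [prefix + name for _, name in sorted(selected)]


if __name__ == "__main__":
    sample = [
        ("vol-a", datetime(2019, 1, 1), 0),
        ("vol-b", datetime(2019, 6, 1), 0),
        ("vol-c", datetime(2019, 2, 1), 3),
    ]
    for command in build_bconsole_commands(sample, now=datetime(2019, 6, 15)):
        print(command)
```

The output is meant to be reviewed by hand before being fed to bconsole. Switching from prune to delete would only mean passing a different prefix.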

Our database is pretty big:

/dev/vdc           197G    131G   57G  70% /var/lib/postgresql

... so it might actually be worth removing old cruft. But our priority should be to remove the actual *files* from older backups, and maybe that's as simple as removing them by hand from the storage daemon.
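
If removing by hand turns out to be the way to go, a dry run that only lists candidates would be a safer first step. This is a rough Python sketch that assumes the volumes are plain files in a single directory on the storage daemon and that file modification time is a good enough proxy for age. The function name and the directory argument are hypothetical, and nothing is deleted.

```python
import os
import time


def list_stale_files(directory, max_age_days=30, now=None):
    """Return sorted paths of regular files last modified before the cutoff.

    Dry run only: nothing is removed. The directory layout is an
    assumption, not the actual storage daemon configuration.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for entry in os.scandir(directory):
        # skip subdirectories and symlinks, look at regular files only
        if not entry.is_file(follow_symlinks=False):
            continue
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            stale.append(entry.path)
    return sorted(stale)
```

The resulting list would still need to be cross-checked against the Catalog before anything is actually removed.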

comment:4 Changed 5 weeks ago by anarcat

directory server failure is now partly documented, but untested.

comment:5 Changed 5 weeks ago by anarcat

it seems that just removing the data from the host is okay for now. we still need to figure out how to tell bacula a client is dead, so that its volumes and pools and jobs and so on are purged from the database.

comment:6 Changed 13 days ago by anarcat

there's an opportunity to test the director recovery procedure in #29974

i'd suggest we don't need to purge the psql database for now, that's kind of out of scope of daily operations.
