Opened 4 weeks ago

Last modified 45 hours ago

#31786 needs_review task

move dictyotum off moly

Reported by: anarcat Owned by: anarcat
Priority: Medium Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: #29974 Points:
Reviewer: Sponsor:

Description (last modified by anarcat)

as part of an effort to reduce our dependence on an old server (moly), we should move dictyotum (a non-redundant server) to a different host, probably the FSN* cluster.

dictyotum being the Bacula director, it might be worth taking this opportunity to test the bacula director recovery procedures (#30880). also test the installer problem described in #31781 while we're here.

Child Tickets

Change History (12)

comment:1 Changed 4 weeks ago by anarcat

Description: modified (diff)

comment:2 Changed 2 weeks ago by anarcat

the backup/restore procedures for the director changed, so we might want to test those instead of duplicating the machine. that would also exercise the bacula::director class from the bottom up, which would be a great test in itself.

comment:3 Changed 11 days ago by anarcat

Created a VM on the ganeti cluster with this:

gnt-instance add \
      -o debootstrap+buster \
      -t drbd --no-wait-for-sync \
      --disk 0:size=10G \
      --disk 1:size=2G,name=swap \
      --disk 2:size=200G,vg=vg_ganeti_hdd \
      --backend-parameters memory=8g,vcpus=2 \
      --net 0:ip=pool,network=gnt-fsn \
      --no-name-check \
      --no-ip-check \
      bacula-director-01.torproject.org

It picked 116.202.120.168 and 2a01:4f8:fff0:4f:266:37ff:fe90:5790 as IPs and allocated it on fsn-node-02.
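For the record, the placement and disk layout can be double-checked from the ganeti master with stock gnt-instance commands; a quick sketch:

    # on the gnt-fsn master
    gnt-instance info bacula-director-01.torproject.org     # shows primary/secondary nodes, disks, NICs
    gnt-instance console bacula-director-01.torproject.org  # serial console, handy for the first boot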

I also followed the rest of the procedure in the ganeti and new-machine docs:

  1. changed the root password and set it in our password manager
  2. added reverse DNS to the Hetzner robot
  3. checked fstab, resolv.conf
  4. added to LDAP
  5. added to Puppet
  6. ran the first upgrade (see the sketch after this list)
  7. added to Nagios
  8. added to the spreadsheet
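
Steps 3 and 6 boil down to a couple of shell commands on the VM itself; roughly (the new-machine docs remain the authoritative reference):

    # on bacula-director-01, over the console or ssh
    cat /etc/fstab /etc/resolv.conf   # step 3: sanity-check what the installer produced
    apt update && apt full-upgrade    # step 6: first upgrade
    reboot                            # pick up a newer kernel if one came in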

Next step is to run puppet with the bacula::director role, and see what happens. will probably need to set up postgres (by hand?) as well. And then decom dictyotum.
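
A sketch of that next step, assuming the usual agent invocation once the role is assigned in Puppet (unit names are the stock Debian ones and may not match our setup exactly):

    # on bacula-director-01, after assigning the bacula::director role
    puppet agent --test
    # then check that the pieces came up
    systemctl status postgresql@9.6-main.service bacula-director.service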

comment:4 Changed 6 days ago by anarcat

i tried to restore the database from dictyotum, and failed. the docs have been updated, but we need to figure out a direct (server-to-server) procedure for this to work, because there isn't enough space on the backup server.

comment:5 Changed 4 days ago by anarcat

with weasel's help, i figured out a direct procedure and restored the server. next up is to figure out how to operate the transition. weasel proposed the following (a rough sketch of the postgres side follows the list):

  1. set up streaming replication,
  2. shut down the bacula director on dictyotum,
  3. switch things over to the new director with puppet, run it everywhere (make sure director is not restarted on dictyotum)
  4. promote the pg on the new host to primary (or whatever it's called)
  5. see if you see jobs in bconsole
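
For the postgres side (steps 1 and 4), a minimal sketch with stock 9.6 tooling, assuming a replication role already exists on dictyotum; the role name and data directory here are illustrative, not our actual configuration:

    # 1. seed the new host as a streaming replica of dictyotum
    #    (as postgres on bacula-director-01; stop the cluster and empty the data directory first)
    pg_basebackup -h dictyotum.torproject.org -p 5433 -U replicator \
        -D /var/lib/postgresql/9.6/main -R -X stream -P
    pg_ctlcluster 9.6 main start

    # 4. once the director has been switched over, promote the replica to primary
    pg_ctlcluster 9.6 main promote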

comment:6 Changed 3 days ago by anarcat

Status: assigned → needs_review

the procedure I actually followed is documented in "Restore the directory server" in:

https://help.torproject.org/tsa/howto/backup/#index5h2

Specifically, I have:

  1. shut down postgres on both servers
  2. redid a BASE backup
  3. restored the base backup on the new server
  4. changed the bacula user password in postgres
  5. started the director (it was actually not stopped, but that didn't seem to matter)
  6. re-enabled and ran puppet on the director, holding a lock on the scheduler
  7. switched over a few nodes (perdulce at first, then pauli and alberti) and ran backup jobs on them
  8. switched over *all* nodes, and ran puppet everywhere
  9. ran puppet on the storage and new director servers
  10. released the lock on the scheduler

I ran another backup job on crm-int-01 because it seems like an important server to back up, and might run more manual jobs on a few other servers like this, but not on all of them, so we know the scheduler works.
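
For the record, those manual jobs are just launched from bconsole on the new director; a sketch (job names here are illustrative, they are whatever the client jobs are called in our configuration):

    # on bacula-director-01; lines starting with * are typed at the bconsole prompt
    bconsole
    *run job=crm-int-01.torproject.org yes
    *status director      # scheduled and running jobs should show up here
    *messages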

Once all backups return to normal, I guess it will be time to decom dictyotum!

comment:7 Changed 2 days ago by anarcat

got this warning by email from bungei:

Subject: Cron <bacula@bungei> chronic /usr/local/bin/bacula-unlink-removed-volumes -v
To: root@bungei.torproject.org
Date: Sat, 12 Oct 2019 00:00:02 +0000

Traceback (most recent call last):
  File "/usr/local/bin/bacula-unlink-removed-volumes", line 64, in <module>
    conn = psycopg2.connect(args.db)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not connect to server: Connection refused
        Is the server running on host "dictyotum.torproject.org" (2620:0:6b0:b:1a1a:0:26e5:481b) and accepting
        TCP/IP connections on port 5433?
could not connect to server: Connection refused
        Is the server running on host "dictyotum.torproject.org" (38.229.72.27) and accepting
        TCP/IP connections on port 5433?

as it turns out, postgresql.conf *also* needed configuring. I added the following setting:

listen_addresses = '*'

but then bungei fails with:

root@bungei:~# psql "service=bacula user=bacula-bungei-reader"
psql: SSL error: certificate verify failed

i also had to fix /etc/postgresql-common/pg-service.conf on bungei to point to the right host, but the cert verification still fails. i suspect we'll need to reissue or redistribute those certificates somehow, although it's not clear to me why right now.
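
For reference, that service definition is a plain libpq service file; after the move it should look roughly like this (the service name and port come from the errors above; the dbname and sslmode are assumptions):

    # /etc/postgresql-common/pg-service.conf on bungei (sketch)
    [bacula]
    host=bacula-director-01.torproject.org
    port=5433
    dbname=bacula
    sslmode=verify-ca   # this is what makes libpq verify the server cert, hence the failure above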

it's weekend now, and i think we can survive it without bungei cleaning up its old cruft for now.

comment:8 Changed 2 days ago by anarcat

nevermind that last message, i figured it out and added it to the docs as well.

comment:9 Changed 2 days ago by anarcat

more problems this morning:

2019-10-12 05:53:41.638 UTC [18889] tor-backup@[unknown] FATAL:  password authentication failed for user "tor-backup"
2019-10-12 05:53:41.638 UTC [18889] tor-backup@[unknown] DETAIL:  Password does not match for user "tor-backup".
        Connection matched pg_hba.conf line 108: "hostssl replication     tor-backup      2a01:4f9:2b:1a05::2/128 md5     "

and also:

From: bacula-service@torproject.org 
Subject: Bacula: Backup Fatal Error of mandos-01.torproject.org-fd Differential
To: bacula-service@torproject.org
Date: Sat, 12 Oct 2019 14:08:27 +0000

12-Oct 14:05 bacula-director-01.torproject.org-dir JobId 0: Fatal error: bdb.h:142 bdb.h:142 query SELECT ClientId,Uname,AutoPrune,FileRetention,JobRetention FROM Client WHERE Name='mandos-01.torproject.org-fd' failed:
no connection to the server

12-Oct 14:06 bacula-director-01.torproject.org-dir JobId 0: Error: sql_create.c:524 Create DB Client record INSERT INTO Client (Name,Uname,AutoPrune,FileRetention,JobRetention) VALUES ('mandos-01.torproject.org-fd','',1,2592000,8640000) failed. ERR=no connection to the server

12-Oct 14:07 bacula-director-01.torproject.org-dir JobId 0: Fatal error: Could not create Client record. ERR=Query failed: INSERT INTO Log (JobId, Time, LogText) VALUES (0,'2019-10-12 14:06:47','bacula-director-01.torproject.org-dir JobId 0: Error: sql_create.c:524 Create DB Client record INSERT INTO Client (Name,Uname,AutoPrune,FileRetention,JobRetention) VALUES (''mandos-01.torproject.org-fd'','''',1,2592000,8640000) failed. ERR=no connection to the server

'): ERR=no connection to the server


weasel also pointed out that the `archive_command` that was set was incorrect, as it pointed to the old cluster name (`bacula`). that was fixed in the config, and the docs were updated to check for this on deployment.

comment:10 Changed 2 days ago by anarcat

the latter (the bacula email error) was silenced by changing the cluster name in the archive_command in /etc/postgresql/9.6/main/conf.d/tor.conf.
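
For future reference, a quick way to confirm the fix took effect (archive_command only needs a reload, not a restart):

    # on the new director, as postgres
    psql -c 'SELECT pg_reload_conf();'
    psql -c 'SHOW archive_command;'   # should now reference the new cluster name, not "bacula"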

not sure about the former: it triggered again about 20 minutes ago, which seems to correlate with the last email warning, so maybe that is fixed as well? undetermined - i don't understand why the tor-backup password would have changed here, since it should be on the bungei side of things. i did not change that password in the deployment.

comment:11 Changed 2 days ago by anarcat

those are unrelated errors after all, and the tor-backup password changed because that password *is* in puppet, so it was uniquely generated for the new director. i reset the password and started a base backup by hand, which seems to be working correctly now and is running in screen.
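
For the record, the reset and the manual base backup amount to roughly the following (a sketch: the password value comes from puppet, and the destination directory and exact invocation on bungei are guesses, since the real base backup goes through our backup scripts):

    # on the new director, as postgres: set the password puppet generated for the replication role
    psql -c "ALTER ROLE \"tor-backup\" WITH PASSWORD 'xxx';"

    # on bungei, inside screen: pull a fresh base backup over the replication connection
    screen -S base-backup
    pg_basebackup -h bacula-director-01.torproject.org -p 5433 -U tor-backup \
        -D /srv/backups/pg/bacula-director-01/base.new -X stream -P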

documented in the wiki as well.

comment:12 Changed 45 hours ago by anarcat

it's unclear what happened, but i think restarting the director service solved it... the emails were still coming in and the backups were not being recorded in "status director" in the bacula console. now that i have restarted the director, backups seem to be coming in from the scheduler and are being recorded in the "status director" output.

let's see if this plane can fly for a day now.
