Opened 4 weeks ago

Last modified 45 hours ago

#31786 needs_review task

move dictyotum off moly

Reported by: anarcat Owned by: anarcat
Priority: Medium Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: #29974 Points:
Reviewer: Sponsor:

Description (last modified by anarcat)

as part of an effort to reduce our dependence on an old server (moly), we should move dictyotum (a non-redundant server) to a different host, probably the FSN* cluster.

dictyotum being the Bacula director, it might be worth taking this opportunity to test the bacula director recovery procedures (#30880). also test the installer problem described in #31781 while we're here.

Child Tickets

Change History (12)

comment:1 Changed 4 weeks ago by anarcat

Description: modified (diff)

comment:2 Changed 2 weeks ago by anarcat

the backup/restore procedures for the director changed, so we might want to test those instead of duplicating the machine. that would also exercise the bacula::director class from the bottom up, which would be a great test in itself.

comment:3 Changed 11 days ago by anarcat

Created a VM on the ganeti cluster with this:

gnt-instance add \
      -o debootstrap+buster \
      -t drbd --no-wait-for-sync \
      --disk 0:size=10G \
      --disk 1:size=2G,name=swap \
      --disk 2:size=200G,vg=vg_ganeti_hdd \
      --backend-parameters memory=8g,vcpus=2 \
      --net 0:ip=pool,network=gnt-fsn \
      --no-name-check \
      --no-ip-check \
      bacula-director-01.torproject.org

It picked 116.202.120.168 and 2a01:4f8:fff0:4f:266:37ff:fe90:5790 as IPs and allocated it on fsn-node-02.
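For the record, the placement and disk layout can be double-checked from the ganeti master with stock gnt-instance commands; a quick sketch:

    # on the gnt-fsn master
    gnt-instance info bacula-director-01.torproject.org     # shows primary/secondary nodes, disks, NICs
    gnt-instance console bacula-director-01.torproject.org  # serial console, handy for the first boot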

I also followed the rest of the procedure in the ganeti and new-machine docs:

  1. changed the root password and set it in our password manager
  2. added reverse DNS to the Hetzner robot
  3. checked fstab, resolv.conf
  4. added to LDAP
  5. added to Puppet
  6. ran the first upgrade (see the sketch after this list)
  7. added to Nagios
  8. added to the spreadsheet
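
Steps 3 and 6 boil down to a couple of shell commands on the VM itself; roughly (the new-machine docs remain the authoritative reference):

    # on bacula-director-01, over the console or ssh
    cat /etc/fstab /etc/resolv.conf   # step 3: sanity-check what the installer produced
    apt update && apt full-upgrade    # step 6: first upgrade
    reboot                            # pick up a newer kernel if one came in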

Next step is to run puppet with the bacula::director role, and see what happens. will probably need to set up postgres (by hand?) as well. And then decom dictyotum.
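
A sketch of that next step, assuming the usual agent invocation once the role is assigned in Puppet (unit names are the stock Debian ones and may not match our setup exactly):

    # on bacula-director-01, after assigning the bacula::director role
    puppet agent --test
    # then check that the pieces came up
    systemctl status postgresql@9.6-main.service bacula-director.service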

comment:4 Changed 6 days ago by anarcat

i tried to restore the database from dictyotum, and failed. the docs have been updated, but we need to figure out a direct (server-to-server) procedure for this to work, because there isn't enough space on the backup server.

comment:5 Changed 4 days ago by anarcat

with weasel's help, i figured out a direct procedure and restored the server. next up is to figure out how to operate the transition. weasel proposed the following (a rough sketch of the postgres side follows the list):

  1. set up streaming replication,
  2. shut down the bacula director on dictyotum,
  3. switch things over to the new director with puppet, run it everywhere (make sure director is not restarted on dictyotum)
  4. promote the pg on the new host to primary (or whatever it's called)
  5. see if you see jobs in bconsole
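
For the postgres side (steps 1 and 4), a minimal sketch with stock 9.6 tooling, assuming a replication role already exists on dictyotum; the role name and data directory here are illustrative, not our actual configuration:

    # 1. seed the new host as a streaming replica of dictyotum
    #    (as postgres on bacula-director-01; stop the cluster and empty the data directory first)
    pg_basebackup -h dictyotum.torproject.org -p 5433 -U replicator \
        -D /var/lib/postgresql/9.6/main -R -X stream -P
    pg_ctlcluster 9.6 main start

    # 4. once the director has been switched over, promote the replica to primary
    pg_ctlcluster 9.6 main promote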

comment:6 Changed 3 days ago by anarcat

Status: assigned → needs_review

the procedure I actually followed is documented in "Restore the directory server" in:

https://help.torproject.org/tsa/howto/backup/#index5h2

Specifically, I have:

  1. shut down postgres on both servers
  2. redid a BASE backup
  3. restored the base backup on the new server
  4. changed the bacula user password in postgres
  5. started the director (it was actually not stopped, but that didn't seem to matter)
  6. re-enabled and ran puppet on the director, holding a lock on the scheduler
  7. switched over a few nodes (perdulce at first, then pauli and alberti) and ran backup jobs on them
  8. switched over *all* nodes, and ran puppet everywhere
  9. ran puppet on the storage and new director servers
  10. released the lock on the scheduler

I ran another backup job on crm-int-01 because it seems like an important server to back up, and might run more manual jobs on a few other servers like this, but not on all of them, so we know the scheduler works.
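
For the record, those manual jobs are just launched from bconsole on the new director; a sketch (job names here are illustrative, they are whatever the client jobs are called in our configuration):

    # on bacula-director-01; lines starting with * are typed at the bconsole prompt
    bconsole
    *run job=crm-int-01.torproject.org yes
    *status director      # scheduled and running jobs should show up here
    *messages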

Once all backups return to normal, I guess it will be time to decom dictyotum!

comment:7 Changed 2 days ago by anarcat

got this warning by email from bungei:

Subject: Cron <bacula@bungei> chronic /usr/local/bin/bacula-unlink-removed-volumes -v
To: root@bungei.torproject.org
Date: Sat, 12 Oct 2019 00:00:02 +0000

Traceback (most recent call last):
  File "/usr/local/bin/bacula-unlink-removed-volumes", line 64, in <module>
    conn = psycopg2.connect(args.db)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not connect to server: Connection refused
        Is the server running on host "dictyotum.torproject.org" (2620:0:6b0:b:1a1a:0:26e5:481b) and accepting
        TCP/IP connections on port 5433?
could not connect to server: Connection refused
        Is the server running on host "dictyotum.torproject.org" (38.229.72.27) and accepting
        TCP/IP connections on port 5433?

as it turns out, postgresql.conf *also* needed configuring. I added the following setting:

listen_addresses = '*'

but then bungei fails with:

root@bungei:~# psql "service=bacula user=bacula-bungei-reader"
psql: SSL error: certificate verify failed

i also had to fix /etc/postgresql-common/pg-service.conf on bungei to point to the right host, but the cert verification still fails. i suspect we'll need to reissue or redistribute those certificates somehow, although it's not clear to me why right now.
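
For reference, that service definition is a plain libpq service file; after the move it should look roughly like this (the service name and port come from the errors above; the dbname and sslmode are assumptions):

    # /etc/postgresql-common/pg-service.conf on bungei (sketch)
    [bacula]
    host=bacula-director-01.torproject.org
    port=5433
    dbname=bacula
    sslmode=verify-ca   # this is what makes libpq verify the server cert, hence the failure above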

it's weekend now, and i think we can survive it without bungei cleaning up its old cruft for now.

comment:8 Changed 2 days ago by anarcat

nevermind that last message, i figured it out and added it to the docs as well.

comment:9 Changed 2 days ago by anarcat

more problems this morning:

2019-10-12 05:53:41.638 UTC [18889] tor-backup@[unknown] FATAL:  password authentication failed for user "tor-backup"
2019-10-12 05:53:41.638 UTC [18889] tor-backup@[unknown] DETAIL:  Password does not match for user "tor-backup".
        Connection matched pg_hba.conf line 108: "hostssl replication     tor-backup      2a01:4f9:2b:1a05::2/128 md5     "

and also:

From: bacula-service@torproject.org 
Subject: Bacula: Backup Fatal Error of mandos-01.torproject.org-fd Differential
To: bacula-service@torproject.org
Date: Sat, 12 Oct 2019 14:08:27 +0000

12-Oct 14:05 bacula-director-01.torproject.org-dir JobId 0: Fatal error: bdb.h:142 bdb.h:142 query SELECT ClientId,Uname,AutoPrune,FileRetention,JobRetention FROM Client WHERE Name='mandos-01.torproject.org-fd' failed:
no connection to the server

12-Oct 14:06 bacula-director-01.torproject.org-dir JobId 0: Error: sql_create.c:524 Create DB Client record INSERT INTO Client (Name,Uname,AutoPrune,FileRetention,JobRetention) VALUES ('mandos-01.torproject.org-fd','',1,2592000,8640000) failed. ERR=no connection to the server

12-Oct 14:07 bacula-director-01.torproject.org-dir JobId 0: Fatal error: Could not create Client record. ERR=Query failed: INSERT INTO Log (JobId, Time, LogText) VALUES (0,'2019-10-12 14:06:47','bacula-director-01.torproject.org-dir JobId 0: Error: sql_create.c:524 Create DB Client record INSERT INTO Client (Name,Uname,AutoPrune,FileRetention,JobRetention) VALUES (''mandos-01.torproject.org-fd'','''',1,2592000,8640000) failed. ERR=no connection to the server

'): ERR=no connection to the server


weasel also pointed out that the `archive_command` that was set was incorrect, as it pointed to the old cluster name (`bacula`). that was fixed in the config, and the docs were updated to check for this on deployment.

comment:10 Changed 2 days ago by anarcat

the latter (the bacula email error) was silenced by changing the cluster name in the archive_command in /etc/postgresql/9.6/main/conf.d/tor.conf.
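
For future reference, a quick way to confirm the fix took effect (archive_command only needs a reload, not a restart):

    # on the new director, as postgres
    psql -c 'SELECT pg_reload_conf();'
    psql -c 'SHOW archive_command;'   # should now reference the new cluster name, not "bacula"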

not sure about the former: it triggered again about 20 minutes ago, which seems to correlate with the last email warning, so maybe that is fixed as well? undetermined - i don't understand why the tor-backup password would have changed here, since it should be on the bungei side of things. i did not change that password in the deployment.

comment:11 Changed 2 days ago by anarcat

those are unrelated errors after all, and the tor-backup password changed because that password *is* in puppet, so it was uniquely generated for the new director. i reset the password and started a base backup by hand, which seems to be working correctly now and is running in screen.
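
For the record, the reset and the manual base backup amount to roughly the following (a sketch: the password value comes from puppet, and the destination directory and exact invocation on bungei are guesses, since the real base backup goes through our backup scripts):

    # on the new director, as postgres: set the password puppet generated for the replication role
    psql -c "ALTER ROLE \"tor-backup\" WITH PASSWORD 'xxx';"

    # on bungei, inside screen: pull a fresh base backup over the replication connection
    screen -S base-backup
    pg_basebackup -h bacula-director-01.torproject.org -p 5433 -U tor-backup \
        -D /srv/backups/pg/bacula-director-01/base.new -X stream -P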

documented in the wiki as well.

comment:12 Changed 45 hours ago by anarcat

it's unclear what happened, but i think restarting the director service solved it... the emails were still coming in and the backups were not being recorded in "status director" in the bacula console. now that i have restarted the director, backups seem to be coming in from the scheduler and are being recorded in the "status director" output.

let's see if this plane can fly for a day now.
