as part of an effort to reduce our dependence on an old server (moly), we should move dictyotum (a non-redundant server) to a different host, probably the FSN* cluster.
dictyotum being the Bacula director, it might be worth taking this opportunity to test the bacula director recovery procedures (#30880 (moved)). also test the installer problem described in #31781 (moved) while we're here.
The backup/restore procedures for the director changed, so we might want to test those instead of duplicating the machine. It would also exercise the bacula::director class from the bottom up, which would be a great test.
It picked 116.202.120.168 and 2a01:4f8:fff0:4f:266:37ff:fe90:5790 as IPs and allocated it on fsn-node-02.
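For context, the instance creation on the ganeti cluster goes roughly like this (a sketch only: the OS variant, disk sizes and backend parameters are illustrative and the real invocation is in the TPA ganeti docs; -I hail lets the allocator pick the node, which is how fsn-node-02 got chosen):

gnt-instance add \
    -o debootstrap+default \
    -t drbd -I hail \
    --disk 0:size=10G \
    -B memory=2G,vcpus=2 \
    --no-ip-check --no-name-check \
    bacula-director-01.torproject.org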
I also followed the rest of the procedure in the ganeti and new-machine docs (a couple of those steps are spot-checked in the sketch after this list):
changed the root password and set it in our password manager
added reverse DNS to the Hetzner robot
checked fstab, resolv.conf
added to LDAP
added to Puppet
ran first upgrade
added to Nagios
added to the spreadsheet
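A couple of those steps, spot-checked (the addresses are the ones the allocator picked above; the upgrade commands are just the usual Debian ones):

dig -x 116.202.120.168 +short
dig -x 2a01:4f8:fff0:4f:266:37ff:fe90:5790 +short   # both should return bacula-director-01.torproject.org
apt update && apt full-upgrade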
Next step is to run puppet with the bacula::director role and see what happens. Will probably need to set up the PostgreSQL database (by hand?) as well. And then decom dictyotum.
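A rough sketch of that step (the role name is the one mentioned above; whether the catalog database really has to be created by hand, and its name and owner, are assumptions here):

# on the new director, once the bacula::director role is assigned in puppet:
puppet agent --test
# if the catalog database needs to be created by hand first (names illustrative):
sudo -u postgres createuser --no-superuser --no-createdb --no-createrole bacula
sudo -u postgres createdb --owner bacula bacula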
I tried to restore the database from dictyotum, and failed. The docs have been updated, but we need to figure out the direct procedure for this to work, because there isn't enough space on the backup server.
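One possible "direct" procedure (an assumption, not necessarily what the docs settled on) is to stream the dump straight from the old host into the new database, so nothing has to be staged on the backup server:

# run on the new director; database name illustrative
ssh dictyotum.torproject.org 'sudo -u postgres pg_dump bacula' \
    | sudo -u postgres psql bacula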
started the director (it was actually not stopped, but that didn't seem to matter)
re-enabled and ran puppet on the director, holding a lock on the scheduler
switched over a few nodes (perdulce at first, then pauli and alberti) and ran backup jobs on them
switched over all nodes, and ran puppet everywhere
ran puppet on the storage and new director servers
released the lock on the scheduler
I ran another backup job on crm-int-01 because it seems like an important server to back up, and I might run more manual jobs on a few other servers like this (but not all of them, so that we also know the scheduler works).
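For reference, a manual job like that can be kicked off from the new director's console roughly like this (the job name is illustrative; it typically matches the client):

echo 'run job="crm-int-01.torproject.org" level=Incremental yes' | bconsole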
Once all backups return to normal, I guess it will be time to decom dictyotum!
Subject: Cron <bacula@bungei> chronic /usr/local/bin/bacula-unlink-removed-volumes -v
To: root@bungei.torproject.org
Date: Sat, 12 Oct 2019 00:00:02 +0000

Traceback (most recent call last):
  File "/usr/local/bin/bacula-unlink-removed-volumes", line 64, in <module>
    conn = psycopg2.connect(args.db)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not connect to server: Connection refused
    Is the server running on host "dictyotum.torproject.org" (2620:0:6b0:b:1a1a:0:26e5:481b) and accepting TCP/IP connections on port 5433?
could not connect to server: Connection refused
    Is the server running on host "dictyotum.torproject.org" (38.229.72.27) and accepting TCP/IP connections on port 5433?
As it turns out, postgresql.conf also needed configuring. I tried to add the following statement:
I also had to fix /etc/postgresql-common/pg-service.conf on bungei to point to the right host, but the cert verification still fails. I suspect we'll need to reissue or redistribute those certs somehow, although it's not clear to me why right now.
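The fix amounts to pointing the relevant service stanza in that file (assumed here to be [bacula]) at the new director, i.e. host=bacula-director-01.torproject.org, with whatever port, dbname and TLS options TPA uses there. The connection can then be tested with something like:

# run on bungei; service name per the assumption above
psql 'service=bacula' -c 'SELECT 1'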
It's the weekend now, and I think we can survive without bungei cleaning up its old cruft for now.
2019-10-12 05:53:41.638 UTC [18889] tor-backup@[unknown] FATAL: password authentication failed for user "tor-backup"
2019-10-12 05:53:41.638 UTC [18889] tor-backup@[unknown] DETAIL: Password does not match for user "tor-backup". Connection matched pg_hba.conf line 108: "hostssl replication tor-backup 2a01:4f9:2b:1a05::2/128 md5"
12-Oct 14:05 bacula-director-01.torproject.org-dir JobId 0: Fatal error: bdb.h:142 bdb.h:142 query SELECT ClientId,Uname,AutoPrune,FileRetention,JobRetention FROM Client WHERE Name='mandos-01.torproject.org-fd' failed:
no connection to the server
12-Oct 14:06 bacula-director-01.torproject.org-dir JobId 0: Error: sql_create.c:524 Create DB Client record INSERT INTO Client (Name,Uname,AutoPrune,FileRetention,JobRetention) VALUES ('mandos-01.torproject.org-fd','',1,2592000,8640000) failed. ERR=no connection to the server
12-Oct 14:07 bacula-director-01.torproject.org-dir JobId 0: Fatal error: Could not create Client record. ERR=Query failed: INSERT INTO Log (JobId, Time, LogText) VALUES (0,'2019-10-12 14:06:47','bacula-director-01.torproject.org-dir JobId 0: Error: sql_create.c:524 Create DB Client record INSERT INTO Client (Name,Uname,AutoPrune,FileRetention,JobRetention) VALUES (mandos-01.torproject.org-fd,__,1,2592000,8640000) failed. ERR=no connection to the server
'): ERR=no connection to the server
weasel also pointed out that the archive_command that was set is incorrect, as it points to the old cluster name (bacula). That was fixed in the config, and the docs were updated to check for this on deployment.
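A quick way to check this at deployment time (the path is the one mentioned below):

grep archive_command /etc/postgresql/9.6/main/conf.d/tor.conf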
The email message was silenced by changing the cluster name in the archive_command in /etc/postgresql/9.6/main/conf.d/tor.conf.
Not sure about the former; it triggered again about 20 minutes ago, which seems to correlate with the last email warning, so maybe that is fixed as well? Undetermined - I don't understand why the tor-backup password would have changed here, since it should be on the bungei side of things. I did not change that password in the deployment.
Those are unrelated errors after all: the tor-backup password changed because that password is in puppet, so it was uniquely generated for the new director. I reset the password and started a base backup by hand (running in a screen), which seems to be working correctly now.
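The manual base backup was roughly of this shape (a sketch: destination directory, port and flags are illustrative; run on bungei inside the screen session):

pg_basebackup -h bacula-director-01.torproject.org -p 5433 -U tor-backup \
    -D /srv/backups/pg/bacula-director-01.torproject.org -X stream -P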
It's unclear what happened, but I think restarting the director service solved it. The emails were still coming in and the backups were not being recorded in "status director" in the bacula console. Now that I've restarted the director, backups from the scheduler seem to be queuing up and are being recorded in the "status director" output.
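The restart itself is just the stock service restart (assuming Debian's bacula-director unit name):

systemctl restart bacula-director
systemctl status bacula-director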
The new director seems to be fully online and operational. It regularly schedules backups, and I just performed a test restore to see if that worked as well. It did, although the job creation seemed to hang for a little while for some unknown reason.
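A test restore like that can be driven from bconsole roughly as follows (client name and restore path are illustrative):

echo 'restore client=perdulce.torproject.org-fd select current all done yes where=/var/tmp/bacula-restore' | bconsole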
dictyotum is now shut down; I will wait until tomorrow to see if anything breaks, then finish the decom process.
1. undefined the host
4. planned LV removal in 7 days
5. removed from LDAP
6. removed from (reverse) DNS (b.1.8.4.5.e.6.2.0.0.0.0.a.1.a.1.b.0.0.0.0.b.6.0.0.0.0.0.0.2.6.2.ip6.arpa and 27.72.229.38.in-addr.arpa, AKA 38.229.72.27 and 2620:0:6b0:b:1a1a:0:26e5:481b)
7. revoked in puppet
8. removed from puppet code
9. removed from tor-passwords/hosts
10. removed from spreadsheet and wiki
11. removed from nagios
12. scheduled backup removals in 30 days
13. nothing in LE, so N/A
14. not a physical machine, so N/A
That's it! We're done here.
Trac: Status: needs_review to closed; Resolution: N/A to fixed
the at job failed with this (rather unhelpful) error:
Subject: Output from your job 2
To: root@moly.torproject.org
Date: Fri, 25 Oct 2019 09:22:00 +0000

  Volume group "vgname" not found
  Cannot process volume group vgname
After looking through history, I found this command:
echo 'lvremove -y vgname/lvname' | at now + 7 days
Now, it being 7 days after the latest comment here, I assumed this was dictyotum failing to be removed because of an operator (me) error, and I could confirm the LVs are still there. So I removed them by hand:
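Something like the following would do it (the VG/LV names are purely illustrative; the real ones are dictyotum's disks in moly's volume group):

lvs | grep -i dictyotum          # confirm the leftover LVs
lvremove -y vg_moly/dictyotum-root vg_moly/dictyotum-swap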