Opened 4 months ago

Closed 3 months ago

Last modified 3 months ago

#31786 closed task (fixed)

move dictyotum off moly

Reported by: anarcat Owned by: anarcat
Priority: Medium Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: #29974 Points:
Reviewer: Sponsor:

Description (last modified by anarcat)

as part of an effort to reduce our dependence on an old server (moly), we should move dictyotum (a non-redundant server) to a different host, probably the FSN* cluster.

dictyotum being the Bacula director, it might be worth taking this opportunity to test the bacula director recovery procedures (#30880). also test the installer problem described in #31781 while we're here.

Child Tickets

Change History (15)

comment:1 Changed 4 months ago by anarcat

Description: modified (diff)

comment:2 Changed 4 months ago by anarcat

the backup/restore procedures for the director changed, so we might want to test those instead of duplicating the machine. it would also exercise the bacula::director class from the bottom up, which would be a great test.

comment:3 Changed 4 months ago by anarcat

Created a VM on the ganeti cluster with this:

gnt-instance add \
      -o debootstrap+buster \
      -t drbd --no-wait-for-sync \
      --disk 0:size=10G \
      --disk 1:size=2G,name=swap \
      --disk 2:size=200G,vg=vg_ganeti_hdd \
      --backend-parameters memory=8g,vcpus=2 \
      --net 0:ip=pool,network=gnt-fsn \
      --no-name-check \
      --no-ip-check \

It picked and 2a01:4f8:fff0:4f:266:37ff:fe90:5790 as IPs and allocated it on fsn-node-02.

I also followed the rest of the procedure in the ganeti and new-machine docs:

  1. changed the root password and set it in our password manager
  2. added reverse DNS to the Hetzner robot
  3. checked fstab, resolv.conf
  4. added to LDAP
  5. added to Puppet
  6. ran the first upgrade
  7. added to Nagios
  8. added to the spreadsheet

Next step is to run puppet with the bacula::director role and see what happens. Will probably need to set up postgres (by hand?) as well. And then decom dictyotum.

comment:4 Changed 4 months ago by anarcat

i tried to restore the database from dictyotum, and failed. the docs have been updated, but we need to figure out a direct procedure for this to work, because there isn't enough space on the backup server.

comment:5 Changed 4 months ago by anarcat

with weasel's help, i figured out a direct procedure and restored the server, next up is to figure out how to operate the transition. weasel proposed:

  1. set up streaming replication,
  2. shut down the bacula director on dictyotum,
  3. switch things over to the new director with puppet, run it everywhere (make sure director is not restarted on dictyotum)
  4. promote the pg on the new host to primary (or whatever it's called)
  5. see if you see jobs in bconsole
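
A rough sketch of that plan, under the assumption that the standard Debian PostgreSQL 9.6 tooling is used (the new director's hostname never appears in this ticket, and the service names are guesses):

```shell
# 1. on the new host: clone dictyotum's cluster as a streaming
#    standby; -R writes a recovery.conf pointing back at the primary
#    (port 5433 is taken from the errors later in this ticket)
sudo -u postgres pg_basebackup \
    -h dictyotum.torproject.org -p 5433 -U tor-backup \
    -X stream -R -D /var/lib/postgresql/9.6/main

# 2./3. stop the director on dictyotum, switch nodes over with puppet

# 4. promote the standby to primary
pg_ctlcluster 9.6 main promote

# 5. check that the new director sees its jobs
echo 'status director' | bconsole
```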

comment:6 Changed 4 months ago by anarcat

Status: assigned → needs_review

the procedure I actually followed is documented in "Restore the directory server" in:

Specifically, I have:

  1. shutdown postgres on both servers
  2. redid a BASE backup
  3. restored the base backup on the new server
  4. changed the bacula user password in postgres
  5. started the director (it was actually not stopped, but that didn't seem to matter)
  6. re-enabled and ran puppet on the director, holding a lock on the scheduler
  7. switched over a few nodes (perdulce at first, then pauli and alberti) and ran backup jobs on them
  8. switched over *all* nodes, and ran puppet everywhere
  9. ran puppet on the storage and new director servers
  10. released the lock on the scheduler

I ran another backup job on crm-int-01 because it seems like an important server to back up. I might run more manual jobs on different servers like this, but not on all of them, so that we also know the scheduler works.
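
Those manual runs look something like this in bconsole (the exact job name is an assumption based on hostname-style naming; actual names come from the puppet-generated bacula configuration):

```shell
# kick off a manual full backup for one client, then watch it
echo 'run job=crm-int-01.torproject.org level=Full yes' | bconsole
echo 'status director' | bconsole
echo 'list jobs' | bconsole    # recent job results
```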

Once all backups return to normal, I guess it will be time to decom dictyotum!

comment:7 Changed 4 months ago by anarcat

got this warning by email from bungei:

Subject: Cron <bacula@bungei> chronic /usr/local/bin/bacula-unlink-removed-volumes -v
Date: Sat, 12 Oct 2019 00:00:02 +0000

Traceback (most recent call last):
  File "/usr/local/bin/bacula-unlink-removed-volumes", line 64, in <module>
    conn = psycopg2.connect(args.db)
  File "/usr/lib/python3/dist-packages/psycopg2/", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not connect to server: Connection refused
        Is the server running on host "" (2620:0:6b0:b:1a1a:0:26e5:481b) and accepting
        TCP/IP connections on port 5433?
could not connect to server: Connection refused
        Is the server running on host "" ( and accepting
        TCP/IP connections on port 5433?

as it turns out, postgresql.conf *also* needed configuring. I tried to add the following statement:

listen_addresses = '*'

but then bungei fails with:

root@bungei:~# psql "service=bacula user=bacula-bungei-reader"
psql: erreur SSL : certificate verify failed

i also had to fix /etc/postgresql-common/pg-service.conf on bungei to point at the right host, but the cert verification still fails. i suspect we'll need to reissue or redistribute those certificates somehow, although it's not clear to me why right now.
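
For reference, the two ends of this connection look roughly like the following; the director's hostname is hypothetical (it never appears in this ticket), and the sslmode is an assumption consistent with the certificate-verification error above:

```
# on the director, /etc/postgresql/9.6/main/conf.d/tor.conf
# ('*' listens on every interface; an explicit address list
# would be narrower):
listen_addresses = '*'
port = 5433

# on bungei, /etc/postgresql-common/pg-service.conf; the service
# name and user match the psql invocation above:
[bacula]
host=bacula-director-01.torproject.org
port=5433
dbname=bacula
sslmode=verify-ca
```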

it's weekend now, and i think we can survive it without bungei cleaning up its old cruft for now.

comment:8 Changed 4 months ago by anarcat

nevermind that last message, i figured it out and added it to the docs as well.

comment:9 Changed 4 months ago by anarcat

more problems this morning:

2019-10-12 05:53:41.638 UTC [18889] tor-backup@[unknown] FATAL:  password authentication failed for user "tor-backup"
2019-10-12 05:53:41.638 UTC [18889] tor-backup@[unknown] DETAIL:  Password does not match for user "tor-backup".
        Connection matched pg_hba.conf line 108: "hostssl replication     tor-backup      2a01:4f9:2b:1a05::2/128 md5     "

and also:

Subject: Bacula: Backup Fatal Error of Differential
Date: Sat, 12 Oct 2019 14:08:27 +0000

12-Oct 14:05 JobId 0: Fatal error: bdb.h:142 bdb.h:142 query SELECT ClientId,Uname,AutoPrune,FileRetention,JobRetention FROM Client WHERE Name='' failed:
no connection to the server

12-Oct 14:06 JobId 0: Error: sql_create.c:524 Create DB Client record INSERT INTO Client (Name,Uname,AutoPrune,FileRetention,JobRetention) VALUES ('','',1,2592000,8640000) failed. ERR=no connection to the server

12-Oct 14:07 JobId 0: Fatal error: Could not create Client record. ERR=Query failed: INSERT INTO Log (JobId, Time, LogText) VALUES (0,'2019-10-12 14:06:47',' JobId 0: Error: sql_create.c:524 Create DB Client record INSERT INTO Client (Name,Uname,AutoPrune,FileRetention,JobRetention) VALUES ('''','''',1,2592000,8640000) failed. ERR=no connection to the server

'): ERR=no connection to the server

weasel also pointed out that the `archive_command` that was set is incorrect as it points to the old cluster name (`bacula`), that was fixed in the config and the docs were updated to check that on deployment.

comment:10 Changed 4 months ago by anarcat

the latter email message was silenced by changing the cluster name in the archive_command in /etc/postgresql/9.6/main/conf.d/tor.conf.

not sure about the former; it triggered again about 20 minutes ago, which seems to correlate with the last email warning, so maybe that is fixed as well? undetermined - i don't understand why the tor-backup password would have changed here, since it should be on the bungei side of things. i did not change that password in the deployment.

comment:11 Changed 4 months ago by anarcat

those are unrelated errors after all, and the tor-backup password changed because that password *is* in puppet, so it was uniquely generated for the new director. i reset the password and started a base backup by hand, which seems to be working correctly now, running in a screen.

documented in the wiki as well.
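
Resetting that password amounts to something like this on the director; the password value shown is a placeholder for the puppet-generated one:

```shell
# make the role's password match what puppet generated for the
# new director (placeholder value, not the real secret)
sudo -u postgres psql -p 5433 \
    -c "ALTER ROLE \"tor-backup\" WITH PASSWORD 'from-puppet';"
# then restart the base backup by hand, e.g. inside a screen session
```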

comment:12 Changed 4 months ago by anarcat

it's unclear what happened, but i think restarting the director service solved it... the emails were still coming in and the backups were not being recorded in "status director" in the bacula console. now that i have restarted the director, backups seem to be queuing up from the scheduler and are being recorded in the "status director" output.

let's see if this plane can fly for a day now.

comment:13 Changed 3 months ago by anarcat

new director seems to be fully online and operational. it regularly schedules backups and i just performed a test restore to see if that worked as well. it did, although the job creation seemed to hang for a little while for some unknown reason.

dictyotum is now shut down; i will wait until tomorrow to see if anything breaks, then finish the decom process.

next step is step 3.

Last edited 3 months ago by anarcat (previous) (diff)

comment:14 Changed 3 months ago by anarcat

Resolution: fixed
Status: needs_review → closed
  1. undefined the host
  2. planned LV removal in 7 days
  3. removed from LDAP
  4. removed from (reverse) DNS ( and AKA 2620.0000.06b0.000b.1a1a.0000.26e5.481b)
  5. revoked in puppet
  6. removed from puppet code
  7. removed from tor-passwords/hosts
  8. removed from spreadsheet and wiki
  9. removed from nagios
  10. scheduled backup removals in 30 days
  11. nothing in LE, so N/A
  12. not a physical machine, so N/A

That's it! We're done here.

comment:15 Changed 3 months ago by anarcat

the at job failed with this (rather unhelpful) error:

Subject: Output from your job        2
Date: Fri, 25 Oct 2019 09:22:00 +0000

  Volume group "vgname" not found
  Cannot process volume group vgname

After looking through history, I found this command:

echo 'lvremove -y vgname/lvname' | at now + 7 days

Now being 7 days after the latest comment here, I assumed this was dictyotum failing to be removed because of an operator (me) error, and could confirm the LVs were still there. So I removed them by hand:

lvremove vg0/dictyotum-{boot,pg,root,swap}
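
The underlying mistake was that the documented command was scheduled with its vgname/lvname placeholders left literal. A safer variant expands the real names first (vg0 and the LV names per the manual cleanup above):

```shell
# build the real lvremove commands instead of scheduling the
# documented placeholders verbatim
vg=vg0
cmds=$(for lv in dictyotum-{boot,pg,root,swap}; do
    echo "lvremove -y $vg/$lv"
done)
echo "$cmds"
# to actually schedule the removal, pipe the commands into at(1):
#   echo "$cmds" | at now + 7 days
```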