Opened 6 months ago

Closed 3 days ago

#32802 closed project (fixed)

retire kvm4, 8 VMs to migrate

Reported by: anarcat Owned by: anarcat
Priority: High Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Major Keywords: tpa-roadmap-may
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by anarcat)

kvm4 is getting fairly old. it was set up in 2015 and is showing signs of old age. for example, today it freaked us all out by not returning after a reboot right before the holidays (#32801). considering how critical that server is (email, puppet, ldap, jenkins, dns, web mirror, all the windows buildboxes!) we should start considering a decommissioning process.

at the very least, we need to get eugeni the heck out of there.

we have budget to provision another ganeti cluster, so let's use it to replace this, and hopefully more. the existing cluster has already taken more than its share by taking machines from both kvm1/textile and moly, so it's time we provision more hardware for this.

this requires a new ganeti node (fsn-node-06, #33907).

machines to be migrated:

  • [x] alberti.torproject.org (LDAP) #33908
  • [x] build-x86-09.torproject.org (build server) - RETIRE - replaced with build-x86-11 on gnt-fsn
  • [x] eugeni.torproject.org (email) #32803
  • [x] meronense.torproject.org (metrics) #33909
  • [x] neriniflorum.torproject.org (DNS) #33910
  • [x] oo-hetzner-03.torproject.org (metrics) #33911
  • [x] pauli.torproject.org (puppet) #33912
  • [x] rouyi.torproject.org (jenkins) #33913
  • [x] web-hetzner-01.torproject.org (static mirror) RETIRE
  • [x] weissii.torproject.org (windows build box) #33914
  • [x] winklerianum.torproject.org (windows build box) migrated along with weissii, but kept turned off
  • [x] woronowii.torproject.org (windows build box) migrated along with weissii, but kept turned off

Child Tickets

Ticket   Status  Owner    Summary                                                   Component
#32803   closed  weasel   migrate eugeni to the gnt-fsn cluster                     Internal Services/Tor Sysadmin Team
#33907   closed  anarcat  new gnt-fsn node (fsn-node-06)                            Internal Services/Tor Sysadmin Team
#33908   closed  weasel   migrate alberti to the ganeti cluster                     Internal Services/Tor Sysadmin Team
#33909   closed  weasel   meronense IP address change planned for Ganeti migration  Internal Services/Tor Sysadmin Team
#33910   closed  weasel   migrate neriniflorum to the ganeti cluster                Internal Services/Tor Sysadmin Team
#33911   closed  anarcat  oo-hetzner-03 retirement                                  Internal Services/Tor Sysadmin Team
#33912   closed  weasel   migrate pauli to the ganeti cluster                       Internal Services/Tor Sysadmin Team
#33913   closed  weasel   migrate rouyi to the ganeti cluster                       Internal Services/Service - jenkins
#33914   closed  weasel   migrate weissii to the ganeti cluster                     Internal Services/Tor Sysadmin Team

Change History (24)

comment:1 Changed 6 months ago by anarcat

here's the disaster recovery plan i made up on the fly in #32801, which is relevant to the discussion here:

According to the Nextcloud spreadsheet (since LDAP is down), [machines running on kvm4] includes:

host            service               impact                       mitigation
alberti         LDAP, db.tpo          critical, no passwd change   read-only copies everywhere
build-x86-09    buildbox              redundant                    N/A
eugeni          incoming mail, lists  critical, total outage       peek at tor-puppet/modules/postfix/files/virtual and email people directly
meronense       metrics.tpo           critical, total outage       ?
neriniflorum    DNS                   redundant, higher TTFB?      possible to remove from rotation
oo-hetzner-03   onionoo               redundant                    ?
pauli           puppet                major, no config management  use cumin, local git copies
rouyi           jenkins               critical, total outage       ?
web-hetzner-01  web mirror            redundant, no effect?        removed from rotation automatically
weissi          build box             no windows builds            N/A
woronowii       build box             no windows builds            N/A

I'll note that it seems both windows build boxes are on the same machine so even if jenkins *would* be able to dispatch builds, we wouldn't be able to do those...

Our disaster recovery plan so far is to wait for that rescue to succeed, which might take up to 24h but hopefully less.

If that fails, I would suggest the following plan:

  1. recover eugeni, pauli, alberti from backups on gnt-fsn or elsewhere (we need those three to build new machines)
  2. build a new ganeti cluster (because we can't recover all of this on gnt-fsn)
  3. restore remaining machines on the new cluster
  4. decommission kvm4 officially

This could take a few days of work. :(

Out of that, I would outline the following plan:

  1. in the short term: migrate eugeni, pauli and alberti to an HA cluster, probably gnt-fsn (yes, that means it will be over-allocated even more)
  2. in parallel or after (january): add a node or two to the ganeti cluster
  3. migrate meronense, neriniflorum, oo-hetzner-03, and rouyi to the new cluster

This would leave the following boxes on kvm4, with the following rationale:

  • build-x86-09 - highly redundant, not urgent
  • web-hetzner-01 - one web node already present in the gnt-fsn cluster, moving this will not bring us more redundancy
  • weissi - hard to migrate
  • woronowii - hard to migrate

At that point we'd have the choice to migrate the two windows VMs (ugh) and the build box to the ganeti cluster, and we'd probably decom web-hetzner-01 or move it to kvm5 or some other host, then decom kvm4.

How does that sound for a plan?

Tickets would need to be created for each one of those tasks.

comment:2 Changed 6 months ago by anarcat

i will also note that meronense has been seeing disk errors for a while now, in #32692. might be another good indication something is wrong with this box (although mdadm thinks everything is fine).

comment:3 Changed 5 months ago by anarcat

we don't have docs on how to move instances just yet, but i added a section in our ganeti manual that should be filled in when we do. for now it has references to external manuals that could be used:

https://help.torproject.org/tsa/howto/ganeti/#index14h2

comment:4 Changed 3 months ago by anarcat

Keywords: tpa-roadmap-april added

comment:5 Changed 7 weeks ago by anarcat

Description: modified (diff)
Owner: changed from tpa to anarcat
Status: new → accepted
Summary: decomission kvm4 → retire kvm4, 12 VMs to migrate

add details of the machines to migrate and link to new gnt-fsn node ticket

comment:6 Changed 7 weeks ago by anarcat

Description: modified (diff)
Summary: retire kvm4, 12 VMs to migrate → retire kvm4, 8 VMs to migrate

created a ticket for every VM i think should be migrated, which means we would retire 4 VMs here:

  • two windows build boxes
  • a static mirror
  • a build box

does this make sense?

comment:7 Changed 7 weeks ago by anarcat

Description: modified (diff)

re the build box, weasel says we can retire it, but we will eventually need to create build boxes in the gnt-fsn cluster at some point.

web-hetzner-01 can be retired.

comment:8 Changed 3 weeks ago by anarcat

Keywords: tpa-roadmap-may added; tpa-roadmap-april removed

comment:9 Changed 2 weeks ago by weasel

Description: modified (diff)

comment:10 in reply to:  7 Changed 2 weeks ago by weasel

Replying to anarcat:

re the build box, weasel says we can retire it, but we will eventually need to create build boxes in the gnt-fsn cluster at some point.

We have at least one build box on gnt-fsn. We can just shut down build-x86-09 when we retire kvm4. I'm leaving it running for now, because it doesn't hurt and it helps a little bit at times.

comment:11 Changed 2 weeks ago by anarcat

Description: modified (diff)

build 9 is replaced by build 11

comment:12 Changed 2 weeks ago by weasel

Description: modified (diff)

comment:13 Changed 2 weeks ago by weasel

Description: modified (diff)

comment:14 Changed 2 weeks ago by weasel

we set up new onionoo infra (cf. #31659), so oo-hetzner-03.torproject.org can be shut down and retired in a day or two

comment:15 Changed 10 days ago by anarcat

Description: modified (diff)

i just retired oo-hetzner-03; the next step here is to retire kvm4 itself, it seems.

comment:16 Changed 10 days ago by anarcat

started retirement procedure:

  1. N/A
  2. removed from nagios
  3. N/A
  4. removed from puppet, backups (fabric)
  5. removed from LDAP:
    380 host=kvm4,ou=hosts,dc=torproject,dc=org 
    objectClass: top 
    objectClass: debianServer 
    host: kvm4 
    hostname: kvm4.torproject.org 
    architecture: amd64 
    admin: torproject-admin@torproject.org 
    sshRSAHostKey: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCjqP9zljgAnZj666cVW00GaF4rII8t1gYCUvCHX8+OIxffowAuNfxVAfk1AFqCMZCkP8ULLJCevAfrcbc86uVnzQ9J0bND4gFJE4dTDQeCgVrPqR+bt48V2U/KbYC93q2QTRa/UjHitBxO7Z1ryvYh0J0HluJew9ZIBXZ21/uqkqxQ4GWLXo7fXHOHTzKEtP6wwpWiYc9IOEfe4+93vmNX0ubPfgnsAh+2+2/SPUQairOc4c7XmSQM/fyyadoIit/lYANuXfPNidXQtgie1jggyD+Ti72mtHI7pRlVSXKRhXOiuganboENv0Hb9KszStLYnkk/3jJnCAxcGP6VVdpx root@kvm4 
    sshRSAHostKey: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMB96RR18dP0BQ9ySEcpWuBcF4z5rET4KpvzOLKx4m9T root@kvm4 
    description: KVM host 
    purpose: KVM host 
    access: restricted 
    distribution: Debian 
    ipHostNumber: 94.130.38.33 
    ipHostNumber: 2a01:4f8:10b:239f::2 
    l: Falkenstein, Saxony, Germany 
    rebootPolicy: manual
    
  6. removed from source code (puppet, auto-dns, domains, wiki)
  7. removed from tor-passwords
  8. N/A (dnswl)
  9. removed from spreadsheet

last steps: disk wipes and cancellation with hetzner.

comment:17 Changed 10 days ago by anarcat

in progress:

nwipe --autonuke --method=random --verify=off /dev/sdb
nwipe --autonuke --method=random --rounds=2 /dev/nvme1n1
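the random-overwrite approach nwipe uses can be demonstrated on a throwaway file; this is only a sketch of the idea on a hypothetical scratch file, not the actual disk wipe:

```shell
# demonstrate the random-overwrite idea on a scratch file (hypothetical
# path; the real wipe above targets whole disks with nwipe)
scratch=$(mktemp)
printf 'secret data' > "$scratch"
# overwrite in place with random bytes, same size as the original content
size=$(wc -c < "$scratch")
dd if=/dev/urandom of="$scratch" bs=1 count="$size" conv=notrunc 2>/dev/null
# the original content should no longer be recoverable from the file
grep -q 'secret data' "$scratch" || echo "original content gone"
rm -f "$scratch"
```

nwipe additionally does multiple rounds and (optionally) a verify pass, which matters for spinning disks; for the NVMe drives a secure-erase at the controller level would be another option.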

next up is the final self-destruct procedure and cancellation.

ETA ~24h

Last edited 10 days ago by anarcat (previous) (diff)

comment:18 Changed 9 days ago by anarcat

sdb wipe completed. 12h remaining for nvme1n1.

since the machine has been removed from puppet/ldap, its public key is not available from the servers anymore. if you need to connect, you can use the following known_hosts:

94.130.38.33 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMB96RR18dP0BQ9ySEcpWuBcF4z5rET4KpvzOLKx4m9T root@kvm4

... and connect directly to the IP address.
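for reference, a one-off known_hosts file like this can be checked and used without touching your global known_hosts; a sketch (the final ssh command is the hypothetical part, since it needs the host to still be up):

```shell
# write the one-off known_hosts entry to a temp file and confirm
# ssh-keygen can look the host up in it
kh=$(mktemp)
echo '94.130.38.33 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMB96RR18dP0BQ9ySEcpWuBcF4z5rET4KpvzOLKx4m9T root@kvm4' > "$kh"
ssh-keygen -F 94.130.38.33 -f "$kh"   # prints the matching entry if found
# then connect using only that file for host key verification:
# ssh -o UserKnownHostsFile="$kh" root@94.130.38.33
rm -f "$kh"
```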

comment:19 Changed 9 days ago by anarcat

interestingly, the nvme wipe has already completed. so i'm now entering the last stage of the wiping process, which should logically complete in another 24h.

comment:20 Changed 9 days ago by anarcat

started this final wipe:

nwipe --autonuke --method=random --verify=off /dev/sda ; \
nwipe --autonuke --method=random --rounds=2 /dev/nvme0n1 ; \
echo "SHUTTING DOWN FOREVER IN ONE MINUTE" ; \
sleep 60 ; \
echo o > /proc/sysrq-trigger
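the last line uses the kernel's magic SysRq interface: writing `o` to /proc/sysrq-trigger powers the machine off immediately, without a clean shutdown, which is the point here since the root disk is being destroyed underneath the running system. a quick sanity check that the interface is present and enabled (Linux only):

```shell
# the trigger file exists when the kernel was built with CONFIG_MAGIC_SYSRQ;
# /proc/sys/kernel/sysrq shows which functions are enabled (1 = all)
test -e /proc/sysrq-trigger && echo "sysrq trigger present"
cat /proc/sys/kernel/sysrq
```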

comment:21 Changed 9 days ago by anarcat

Status: accepted → merge_ready

I scheduled deletion with hetzner for now + 2days:

Please note that this server will be cancelled on 28/05/2020 and all data will be deleted.

Confirmation* I have read and understood the above message. I confirm that I want to cancel my server. The cancellation will take effect on 28/05/2020.

i'll keep this open until the server is canceled, but this is all but done.

comment:22 Changed 8 days ago by anarcat

hum. the first wipe didn't automatically exit, so it probably hung there for a few hours. i hit "enter" and it started the second round. eta 45 minutes to apocalypse.

comment:23 Changed 8 days ago by anarcat

hum, okay... it failed with an error, and when i tried to open a new script to test, everything collapsed in a flaming heap. now the server doesn't respond to pings, so hopefully it's really dead now.

tomorrow hetzner should retire it completely and this will be done.

comment:24 Changed 3 days ago by anarcat

Resolution: fixed
Status: merge_ready → closed

kvm4 gone from hetzner too. all done here.

Note: See TracTickets for help on using tickets.