Opened 5 weeks ago

Last modified 3 days ago

#32802 new project

decommission kvm4

Reported by: anarcat
Owned by: tpa
Priority: High
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Major
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

kvm4 is getting fairly old. it was set up in 2015 and is showing signs of old age. for example, today it freaked us all out by not returning after a reboot right before the holidays (#32801). considering how critical that server is (email, puppet, ldap, jenkins, dns, web mirror, all the windows buildboxes!) we should start considering a decommissioning process.

at the very least, we need to get eugeni the heck out of there.

we have budget to provision another ganeti cluster, so let's use it to replace this, and hopefully more. the existing cluster has already taken more than its share by taking machines from both kvm1/textile and moly, so it's time we provision more hardware for this.

Child Tickets

Ticket  Status  Owner  Summary                                Component
#32803  new     tpa    migrate eugeni to the gnt-fsn cluster  Internal Services/Tor Sysadmin Team

Change History (3)

comment:1 Changed 5 weeks ago by anarcat

here's the disaster recovery plan i made up on the fly in #32801, which is relevant to the discussion here:

According to the Nextcloud spreadsheet (since LDAP is down), [machines running on kvm4] includes:

host            service               impact                       mitigation
alberti         LDAP, db.tpo          critical, no passwd change   read-only copies everywhere
build-x86-09    buildbox              redundant                    N/A
eugeni          incoming mail, lists  critical, total outage       peek at tor-puppet/modules/postfix/files/virtual and email people directly
meronense       metrics.tpo           critical, total outage       ?
neriniflorum    DNS                   redundant, higher TTFB?      possible to remove from rotation
oo-hetzner-03   onionoo               redundant                    ?
pauli           puppet                major, no config management  use cumin, local git copies
rouyi           jenkins               critical, total outage       ?
web-hetzner-01  web mirror            redundant, no effect?        removed from rotation automatically
weissi          build box             no windows builds            N/A
woronowii       build box             no windows builds            N/A
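
To make the gaps in that table easier to spot, here is a minimal sketch that encodes the rows as data and lists the hosts whose impact is critical but whose mitigation is still "?". The `Host` class and the `unmitigated_critical` helper are hypothetical names made up for this illustration; only the row contents come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """One row of the kvm4 inventory table (hypothetical structure)."""
    name: str
    service: str
    impact: str
    mitigation: str


# Rows copied from the table above; "?" means no mitigation was identified.
KVM4_HOSTS = [
    Host("alberti", "LDAP, db.tpo", "critical, no passwd change", "read-only copies everywhere"),
    Host("build-x86-09", "buildbox", "redundant", "N/A"),
    Host("eugeni", "incoming mail, lists", "critical, total outage",
         "peek at tor-puppet/modules/postfix/files/virtual and email people directly"),
    Host("meronense", "metrics.tpo", "critical, total outage", "?"),
    Host("neriniflorum", "DNS", "redundant, higher TTFB?", "possible to remove from rotation"),
    Host("oo-hetzner-03", "onionoo", "redundant", "?"),
    Host("pauli", "puppet", "major, no config management", "use cumin, local git copies"),
    Host("rouyi", "jenkins", "critical, total outage", "?"),
    Host("web-hetzner-01", "web mirror", "redundant, no effect?", "removed from rotation automatically"),
    Host("weissi", "build box", "no windows builds", "N/A"),
    Host("woronowii", "build box", "no windows builds", "N/A"),
]


def unmitigated_critical(hosts):
    """Return names of hosts marked critical whose mitigation is unknown ("?")."""
    return [h.name for h in hosts
            if h.impact.startswith("critical") and h.mitigation == "?"]


if __name__ == "__main__":
    print(unmitigated_critical(KVM4_HOSTS))
```

With the rows as listed, this flags meronense and rouyi as the critical services with no fallback identified yet.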

I'll note that it seems both windows build boxes are on the same machine so even if jenkins *would* be able to dispatch builds, we wouldn't be able to do those...

Our disaster recovery plan so far is to wait for that rescue to succeed, which might take up to 24h but hopefully less.

If that fails, I would suggest the following plan:

  1. recover eugeni, pauli, alberti from backups on gnt-fsn or elsewhere (we need those three to build new machines)
  2. build a new ganeti cluster (because we can't recover all of this on gnt-fsn)
  3. restore remaining machines on the new cluster
  4. decommission kvm4 officially

This could take a few days of work. :(

Out of that, I would outline the following plan:

  1. in the short term: migrate eugeni, pauli and alberti to a HA cluster, probably gnt-fsn (yes, that means it will be over-allocated even more)
  2. in parallel or after (january): add a node or two to the ganeti cluster
  3. migrate meronense, neriniflorum, oo-hetzner-03, and rouyi to the new cluster

This would leave the following boxes on kvm4, with the following rationale:

  • build-x86-09 - highly redundant, not urgent
  • web-hetzner-01 - one web node already present in the gnt-fsn cluster, moving this will not bring us more redundancy
  • weissi - hard to migrate
  • woronowii - hard to migrate
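
As a sanity check on the plan above, the following sketch verifies that every host from the kvm4 table is assigned to exactly one step or to the "stays for now" list. The `PLAN` dictionary keys and the `check_plan` helper are hypothetical names for this illustration; the host groupings are taken from the plan and the list above.

```python
from collections import Counter

# Hosts currently on kvm4, per the table earlier in this comment.
ALL_HOSTS = {
    "alberti", "build-x86-09", "eugeni", "meronense", "neriniflorum",
    "oo-hetzner-03", "pauli", "rouyi", "web-hetzner-01", "weissi", "woronowii",
}

# Step 2 (adding nodes to the ganeti cluster) moves no hosts, so it is omitted.
PLAN = {
    "step 1: migrate to gnt-fsn": ["eugeni", "pauli", "alberti"],
    "step 3: migrate to the new cluster": [
        "meronense", "neriniflorum", "oo-hetzner-03", "rouyi",
    ],
    "stays on kvm4 for now": [
        "build-x86-09", "web-hetzner-01", "weissi", "woronowii",
    ],
}


def check_plan(all_hosts, plan):
    """Return (unassigned, duplicated, unknown) host names, each sorted."""
    seen = Counter(name for hosts in plan.values() for name in hosts)
    assigned = set(seen)
    unassigned = sorted(all_hosts - assigned)
    duplicated = sorted(name for name, count in seen.items() if count > 1)
    unknown = sorted(assigned - all_hosts)
    return unassigned, duplicated, unknown


if __name__ == "__main__":
    unassigned, duplicated, unknown = check_plan(ALL_HOSTS, PLAN)
    print("unassigned:", unassigned)
    print("duplicated:", duplicated)
    print("unknown:", unknown)
```

All three lists come back empty for the plan as written, meaning the eleven hosts are each covered once.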

At that point we'd have the choice to migrate the two windows VMs (ugh) and the build box to the ganeti cluster, and we'd probably decom web-hetzner-01 or move it to kvm5 or some other host, then decom kvm4.

How does that sound for a plan?

Tickets would need to be created for each one of those tasks.

comment:2 Changed 5 weeks ago by anarcat

i will also note that meronense has been seeing disk errors for a while now, in #32692. might be another good indication something is wrong with this box (although mdadm thinks everything is fine).

comment:3 Changed 3 days ago by anarcat

we don't have docs on how to move instances just yet, but i added a section in our ganeti manual that should be filled in when we do. for now it has references to external manuals that could be used:

https://help.torproject.org/tsa/howto/ganeti/#index14h2
