Opened 6 months ago

Last modified 6 months ago

#33477 assigned defect

automate host retirement / decommissioning procedure

we're removing a surprising number of servers: in the last few months, we've retired about half a dozen. that procedure is currently entirely manual and quite error-prone, especially because of at(1) job errors or omitted steps. backup removal can be forgotten or typo'd, and we've forgotten to remove entries in the spreadsheet or nagios a few times.

we should automate this process. this also has the added benefit of simplifying the migration process: we have a ton of servers that need to move from libvirt into the ganeti cluster, and part of that process involves retiring the old copy of the server.

right now the documentation is in the retire-a-host page, but i somehow got into the habit of calling tickets "decommission host FOO". so we need to decide between "retire" and "decommission" as a naming convention before we start writing code.

the two contestants are:

  • retire, retirement, retiring, retired
  • decommission, decommissioning procedure, decommissioning, decommissioned

a quick IRC survey indicates people favor the former family because it's shorter and more familiar.

decommission was seen as a "nice word". it's also less ambiguous: "retire" can also refer to a user, and it could imply the host sticks around for a while, rants about the pain in its lower back, and babbles nonsense when guests are around, just to embarrass us. the problem with decommission is that I can't spell it to save my life. it also doesn't have a short noun for the action (like "retirement"), which sometimes makes it awkward to refer to.

so we favor converging on "retire" for now.

i have (somewhat by accident) started this work while researching fabric. i was trying to figure out how easy it would be to replace a few steps in the procedure with fabric, and it turns out it's quite easy.

so this might happen sooner than i expected.
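
as a rough illustration of the idea (not the actual code), here is a minimal sketch of a checklist runner that refuses to skip steps. the step names come from the problems listed above; the commands are placeholders, and `STEPS`, `retire_host` and the `run` parameter are hypothetical names made up for this example. nothing here imports fabric: `run` is any callable that executes a command string, so something like fabric's `Connection(...).run` could presumably be passed in.

```python
from typing import Callable, List, Tuple

# Hypothetical checklist. The step names mirror the manual steps mentioned in
# this ticket; the commands are harmless placeholders, NOT the real procedure.
STEPS: List[Tuple[str, str]] = [
    ("schedule shutdown with at(1)", "echo placeholder: schedule shutdown of {host}"),
    ("remove backups", "echo placeholder: remove backups of {host}"),
    ("remove from nagios", "echo placeholder: remove {host} from nagios"),
    ("remove from spreadsheet", "echo placeholder: remove {host} from spreadsheet"),
]


def retire_host(host: str, run: Callable[[str], object], dry_run: bool = True) -> List[str]:
    """Go through every step in order and return the commands handled.

    In dry-run mode (the default) commands are only printed. Otherwise each
    command is handed to `run`; an exception raised there stops the loop, so
    a failed step cannot be silently followed by the next one.
    """
    # Cheap guard against typo'd host names ending up in commands.
    if not host or any(c.isspace() for c in host):
        raise ValueError(f"refusing suspicious host name: {host!r}")

    mode = "dry-run" if dry_run else "run"
    handled = []
    for name, template in STEPS:
        command = template.format(host=host)
        print(f"[{mode}] {name}: {command}")
        if not dry_run:
            run(command)
        handled.append(command)
    return handled
```

calling `retire_host("foo.example.org", run=print)` only prints the plan; passing `dry_run=False` hands each command to `run`. keeping the steps in one ordered list is the point of the sketch: a step can't be forgotten without deleting it from the list.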

