Opened 14 months ago

Closed 9 months ago

Last modified 8 months ago

#31686 closed project (fixed)

retire textile

Reported by: anarcat
Owned by: anarcat
Priority: High
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

textile is one of the first machines in the KVM* series. weasel proposed we move all its VMs into the new FSN cluster and retire the box to start saving some money, and eventually grow the cluster.

Child Tickets

Ticket  Status  Owner    Summary                                 Component
#31676  closed  weasel   decommission togashii                   Internal Services/Tor Sysadmin Team
#31700  closed  anarcat  decomission jabber server               Internal Services/Service - jabber
#32281  closed  anarcat  set up new IRC box to replace iranicum  Internal Services/Tor Sysadmin Team

Change History (22)

comment:1 Changed 14 months ago by anarcat

i'm all for making use of the new cluster, but I'm thinking that we should also keep space on the fsn* cluster to alleviate the load on kvm4, which is suffering a bit.

i'd also like to move the bacula director off of moly, which i distrust, see also #29974.

Last edited 14 months ago by anarcat

comment:2 Changed 14 months ago by anarcat

one of those machines is chamaemoly, which will be decommissioned altogether and shouldn't be migrated, see #31700.

comment:3 Changed 14 months ago by anarcat

it's actually four (!) machines that are to eventually be retired, so we shouldn't migrate those either:

this leaves us with only those machines to migrate before textile can be decommissioned, yaay!

 	iranicum.torproject.org:
	 - shell/irc box
	saxatile.torproject.org:
	 - www.torproject.org
	 - static content rotation
	weissii.torproject.org:
	 - Windows buildbox
	winklerianum.torproject.org:
	 - Windows buildbox

Last edited 14 months ago by anarcat

comment:4 Changed 13 months ago by weasel

derotated saxatile

comment:5 Changed 13 months ago by weasel

windows VMs moved to kvm4. (winklerianum is not defined in virsh, but the data is there)

comment:6 Changed 13 months ago by anarcat

saxatile removed from ldap, disks queued for removal in 7 days, removed from puppet, nothing left in DNS, entry removed from spreadsheet, nagios, tor-passwords, and scheduled backup deletion in 30 days.

saxatile is done.

comment:7 Changed 9 months ago by anarcat

Owner: changed from tpa to anarcat
Priority: changed from Medium to High
Status: changed from new to assigned

i'll fire this up next week and retire textile so we can loop in another gnt node already.

comment:8 Changed 9 months ago by anarcat

Status: changed from assigned to accepted

starting work on this, will document a migration procedure as i go along.

comment:9 Changed 9 months ago by anarcat

i've disconnected the SVN and chiwui shutdown from this procedure because I don't want to wait for those projects to complete before decommissioning textile.

chiwui has been imported on fsn-node-03, but not converted yet. still need to do some functional tests and DNS changes before the switchover. documentation in https://help.torproject.org/tsa/howto/ganeti/#Importing_external_instances

Last edited 9 months ago by anarcat

comment:10 Changed 9 months ago by anarcat

i've done an import of gayi as well and improved the procedure to support the multiple disks in use there as well as the swap initialization. i've also reversed the sync path so we don't have to trust the old node.

unfortunately, fsn-node-03 has shown (HDD) disk problems which is (obviously) slowing down this work. they replaced the drive but the problem came up again, so i'll put this on hold for now.

the base images are all stored in /srv for now, a logical volume created for that purpose. i'll wait for news from hetzner before going any further.

worst case we set up a second ganeti node or just use the SSDs, which haven't shown signs of problems so far.

comment:11 Changed 9 months ago by anarcat

opened #33098 to follow up on the disk issue, which is blocking this deployment.

comment:12 Changed 9 months ago by anarcat

gayi migrated to gnt-fsn, DNS records updated. next step is gayi decom.

comment:13 Changed 9 months ago by anarcat

gayi decom procedure done:

  1. N/A
  2. shutdown
  3. undefined
  4. scheduled removal in 7 days: echo 'rm -r /srv/vmstore/gayi.torproject.org/' | at now + 7 days
  5. N/A: gayi remains for now, because it's still on gnt-fsn
  6. N/A
  7. N/A
  8. N/A
  9. N/A
  10. move to gnt-fsn in the spreadsheet
  11. N/A
  12. N/A
  13. N/A
  14. N/A
  15. N/A

comment:14 Changed 9 months ago by anarcat

chiwui specs:

  • 20G, 4G swap
  • 2G RAM
  • 2CPU
  • SWAP_UUID=484b554e-fb17-4330-94e6-9a3fa3f8e1ed
  • NEW_IP=116.202.120.176

add:mac=00:66:37:24:f1:63,ip=116.202.120.177,mode=openvswitch,link=br0,network=gnt-fsn

There are two old IP addresses, strangely: 138.201.14.212 and 138.201.14.213, respectively chiwui2 and chiwui4 in DNS, not sure why that is.

i migrated a first version of the machine over and things still seem to work, although it's hard to tell if TBB is pinging the old server or the new (probably the former, unfortunately). will start the final migration now.

correct interfaces file:

auto eth0
iface eth0 inet static
    address 116.202.120.176/27
    gateway 116.202.120.161
iface eth0 inet6 static
    accept_ra 0
    address 2a01:4f8:fff0:4f:266:37ff:fe5a:8583/64
    gateway 2a01:4f8:fff0:4f::1

auto eth1
iface eth1 inet static
    address 116.202.120.177/32
iface eth1 inet6 static
    address 2a01:4f8:fff0:4f:266:37ff:fe24:f163/64
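
The eth1 IPv6 address in this snippet appears to be the modified EUI-64 form of the MAC address given in the add: line (00:66:37:24:f1:63 maps to ...:266:37ff:fe24:f163). The following is a minimal Python sketch of that derivation, not the procedure actually used here; the function name eui64_address is made up for illustration, and the /64 prefix is taken from the snippet.

```python
import ipaddress


def eui64_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Build the modified EUI-64 IPv6 address for a MAC inside a /64 prefix."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    # Flip the universal/local bit of the first octet.
    octets[0] ^= 0x02
    # Insert ff:fe between the two halves of the MAC to get 64 bits.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("modified EUI-64 needs a /64 prefix")
    return network.network_address + int.from_bytes(interface_id, "big")


# MAC from the add: line, prefix from the interfaces snippet.
print(eui64_address("2a01:4f8:fff0:4f::/64", "00:66:37:24:f1:63"))
# 2a01:4f8:fff0:4f:266:37ff:fe24:f163
```

If this reading is right, it would also explain why the IPv6 addresses change when an instance is recreated with new MAC addresses.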

the two IP addresses are necessary for check to operate, because there are two services on port 80 (the normal webserver and tordnsel). the latter also requires IP changes in /srv, so the final run should grep /srv as well as /etc for IP addresses.
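
A search like that can be sketched in Python as follows. This is only an illustration, assuming the addresses to look for are the two old ones mentioned earlier (138.201.14.212 and 138.201.14.213); find_references is a hypothetical helper, and a plain grep -r does the same job.

```python
import os


def find_references(roots, needles):
    """Yield (path, line_number, line) for every line containing any needle."""
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Skip symlinks, sockets, FIFOs and other non-regular files.
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                try:
                    with open(path, "r", errors="replace") as handle:
                        for number, line in enumerate(handle, start=1):
                            if any(needle in line for needle in needles):
                                yield path, number, line.rstrip("\n")
                except OSError:
                    # Unreadable file: ignore it and keep scanning.
                    continue
```

Calling find_references(["/etc", "/srv"], ["138.201.14.212", "138.201.14.213"]) would list every file and line that still mentions one of the old addresses.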

we're at step 11 for chiwui.

Last edited 9 months ago by anarcat

comment:15 Changed 9 months ago by anarcat

instance was recreated, new interfaces snippet:

auto eth0
iface eth0 inet static
    address 116.202.120.176/27
    gateway 116.202.120.161
iface eth0 inet6 static
    accept_ra 0
    address 2a01:4f8:fff0:4f:266:37ff:fe69:3bda/64
    gateway 2a01:4f8:fff0:4f::1

auto eth1
iface eth1 inet static
    address 116.202.120.177/27
iface eth1 inet6 static
    address 2a01:4f8:fff0:4f:266:37ff:fea4:3cf3/64

  1. changed IP in LDAP
  2. changed the firewall rules in puppet and deployed
  3. changed IP in DNS

remaining work:

  • change IP in nagios
  • change reverse DNS

in the meantime, the TTL is hurting us: there are 1h records on that thing, so everything is timing out.... https://check.torproject.org/ works in a regular browser when hacking at my /etc/hosts however, so presumably *that* part will work once the tor network catches up (through DNS propagation)

similarly, tordnsel seems to work:

$ dig +short 62.62.129.185.80.4.3.2.1.ip-port.exitlist.torproject.org @116.202.120.176
127.0.0.2
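
For reference, the query name above seems to follow the usual ip-port exit list layout: the exit relay address with its octets reversed, the destination port, the destination address with its octets reversed, then the zone. The sketch below rebuilds the name under that assumption; exitlist_query is a hypothetical helper, and reading 185.129.62.62 as the exit and 1.2.3.4 as a placeholder destination is an interpretation of the command, not something the ticket states.

```python
import ipaddress


def exitlist_query(exit_ip, dest_port, dest_ip,
                   zone="ip-port.exitlist.torproject.org"):
    """Build an ip-port exit list DNS name (assumed layout, see lead-in)."""

    def reversed_octets(address):
        # Validate as IPv4, then reverse the dotted-quad order.
        octets = str(ipaddress.IPv4Address(address)).split(".")
        return ".".join(reversed(octets))

    return f"{reversed_octets(exit_ip)}.{dest_port}.{reversed_octets(dest_ip)}.{zone}"


print(exitlist_query("185.129.62.62", 80, "1.2.3.4"))
# 62.62.129.185.80.4.3.2.1.ip-port.exitlist.torproject.org
```

An answer of 127.0.0.2 is, as far as I know, the positive reply, meaning the queried address is listed as an exit for that destination.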

so let's wait. i flipped DNS at around Tue Feb 4 14:49:00 2020 -0500 (commit time, pushed shortly after), so things should converge in at most ~32 minutes now.
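
The estimate is simple TTL arithmetic: caches that fetched the old record just before the flip keep it for at most one TTL. A rough Python sketch follows, assuming the 1h TTL and the 14:49 -0500 flip time quoted above; the 15:17 "now" is inferred from the ~32 minutes figure and is not stated in the ticket, and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5))  # the -0500 offset quoted in the comment


def convergence_deadline(flipped_at, ttl_seconds):
    """Latest time a cache can still serve the pre-flip record."""
    return flipped_at + timedelta(seconds=ttl_seconds)


def remaining(flipped_at, ttl_seconds, now):
    """Time left until every cache must have expired the old record."""
    left = convergence_deadline(flipped_at, ttl_seconds) - now
    return max(left, timedelta(0))


flip = datetime(2020, 2, 4, 14, 49, tzinfo=EST)
print(convergence_deadline(flip, 3600).isoformat())
# 2020-02-04T15:49:00-05:00

# Assumed time of writing (inferred, not from the ticket).
print(remaining(flip, 3600, datetime(2020, 2, 4, 15, 17, tzinfo=EST)))
# 0:32:00
```

Since the change was pushed a little after the commit time, the real deadline would be slightly later than this.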

comment:16 Changed 9 months ago by anarcat

changed IP in nagios and changed reverse DNS records (i had forgotten gayi too!).

next step is to decom chiwui completely, then we wait 7 days until we decom textile!

comment:17 Changed 9 months ago by anarcat

chiwui host retirement procedure:

  1. N/A
  2. done
  3. "done" (only removed from autostart but kept the xml file in case we want to restore this in a pinch)
  4. done:
    root@textile:/etc/libvirt# echo 'rm -r /srv/vmstore/chiwui.torproject.org/' | at now + 7 days
    warning: commands will be executed using /bin/sh
    job 6 at Tue Feb 11 20:31:00 2020
    
  5. N/A
  6. N/A
  7. N/A
  8. N/A
  9. N/A
  10. moved to the gnt-fsn cluster in the spreadsheet
  11. N/A
  12. N/A
  13. N/A
  14. N/A
  15. N/A

chiwui can be considered fully migrated now. next step is to decommission textile, on february 11th.

comment:18 Changed 9 months ago by anarcat

posted a notification about the change to the mailing list here : https://lists.torproject.org/pipermail/tor-project/2020-February/002696.html

comment:19 Changed 9 months ago by anarcat

textile decom procedure has started. i'm at step 4, the first three steps being a noop.

i kicked sdb off the RAID array and i'm running the badblocks clear over it. first estimate is the first wipe will take 5 hours.

comment:20 Changed 9 months ago by anarcat

it was still rewriting the disk this morning:

[...]
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing:  60.29% done, 22:53:22 elapsed. (0/0/0 errors)

23 hours! that seems unreasonable. i have interrupted that process, installed nwipe, and changed the procedure to use nwipe instead of badblocks.

we're now at the final wiping stage. i have a screen open with nwipe running without a GUI. last estimates I saw (in the GUI) were about 6 hours for one drive, so we might expect 12 hours from now on for a complete wipe.

once the wipe completes, i'll tell hetzner to decom the server.

now moving on with the rest of the procedure.

comment:21 Changed 9 months ago by anarcat

Resolution: fixed
Status: changed from accepted to closed

decom procedure checklist:

  1. N/A
  2. N/A
  3. N/A
  4. in progress
  5. removed this chunk from LDAP:
    330 host=textile,ou=hosts,dc=torproject,dc=org
    objectClass: top
    objectClass: debianServer
    host: textile
    hostname: textile.torproject.org
    architecture: amd64
    admin: torproject-admin@torproject.org
    sshRSAHostKey: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDHfKOngWQNU5i6b2wNMM1Twstj7VXUM1Jg/Ud8D5L7w5VzJV5hQeDebyqGuomtjDWG+YCahZwa9ezIJxU+F2uV7DygYglzVjL3WdXoY9BJUdykvQAeQbuzO8jkIPqHXAhtl6IdhcctvIBjWAdlagbNXSWYxPTwCYoPdxmWfcCvj0871jDykrOvhkh+woxyKir1QTnR0Uu+c2E/UROanfSqZMKNfGk26nySEtRM9/FwkPvjr4hD5s3nHE8RR5SQqGqbBN1I7n2tyZHrU0Q12dI7XxOuWjCmnalWrsau12jpYJssoDt3i5zjOW3BfFkwfY9Fsmzigqh9mgTnHPT6KhgJ root@textile
    sshRSAHostKey: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKdcJkveK7SKIMgQ6KSGvjsb4K17bW2Jsq8kaFmEuRna root@textile
    description: KVM host
    access: restricted
    distribution: Debian
    ipHostNumber: 138.201.66.71
    ipHostNumber: 2a01:4f8:172:1b46::2
    l: Falkenstein, Saxony, Germany
    purpose: KVM host
    purpose: [[kvm1.torproject.org]]
    
  6. done, removed from auto-dns and domains
  7. done:
    root@pauli:/home/anarcat# host=textile && puppet node clean $host.torproject.org && puppet node deactivate $host.torproject.org
    Notice: Revoked certificate with serial 69
    Notice: Removing file Puppet::SSL::Certificate textile.torproject.org at '/var/lib/puppet/ssl/ca/signed/textile.torproject.org.pem'
    textile.torproject.org
    Submitted 'deactivate node' for textile.torproject.org with UUID fcf5579f-1369-45a3-b230-76382aa1f634
    
  8. removed textile from a bunch of places in puppet (ipsec, hiera, hosters.yaml, tor-install-VM and virt.pp) see ccee6856 in tor-puppet.git. the grep pattern is actually grep -r -e 138.201.66.71 -e 2a01:4f8:172:1b46::2 -e 138.201.14. -e 2a01:4f8:172:1b46:0:abba: -e 172.30.131.
  9. cleaned from tor-passwords/hosts and hosts-extra-info
  10. deleted the textile worksheet in the spreadsheet, whoohoo! (it was empty)
  11. removed from nagios
  12. scheduled backup removal:
    root@bungei:/srv/backups/bacula# echo rm -rf textile.torproject.org-OLD/ | at now + 30 days
    warning: commands will be executed using /bin/sh
    job 18 at Fri Mar 13 15:51:00 2020
    
  13. N/A
  14. "canceled" the server with hetzner, "The earliest possible cancellation date is 17 February 2020."
  15. not a mail host
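
The dates in this checklist can be cross-checked with a little date arithmetic. The sketch below assumes the `at` job in step 12 was submitted on Wed Feb 12 2020 at 15:51, which is inferred by working backwards from the job output rather than stated anywhere; at_run_time is an illustrative helper, not part of `at`.

```python
from datetime import datetime, timedelta

# Mimics the layout of the "job N at ..." line printed by at.
AT_FORMAT = "%a %b %d %H:%M:%S %Y"


def at_run_time(submitted: datetime, days: int) -> datetime:
    """When a job queued with `at now + N days` is expected to run."""
    return submitted + timedelta(days=days)


# Assumption: submission time inferred from the job output in step 12.
submitted = datetime(2020, 2, 12, 15, 51)

# Backup removal, 30 days later (2020 is a leap year, so February has 29 days).
print(at_run_time(submitted, 30).strftime(AT_FORMAT))
# Fri Mar 13 15:51:00 2020

# Five days later is the earliest cancellation date quoted in step 14.
print(at_run_time(submitted, 5).date())
# 2020-02-17
```

The same helper reproduces the chiwui job earlier in the ticket: a submission on Feb 4 at 20:31 plus 7 days gives Tue Feb 11 20:31:00 2020.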

and we're all done, assuming hetzner does close down the server in 5 days. whoohoo!

comment:22 Changed 8 months ago by anarcat

i'll note that textile is gone from the robot control panel, which confirms its final retirement.
