Opened 18 months ago

Closed 4 months ago

Last modified 3 months ago

#29399 closed task (fixed)

Retire host and services for tordnsel and check (chiwui)

Reported by: ln5
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords: tpa-roadmap-april
Cc: metrics-team, gaba
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Metrics team will re-implement the tordnsel and check services and have them deployment ready by end of March 2019. Once up on a new host, retire chiwui.tpo.

Child Tickets

Ticket | Status | Owner | Summary | Component
#32553 | closed | anarcat | Group details for exit scanner | Internal Services/Tor Sysadmin Team
#32999 | closed | anarcat | Add irl to the "check" and "tordnsel" LDAP groups | Internal Services/Tor Sysadmin Team
#33362 | closed | anarcat | Please provision a VM for the new exit scanner | Internal Services/Tor Sysadmin Team
#33510 | closed | irl | Send service change announcement for check/tordnsel | Metrics/Exit Scanner
#33569 | closed | anarcat | Point check.torproject.org at check-01 | Internal Services/Tor Sysadmin Team

Change History (27)

comment:1 Changed 16 months ago by weasel

chiwui is no longer able to successfully communicate with our backup infrastructure since it's running ancient Debian.

Karsten, irl, what's the status of your reimplementation?

comment:2 Changed 16 months ago by ln5

History, for the record:

  • Metrics team asked for and got an extension until mid April.
  • They couldn't get the new implementation done in time for that deadline and looked into porting TorDNSEL to a more recent haskell. That failed.
  • Vegas team meeting early April said that "letting it die is not a good option"

comment:3 Changed 16 months ago by anarcat

Okay, so what's the plan then?

Supporting chiwui is going to get harder and harder. We can probably afford to do so a little longer, but things are going to progressively break as we go along. Now it's backups: there *might* be a way to backport things to chiwui to make them work, but it will be a waste of time if we get this fixed otherwise later anyways. But other things might break in the future as well...

For now, I've "acknowledged" the backups warnings in Nagios for this host, which means we will not fix backups for this host in the short term. I assume this is okay-ish: the older backups (from april 23rd) are still there and from what I understand the contents on that host are not changing (it's the problem we're trying to solve!).

Could the problem be split in two? Maybe "check" can be upgraded and not the other? Or are the two services as critical and inter-dependent?

For the record, someone mentioned "Docker" as a solution here, and I somehow disagree: it would certainly shift the burden of maintaining the jessie box away from us (TPA) but we would *still* have to maintain *some* environment with the older Haskell, which is the problem we're trying to solve in the first place.

It would allow us to upgrade the box and resume backups, so it's a possible alternative in the mid term, but it just shifts the upgrade problem under a container veil. I'm worried it would make us just forget about it and create another liability.

comment:4 Changed 16 months ago by irl

Cc: metrics-team gaba added

I am working on the reimplementation, with completion expected before the end of LTS.

We are doing the reimplementation properly instead of rushing it, which will avoid us having to panic again later.

Gaba indicated that I should not give the reimplementation high priority for now, as we may seek funding for it, and should do the work once it is funded.

There is no critical data on the machine as far as I know, if it dies then I doubt we would be able to bring it back from backups anyway due to not knowing how it works, so lack of backups is not really an issue.

We can't split the code as it is currently as the parts communicate with each other via the filesystem.

Docker sounds like an awful idea for this case.

comment:5 Changed 12 months ago by gaba

To clarify: we have this in the metrics team roadmap. We will try to discuss a more concrete plan and ETA and let you know.

comment:6 Changed 12 months ago by karsten

This is in our current roadmap. We're going to start in October and expect to be done by end of December.

comment:7 Changed 12 months ago by anarcat

awesome karsten, thanks! should we create a separate ticket for that or assign this one to someone or something?

comment:8 Changed 12 months ago by karsten

We're tracking work related to retiring the host in #29650 and its children.

comment:9 Changed 12 months ago by anarcat

awwwweeeesssooooooome! :)

comment:10 Changed 11 months ago by anarcat

Parent ID: set to #31686

comment:11 Changed 11 months ago by anarcat

Summary: changed from "Retire host and services for tordnsel and check" to "Retire host and services for tordnsel and check (chiwui)"

comment:12 Changed 8 months ago by irl

Summary as we come to the end of the year:

  • We have an exitmap based scanner that produces comparable results to the current exit scanner.
  • We can (untested) run a cron job to fetch the output of this scanner to power check.tpo.
  • We do not currently have a replacement for the DNSBL portion of the service, which will block this for now.
  • In the new year, one of the first things I'd like to do is deploy the new exit scanner software to a TPA host. I will file a new ticket to request that host separately.

comment:13 Changed 8 months ago by anarcat

whoohoo! thanks for the updates!

> We do not currently have a replacement for the DNSBL portion of the service, which will block this for now.

What's the plan for that part?

comment:14 Changed 7 months ago by anarcat

we have a hard deadline of june 2020 here, at which point this host *will* be shut down, along with the services hosted on it.

comment:15 Changed 6 months ago by anarcat

Parent ID: #31686 deleted

disconnecting from the textile shutdown (#31686) because i want to turn off that box sooner.

i'm still waiting on a plan here - the hard deadline remains.

comment:16 Changed 6 months ago by gaba

irl is working on it right now. The work got delayed and the new date to finish this is mid February.

comment:17 Changed 6 months ago by anarcat

okay, thanks for the new estimate. hopefully this will be the last? :)

but i must warn again that we will not be able to support this machine past this summer, and it will be forcibly retired, with all the trouble that implies. :/

comment:18 Changed 5 months ago by anarcat

Owner: changed from tpa to irl
Status: changed from new to assigned

this is finally picking up speed, or rather, about to cross the finish line! a new host has been created for the service (check-01, #33362) and has been setup by irl and anarcat.

the next step here is to announce chiwui.torproject.org's retirement, which irl will handle.

thanks!

comment:19 Changed 5 months ago by irl

We plan to turn off chiwui on the 1st April 2020.

comment:20 Changed 5 months ago by anarcat

can we do april 2nd instead? i try to avoid symbolic dates like this, in case someone doesn't believe us because "it's april fool's day"...

comment:21 Changed 4 months ago by anarcat

should we do this now? is chiwui finally ready for decommissioning?

comment:22 Changed 4 months ago by irl

Please begin the decommissioning process.

comment:23 Changed 4 months ago by anarcat

Owner: changed from irl to anarcat
Status: changed from assigned to accepted

great, will start the retirement process now.

comment:24 Changed 4 months ago by anarcat

following https://help.torproject.org/tsa/howto/retire-a-host/

  1. announced here and elsewhere numerous times before
  2. removed from nagios
  3. stopped chiwui in the ganeti cluster

Will wait a few days to see if things blow up. Service is now stopped, and the rest of the retirement process will follow soon.

comment:25 Changed 4 months ago by anarcat

Resolution: fixed
Status: changed from accepted to closed

step 4 done

data removal scheduled everywhere:

anarcat@curie:tsa-misc(master)$ ./retire -v -H chiwui.torproject.org retire-all --parent-host=fsn-node-01.torproject.org
starting tasks at 2020-04-09 16:39:51.630866
checking for ganeti master on host fsn-node-01.torproject.org
ganeti node detected with master fsn-node-01.torproject.org
checking on fsn-node-01.torproject.org if instance chiwui.torproject.org is running
instance chiwui.torproject.org not running, no stop required
scheduling chiwui.torproject.org instance removal on host fsn-node-01.torproject.org
scheduling gnt-instance remove chiwui.torproject.org to run on fsn-node-01.torproject.org in 7 days
warning: commands will be executed using /bin/sh
job 10 at Thu Apr 16 20:39:00 2020
scheduling chiwui.torproject.org backup disks removal on host bungei.torproject.org
checking for path "/srv/backups/bacula/chiwui.torproject.org/" on bungei.torproject.org
scheduling rm -rf "/srv/backups/bacula/chiwui.torproject.org/" to run on bungei.torproject.org in 30 days
warning: commands will be executed using /bin/sh
job 24 at Sat May  9 20:40:00 2020
Error: The certificate retrieved from the master does not match the agent's private key. Did you forget to run as root?
Certificate fingerprint: 59:C4:A7:B7:3C:DD:A2:04:61:92:5B:35:97:03:66:64:1D:3C:55:85:DF:2E:40:BA:2B:3D:E2:A1:D2:11:2F:F5
To fix this, remove the certificate from both the master and the agent and then start a puppet run, which will automatically regenerate a certificate.
On the master:
  puppet cert clean pauli.torproject.org
On the agent:
  1a. On most platforms: find /home/anarcat/.puppet/etc/ssl -name pauli.torproject.org.pem -delete
  1b. On Windows: del "\home\anarcat\.puppet\etc\ssl\certs\pauli.torproject.org.pem" /f
  2. puppet agent -t

Error: Try 'puppet help node clean' for usage
failed to revoke instance pauli.torproject.org on host chiwui.torproject.org: Encountered a bad command exit code!

Command: 'puppet node clean chiwui.torproject.org'

Exit code: 1

Stdout: already printed

Stderr: already printed


completed tasks, elasped: 0:00:12.384885 (user 2.74 system 0.05 chlduser 0.0 chldsystem 0.0 RSS 34.9 MB)
anarcat@curie:tsa-misc(master)$ ./retire -v -H chiwui.torproject.org retire-all --backup_host='' 
starting tasks at 2020-04-09 16:41:08.346772
No idea what '--backup_host' is!
completed tasks, elasped: 0:00:00.002826 (user 0.21 system 0.03 chlduser 0.0 chldsystem 0.0 RSS 30.8 MB)
[1]anarcat@curie:tsa-misc(master)$ ./retire -v -H chiwui.torproject.org retire-all --backup-host='' 
starting tasks at 2020-04-09 16:41:13.611470
not wiping instance chiwui.torproject.org data: no parent host
Notice: Revoked certificate with serial 23
Notice: Removing file Puppet::SSL::Certificate chiwui.torproject.org at '/var/lib/puppet/ssl/ca/signed/chiwui.torproject.org.pem'
chiwui.torproject.org
Submitted 'deactivate node' for chiwui.torproject.org with UUID 84ccf106-f275-4f7e-8571-d414a47a4a3d
completed tasks, elasped: 0:00:08.504086 (user 3.09 system 0.05 chlduser 0.0 chldsystem 0.0 RSS 34.2 MB)

note that in the above the puppet run failed because it tried to connect using a normal user. this was worked around in 4d025f3 and reran correctly.
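
for reference, the delayed removals in the log above look like ordinary at(1) jobs. A minimal Python sketch of how such a command line and its expected run date could be derived follows. The helper names `at_command` and `expected_run` are hypothetical, this is not the actual code of the retire script, and the start time assumes the remote hosts run on UTC (which would explain the four-hour offset between the local 16:39 start and the 20:39 job time).

```python
import shlex
from datetime import datetime, timedelta


def at_command(command: str, delay_days: int) -> str:
    """Build a shell pipeline that hands `command` to at(1) for delayed execution."""
    # at(1) reads the job from stdin and accepts time specs like "now + 7 days"
    return f"echo {shlex.quote(command)} | at now + {delay_days} days"


def expected_run(start: datetime, delay_days: int) -> datetime:
    """Date at which a job scheduled at `start` should fire."""
    return start + timedelta(days=delay_days)


# start time from the log above, assumed to be 20:39 on the remote (UTC) hosts
start = datetime(2020, 4, 9, 20, 39)
print(at_command("gnt-instance remove chiwui.torproject.org", 7))
print(expected_run(start, 7))   # instance removal, 7 days later
print(expected_run(start, 30))  # backup removal, 30 days later
```

The two computed dates land on April 16 and May 9, matching the "job 10" and "job 24" confirmation lines in the log.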

step 5

removed this block from LDAP:

269 host=chiwui,ou=hosts,dc=torproject,dc=org
host: chiwui
hostname: chiwui.torproject.org
objectClass: top
objectClass: debianServer
architecture: amd64
access: restricted
admin: torproject-admin@torproject.org
sshRSAHostKey: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDUKfP+b2Isj3UlWmVRAeXpOcyZslJypugDdunLUWXsx2IjzKzhExqkgiDigsv0Fr7SbFKuJSBmZM/q0X6iLXUAuTPDREhubMcQ9iGONvh26H/ocniXpgtbBzzZ8d6sDK/NLupOXHjfBXN/IWhCdwN/JC6lm1qjLAf5BQ7ukVeVKt7gBXXW4rGUkCw+eWLFS1IjKWASm9ubE9t+uVaoYeUP0PSwSrgIrb9hjCsMHBFTOXvSgrX2Nr85ZUetUPvHyo/GPUIdteK8ouMrRe4yJi6rIyMeze2a7ohtEJ2q1IDaE3Jr5BlzIyXeEK+LN1VykiiChde0pGbInzHWzgk8wi3R root@chiwui
sshRSAHostKey: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILDW4yvM1jKFwZpSMHl/+HqPsLA2H58w028TmHQ5Zmqu root@chiwui
distribution: Debian
allowedGroups: check
allowedGroups: tordnsel
purpose: [[check.torproject.org]]
purpose: tordnsel
l: Falkenstein, Saxony, Germany
dnsTTL: 300
ipHostNumber: 116.202.120.176
ipHostNumber: 2a01:4f8:fff0:4f:266:37ff:fe69:3bda
physicalHost: gnt-fsn

step 6

removed the following DNS records:

exitlist		IN	NS	chiwui4
chiwui2			IN	A	116.202.120.177
chiwui4			IN	A	116.202.120.176

or, in other words, this commit in dns/domains.git:

commit f61867cdd2832444c1b3abe0e74a21f6e5e74f05 (HEAD -> master)
Author: Antoine Beaupré <anarcat@debian.org>
Date:   Thu Apr 9 16:49:40 2020 -0400

    retire chiwui (#29399)

diff --git a/torproject.org b/torproject.org
index 8ab0832..241a9ca 100644
--- a/torproject.org
+++ b/torproject.org
@@ -83,7 +83,6 @@ dip                   IN      CNAME   gitlab-02
 donate                 IN      CNAME   crm-ext-01
 staging.donate         IN      CNAME   crm-ext-01
 test.donate            IN      CNAME   crm-ext-01
-exitlist               IN      NS      chiwui4
 exonerator             IN      CNAME   materculae
 gitlab    IN CNAME gitlab-02
 gettor                 IN      CNAME   static
@@ -202,8 +201,6 @@ $INCLUDE "/srv/letsencrypt.torproject.org/var/hook/snippet"
 macppc                 IN      A       50.195.45.81 ;old ip 74.95.122.145
 macx86                 IN      A       50.195.45.82 ;old ip 74.95.122.149
 watsoni                        IN      A       50.195.45.86
-chiwui2                        IN      A       116.202.120.177
-chiwui4                        IN      A       116.202.120.176
 
 ; internal networks
 macrum-priv            IN      A       172.30.133.1
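
a change like this can be double-checked by scanning the zone file for leftover references to the retired host. The Python sketch below does that on a small zone snippet. The function name `leftover_records` and the sample data are illustrative only, and the parsing is deliberately naive: it splits on whitespace and does not understand every zone file construct.

```python
def leftover_records(zone_text: str, host: str) -> list[str]:
    """Return zone lines whose owner name or target still mentions `host`."""
    hits = []
    for line in zone_text.splitlines():
        # drop comments (everything after ';') and surrounding whitespace
        record = line.split(";", 1)[0].strip()
        if not record:
            continue
        fields = record.split()
        # compare the owner name (first field) and the target (last field)
        if fields[0].startswith(host) or fields[-1].startswith(host):
            hits.append(record)
    return hits


# sample data modelled on the records touched by the commit above
before = """\
exitlist   IN NS    chiwui4
exonerator IN CNAME materculae
chiwui2    IN A     116.202.120.177
chiwui4    IN A     116.202.120.176
"""
after = "exonerator IN CNAME materculae\n"

print(leftover_records(before, "chiwui"))
print(leftover_records(after, "chiwui"))
```

Matching on the last field is what catches the exitlist NS record, which points at chiwui4 without carrying the host name in its own owner name.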

removed the following sudo entries:

%check			chiwui=(check)				ALL
%tordnsel		chiwui=(tordnsel)			ALL
%check		chiwui=(root)	/usr/local/sbin/apache2-vhost-update

or, in other words, this commit in puppet:

commit 66a02f3b4361167bfe45bd85361826a0b5076efd (HEAD -> master)
Author: Antoine Beaupré <anarcat@debian.org>
Date:   Thu Apr 9 16:48:41 2020 -0400

    retire chiwui (#29399)

diff --git a/modules/nagios/templates/obsolete-packages-ignore.d-hostspecific.erb b/modules/nagios/templates/obsolete-packages-ignore.d-hostspecific.erb
index 3b727533..60801d2a 100644
--- a/modules/nagios/templates/obsolete-packages-ignore.d-hostspecific.erb
+++ b/modules/nagios/templates/obsolete-packages-ignore.d-hostspecific.erb
@@ -7,7 +7,6 @@ ignore = []
 case @fqdn
 when "alberti.torproject.org" then           ignore << %w{userdir-ldap userdir-ldap-cgi}
 when "moly.torproject.org" then              ignore << %w{megacli}
-when "chiwui.torproject.org" then            ignore << %w{tor prometheus-node-exporter}
 end
 
 ignore.flatten.join("\n")
diff --git a/modules/roles/manifests/check.pp b/modules/roles/manifests/check.pp
deleted file mode 100644
index 51e0fc4c..00000000
--- a/modules/roles/manifests/check.pp
+++ /dev/null
@@ -1,35 +0,0 @@
-# deprecated, to be replaced by roles::check_rewrite
-class roles::check {
-       include apache2
-       include apache2::ssl
-       ssl::service { 'check.torproject.org': notify  => Exec['service apache2 reload'], key => true, }
-
-       ferm::rule{
-               "tordnsel-exit":
-                       description     => "Allow tordnsel exit queries",
-                       rule            => "&SERVICE(tcp, (8000 10080 10443 10110 5190 6667 6697 9030))",
-                       ;
-               "tordnsel-dns":
-                       description     => "Allow tordnsel dns queries",
-                       rule            => "&TCP_UDP_SERVICE(10053)",
-                       ;
-               # XXX MAGIC-IP-ADDRESS
-               "do-track":
-                       domain      => '(ip)',
-                       description => 'do TRACK for tordnsel traffic',
-                       table       => 'raw',
-                       chain       => 'PREROUTING',
-                       rule        => 'daddr 116.202.120.177 proto tcp dport (http https) jump RETURN',
-                       ;
-               "tor-nat":
-                       description     => "redirect some incoming to high ports",
-                       table           => 'nat',
-                       chain           => 'PREROUTING',
-                       rule            => 'daddr 116.202.120.177 proto tcp dport  80 DNAT to :10080;
-                                           daddr 116.202.120.177 proto tcp dport 443 DNAT to :10443;
-                                           daddr 116.202.120.177 proto tcp dport 110 DNAT to :10110;
-                                           daddr 116.202.120.176 proto udp dport  53 DNAT to :10053;
-                                           daddr 116.202.120.176 proto tcp dport  53 DNAT to :10053 ',
-                       ;
-       }
-}
diff --git a/modules/sudo/files/sudoers b/modules/sudo/files/sudoers
index 7052c067..a1b7c52f 100644
--- a/modules/sudo/files/sudoers
+++ b/modules/sudo/files/sudoers
@@ -44,7 +44,6 @@ letsencrypt           nevii=(dnsadm)                          NOPASSWD: /srv/dns.torproject.org/bin/update
 %atlas                 STATICMASTER=(atlas)                    ALL
 %bridgedb              polyanthum=(bridgedb,bridgescan)                        ALL
 %buildmasters          rouyi=(jenkins)                         ALL
-%check                 chiwui=(check)                          ALL
 %collector             COLLECTORHOSTS=(collector)              ALL
 %consensus-health      henryi=(consensus-health)               ALL
 %dip                   gitlab-01=(git)                         ALL
@@ -63,7 +62,6 @@ letsencrypt           nevii=(dnsadm)                          NOPASSWD: /srv/dns.torproject.org/bin/update
 %rtfolks               rude=(rtmailarchive)                    ALL
 %torarchive            archive-01=(torarchive)                 ALL
 %tordebadm             palmeri=(tordeb)                        ALL
-%tordnsel              chiwui=(tordnsel)                       ALL
 %torhelp               STATICMASTER=(torhelp)                  ALL
 %tormedia              listera=(tormedia)                      ALL
 %torperf               ferrinii=(torperf)                      ALL
@@ -122,7 +120,6 @@ noc         peninsulare=(root)      ALL
 
 # various roles can do other interesting things
 %bridgedb      polyanthum=(root)               /usr/local/sbin/apache2-vhost-update
-%check         chiwui=(root)   /usr/local/sbin/apache2-vhost-update
 %rtfolks       rude=(root)             /usr/local/sbin/apache2-vhost-update
 
 %buildmasters          rouyi=(root)                            /usr/sbin/service jenkins *
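
to make clear what the deleted roles::check class was doing, the "tor-nat" rule above can be read as a small lookup table from (destination address, protocol, port) to a local high port. The Python sketch below only restates that table for illustration. `DNAT_RULES` and `redirect_port` are made-up names, and this is not how ferm evaluates rules.

```python
from typing import Optional

# (destination address, protocol, original port) -> local high port,
# transcribed from the "tor-nat" ferm rule in the removed roles::check class
DNAT_RULES = {
    ("116.202.120.177", "tcp", 80): 10080,
    ("116.202.120.177", "tcp", 443): 10443,
    ("116.202.120.177", "tcp", 110): 10110,
    ("116.202.120.176", "udp", 53): 10053,
    ("116.202.120.176", "tcp", 53): 10053,
}


def redirect_port(daddr: str, proto: str, dport: int) -> Optional[int]:
    """Return the high port a packet would be redirected to, or None if no rule matches."""
    return DNAT_RULES.get((daddr, proto, dport))


print(redirect_port("116.202.120.176", "udp", 53))  # DNS query to chiwui4
print(redirect_port("116.202.120.177", "tcp", 22))  # no matching rule
```

The two addresses are the chiwui2 (.177) and chiwui4 (.176) A records removed from DNS. Port 53 was only redirected on chiwui4, which is also the name the removed exitlist NS record pointed at.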

step 7

removed from tor-passwords

step 8

DNSWL N/A

step 9

removed from spreadsheet

step 10

N/A

step 11

removed from reverse DNS in Hetzner.

we're all done here, good bye chiwui, you served us well!

thanks to the metrics team and special thanks to irl for finally bringing us to this point, you rock! :)

comment:26 Changed 3 months ago by anarcat

for some reason the gnt-instance remove never ran on fsn-node-01, i ran it by hand now.
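
one way to notice a scheduled removal that did not take effect would be to keep the at(1) confirmation lines from the retirement log and list the jobs whose date has passed, so their result can be verified by hand. A sketch, assuming the "job N at DATE" format seen in the log of comment:25 and an English/C locale for the day and month names. `parse_at_job` and `overdue` are hypothetical helpers, not part of any existing tool.

```python
import re
from datetime import datetime

# matches the confirmation line printed by at(1),
# e.g. "job 10 at Thu Apr 16 20:39:00 2020"
AT_LINE = re.compile(r"^job (\d+) at (.+)$")


def parse_at_job(line: str) -> tuple[int, datetime]:
    """Extract the job number and scheduled time from an at(1) confirmation line."""
    match = AT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not an at(1) confirmation line: {line!r}")
    when = datetime.strptime(match.group(2), "%a %b %d %H:%M:%S %Y")
    return int(match.group(1)), when


def overdue(lines: list[str], now: datetime) -> list[int]:
    """Return the numbers of jobs whose scheduled time is already in the past."""
    jobs = [parse_at_job(line) for line in lines]
    return [number for number, when in jobs if when <= now]


log_lines = [
    "job 10 at Thu Apr 16 20:39:00 2020",
    "job 24 at Sat May  9 20:40:00 2020",
]
print(overdue(log_lines, now=datetime(2020, 5, 1)))
```

A job showing up in this list only means its scheduled time has passed. Whether the command actually succeeded still has to be checked on the target host.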

comment:27 Changed 3 months ago by anarcat

Keywords: tpa-roadmap-april added