Opened 8 months ago

Closed 8 months ago

Last modified 8 months ago

#33446 closed task (fixed)

migrate cupani/git-rw to the ganeti cluster, triggering an IP address change

Reported by: anarcat
Owned by: anarcat
Priority: High
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Major
Keywords: tpa-roadmap-march
Cc: irl
Actual Points:
Parent ID: #33085
Points:
Reviewer: irl
Sponsor:

Description

i will soon migrate cupani AKA git-rw.torproject.org to the new ganeti cluster. this will involve an IP address change which might affect the service.

please let me know if there are any problems you can think of. in particular, do let me know if any internal (inside the server) or external (outside the server) services hardcode the IP address of cupani.
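
One way to look for hardcoded addresses on a host is a recursive fixed-string search. The following is only a sketch: the function name `find_hardcoded_ip` is made up for illustration, the addresses in the usage comment are the old cupani addresses that appear in the /etc/network/interfaces diff later in this ticket, and the directories worth searching are an assumption.

```shell
# Sketch: list files under a directory that mention any of the given
# literal addresses. find_hardcoded_ip is a hypothetical helper name.
# Usage (as root, paths are assumptions):
#   find_hardcoded_ip /etc 78.47.38.228 2a01:4f8:211:6e8:0:823:4:1
find_hardcoded_ip() {
    dir=$1
    shift
    for addr in "$@"; do
        # -r recurse, -I skip binary files, -l file names only,
        # -F treat the address as a fixed string, not a regex.
        # Note: this is a substring match, so 78.47.38.228 would also
        # match inside 178.47.38.228.
        grep -rIlF -e "$addr" -- "$dir" 2>/dev/null
    done | sort -u
}
```

This only finds addresses stored in plain text on the machine being searched; firewall rules exported from Puppet or ACLs on other hosts need the same search run there.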

thanks!


Change History (17)

comment:1 Changed 8 months ago by irl

We should test that IRC notifications still work after the move. Other notifications and hooks seem to be authenticated in some way, but the IRC ones aren't so I suspect IP address based auth.

comment:2 Changed 8 months ago by anarcat

Cc: irl added

> We should test that IRC notifications still work after the move.

That may sound like a stupid question, but how do I actually do that? :)

thanks!

comment:3 Changed 8 months ago by anarcat

Owner: changed from tpa to anarcat
Status: new → accepted

i started working on this, and imported the disks on fsn-node-03. still need to do ip renumbering and everything after step 6.

comment:4 Changed 8 months ago by anarcat

managed to boot the box. turns out disk order matters so i'll have to improve my inventory routine to list disks in the right order.

the other problem is that it's not finding /srv, but that might not be the migration script's fault, because the disk is available in the right order in the vm now. crypto?

comment:5 Changed 8 months ago by anarcat

after resyncing disks with the VM suspended, it seems we're in a better state and the VM booted without problems.

up next is local IP change and testing.

comment:6 Changed 8 months ago by anarcat

the ip has been changed and the new host is available for testing at 116.202.120.182 and 2a01:4f8:fff0:4f:266:37ff:fe5f:c6c6. All changes will be lost in the final sync.

comment:7 Changed 8 months ago by irl

Reviewer: irl
Status: accepted → needs_revision

remote: ssh: connect to host vineale.torproject.org port 22: Connection refused
remote: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
remote: rsync error: error in rsync protocol data stream (code 12) at io.c(235) [sender=3.1.2]

git-rw needs to be able to SSH to vineale.torproject.org directly, which I guess is an iptables thing but maybe it's more complicated.

comment:8 Changed 8 months ago by anarcat

Status: needs_revision → needs_review

ah yes, that's because the jumphost configuration (which allows all hosts to talk to all others) is defined in puppet and exported based on the *current* ip address (which hasn't been updated in puppet yet). I expect this will fix itself after the migration is done for real.

for now, I fixed this by hand with this on vineale:

iptables -I INPUT -j ACCEPT -s 116.202.120.182
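
A hand-inserted rule like this has to be removed again once Puppet exports the real one. The sketch below only prints the matching insert and delete commands instead of running them; `temp_allow_rule` is a hypothetical helper name, not part of any existing tooling.

```shell
# Sketch: print (do not run) the iptables command for a temporary
# allow rule, so the add and the later delete use the same rule spec.
# temp_allow_rule is a hypothetical helper name.
temp_allow_rule() {
    action=$1   # "add" or "del"
    src=$2      # source address to allow
    case $action in
        add) flag=-I ;;   # insert at the top of the chain
        del) flag=-D ;;   # delete the rule matching this spec
        *) echo "usage: temp_allow_rule add|del ADDRESS" >&2; return 2 ;;
    esac
    printf 'iptables %s INPUT -s %s -j ACCEPT\n' "$flag" "$src"
}
```

For example, `temp_allow_rule del 116.202.120.182` prints the command that undoes the rule above; it would be run as root on vineale after the migration is final.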

can you try again please?

even better: could you provide a quick few commands I could run to test this myself?

thanks!

comment:9 Changed 8 months ago by irl

Somehow the SSH key is missing to push to vineale, and the IRC message hook is broken (and at a guess this may also be affecting the email hook).

You can't test this unless you are part of the git team, because you won't have permission to push to the config repo. While this host may not be a production host, all the hosts that the triggers are talking to *are* production hosts.

I'm afk today at a meeting, and then it's the weekend, so this may be all the progress we make this week. Note that next week I only have 3 days, and then I'm back on the 23rd.

comment:10 Changed 8 months ago by anarcat

> Somehow the SSH key is missing to push to vineale, and the IRC message hook is broken (and at a guess this may also be affecting the email hook).

I don't understand that diagnostic. What do you mean "missing SSH key" - the actual file is gone? Which path exactly? What error message are you getting?

What is the IRC message hook? How does it get fired?

It's hard for me to help you in those conditions, as I have no way of reproducing this problem in any form and you are providing very little information describing the actual issues you are having.

> You can't test this unless you are part of the git team, because you won't have permission to push to the config repo.

Couldn't I add myself to the git team, at least at the technical level?

> While this host may not be a production host, all the hosts that the triggers are talking to *are* production hosts.

I understand that we need to be careful. :)

> I'm afk today at a meeting, and then it's the weekend, so this may be all the progress we make this week. Note that next week I only have 3 days, and then I'm back on the 23rd.

Okay well, I'll leave this on my desk until monday, but after that I'm tempted to just resync the VM and change the DNS and deal with the problems that result after... Otherwise we'll never get through with this. :)

And sorry you have to go through with those things, you're kind of my guinea pig for this procedure. I underestimated the work involved in the renumbering and server move, and I am starting to see how much trouble it might entail. I will try to proceed differently for the next migrations.

comment:11 Changed 8 months ago by anarcat

Keywords: tpa-roadmap-march added; tpa-roadmap-february removed

comment:12 Changed 8 months ago by anarcat

i'm going to re-sync cupani-new, discarding all local changes.

local changes in /etc:

root@cupani-new:/etc# git diff --cached | cat
diff --git a/network/interfaces b/network/interfaces
index 68f9eec..4156ee4 100644
--- a/network/interfaces
+++ b/network/interfaces
@@ -6,11 +6,23 @@ auto lo
 iface lo inet loopback
 
 # The primary network interface
-allow-hotplug eth0
+#allow-hotplug eth0
+#iface eth0 inet static
+#    address 78.47.38.228/28
+#    gateway 78.47.38.225
+#iface eth0 inet6 static
+#    accept_ra 0
+#    address 2a01:4f8:211:6e8:0:823:4:1/96
+#    gateway 2a01:4f8:211:6e8:0:823:0:1
+
+auto eth0
 iface eth0 inet static
-    address 78.47.38.228/28
-    gateway 78.47.38.225
+    address 116.202.120.182/27
+    gateway 116.202.120.161
+
 iface eth0 inet6 static
     accept_ra 0
-    address 2a01:4f8:211:6e8:0:823:4:1/96
-    gateway 2a01:4f8:211:6e8:0:823:0:1
+    address 2a01:4f8:fff0:4f:266:37ff:fe5f:c6c6/64
+    gateway 2a01:4f8:fff0:4f::1
+#00:66:37:5f:c6:c6
+#      IP: 116.202.120.182
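
When renumbering by hand like this, it is easy to mistype a prefix length or gateway. A small sanity check is to confirm that the address and its gateway fall in the same network. The following is a sketch in plain shell arithmetic; the function names are invented for illustration and it handles IPv4 only.

```shell
# Sketch: check that an IPv4 address and gateway share a network.
# All function names here are hypothetical helpers.

# Convert a dotted quad to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Print the network address for ADDRESS PREFIXLEN.
network_of() {
    ip_int=$(ip_to_int "$1")
    mask=$(( (4294967295 << (32 - $2)) & 4294967295 ))
    net=$(( ip_int & mask ))
    echo "$(( (net >> 24) & 255 )).$(( (net >> 16) & 255 )).$(( (net >> 8) & 255 )).$(( net & 255 ))"
}

# Succeed if ADDRESS and GATEWAY are in the same /PREFIXLEN network.
same_subnet() {
    [ "$(network_of "$1" "$3")" = "$(network_of "$2" "$3")" ]
}

# Usage:
#   same_subnet 116.202.120.182 116.202.120.161 27 && echo ok
```

With the values from the diff, `network_of 116.202.120.182 27` gives 116.202.120.160, and the gateway 116.202.120.161 lies in that same /27.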

sudoers was also changed, but that shouldn't matter in the final migration:

diff --git a/sudoers b/sudoers
index 7052c06..149af16 100644
--- a/sudoers
+++ b/sudoers
@@ -53,6 +53,7 @@ letsencrypt		nevii=(dnsadm)				NOPASSWD: /srv/dns.torproject.org/bin/update
 %exonerator-web		materculae=(exonerator-web)		ALL
 %fpcentral		forrestii=(fpcentral)			ALL
 %gitolite		cupani=(git)				ALL
+%gitolite		cupani-new=(git)				ALL
 %gitweb			vineale=(gitweb)			ALL
 %metrics		meronense=(metrics)			ALL
 %onionoo		ONIONOOHOSTS=(onionoo)			ALL

comment:13 Changed 8 months ago by anarcat

i redid a sync without problems today, but i've removed the cloned machine. i'll finalize the migration tomorrow morning (UTC-4) and just fix the problems as they come along.

comment:14 Changed 8 months ago by anarcat

okay, back from the top, our new checklist:

  1. picked fsn-node-03 again
  2. done
  3. done
  4. done
  5. done, tests showed that IP address is hardcoded in many locations (#33586) and will need manual changes
  6. done, machine shutdown
  7. done
  8. redid migration
  9. tests done: seems like things generally work (including IRC and static pushes when firewall and keys are set just so)
  10. DRBD: done
  11. DNS changes:
    • nagios
    • LDAP
    • Puppet
    • hetzner robot reverse DNS
  12. more tests:
    • wiki push: works
    • jenkins: TODO
    • nagios push: works
    • vineale mirror: works
    • DNS/nevii changes: TODO
    • ud-replicate: works
  13. old machine retirement: todo

comment:15 Changed 8 months ago by anarcat

Resolution: fixed
Status: needs_review → closed

i'm going to go with the assertion that DNS and jenkins still work for now, unless proven otherwise.

i've also scheduled removal of the cupani disks from unifolium, and accidentally revoked the Puppet cert and scheduled the backup removal.

i reverted the backup removal job, and now i need to recreate a cert for cupani.

comment:16 Changed 8 months ago by anarcat

the problems with the revocation procedure were discussed in #33587. i recreated a new cert by moving /var/lib/puppet/ssl aside on the client and rebootstrapping puppet. all done.
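
For reference, "moving the directory aside" could look roughly like the sketch below. The helper name `move_ssl_aside` and the timestamped backup suffix are assumptions for illustration; the exact commands used on cupani are not recorded in this ticket.

```shell
# Sketch: move the Puppet agent's SSL directory aside so the next run
# generates a fresh key and certificate request.
# move_ssl_aside is a hypothetical helper; the backup naming is an
# assumption, not what was actually done on cupani.
move_ssl_aside() {
    ssl_dir=${1:-/var/lib/puppet/ssl}
    backup="$ssl_dir.old-$(date +%Y%m%d%H%M%S)"
    if [ ! -d "$ssl_dir" ]; then
        echo "no such directory: $ssl_dir" >&2
        return 1
    fi
    # Keep the old material around instead of deleting it outright.
    mv -- "$ssl_dir" "$backup" && echo "$backup"
}
```

After this, the client has to be bootstrapped again and its new certificate request signed on the master.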

comment:17 Changed 8 months ago by hiro

I solved it for hetzner-nbg1-02.torproject.org as follows:

While trying to just regenerate the certificate on the client I noticed puppet was throwing an error:

To fix this, remove the CSR from both the master and the agent and then start a puppet run, which will automatically regenerate a CSR.
On the master:
  puppet cert clean hetzner-nbg1-02.torproject.org
On the agent:
  1a. On most platforms: find /var/lib/puppet/ssl -name hetzner-nbg1-02.torproject.org.pem -delete
  1b. On Windows: del "\var\lib\puppet\ssl\certs\hetzner-nbg1-02.torproject.org.pem" /f
  2. puppet agent -t

So I cleaned the cert on the master too.

On the master:

puppet cert clean hetzner-nbg1-02.torproject.org

On the client:

find /var/lib/puppet/ssl -name hetzner-nbg1-02.torproject.org.pem -delete

Then:
Run the bootstrap script from tsa-misc/installer/puppet-bootstrap-client to get a new checksum

Again on the master:

tpa-puppet-sign-client

And pass the obtained checksum. The client will pick it up from there.

Run:

puppet agent -t

To have puppet running on the client again.
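
The client-side cleanup step above can be wrapped so that it prints what it removes. This is only a sketch: `clean_client_cert` is a hypothetical helper name, and it covers only the `find ... -delete` step, not the bootstrap or signing steps.

```shell
# Sketch: delete every <fqdn>.pem under the agent's SSL directory and
# print each path as it is removed.
# clean_client_cert is a hypothetical helper name.
clean_client_cert() {
    ssl_dir=$1   # e.g. /var/lib/puppet/ssl
    fqdn=$2      # e.g. hetzner-nbg1-02.torproject.org
    # -print runs before -delete, so each removed file is listed.
    find "$ssl_dir" -name "$fqdn.pem" -print -delete
}

# Usage (as root on the client):
#   clean_client_cert /var/lib/puppet/ssl hetzner-nbg1-02.torproject.org
```

Files for other names, such as the CA certificate, are left in place because the `-name` pattern matches only the host's own `.pem` files.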
