#30912 closed defect (fixed)

Investigate stunnel outage on crm-ext-01

Reported by: peterh
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

On 6/12 we started getting a lot of error messages from donate.torproject.org. Antoine took a look, determined that stunnel was the problem, and restarted it; that seems to have fixed things.

When stunnel is down, donations don't get queued into Redis, so they don't get recorded in the CRM and receipts don't get sent. Mailing list signups also don't get recorded. It's very time consuming to get that data into the CRM from the error emails.

We'd like to see if we can figure out what happened to cause the stunnel to stop working and then make it more reliable.

I have seen very small outages in the past.

Here's what I'm thinking we should try:

  1. Could someone turn up the logging level for stunnel on crm-ext-01 to info (6)? If its messages aren't already separated out into their own syslog file, please separate them so they're easy to find when debugging (see the rsyslog sketch after this list).
  2. Then the next time I get an error message about it, we can look through the logs and maybe that will give us some idea of what the problem is.
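For reference, a minimal sketch of what that syslog separation could look like, assuming rsyslog is in use (the log file path matches the one commented out in the stunnel config further down; how this would be managed in Puppet is left out):

# send everything logged under the "stunnel" program tag to its own file, then stop processing it
:programname, isequal, "stunnel"    /var/log/stunnel4/stunnel.log
& stop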

Child Tickets

Change History (24)

comment:1 Changed 4 months ago by peterh

My gut feeling on this is that stunnel probably isn't the problem. It's probably reacting to something that's happening in Redis. One idea I have is maybe we could get rid of stunnel? As far as I know Redis doesn't have any encryption, so I'd rather not just open it up on crm-int-01 to allow crm-ext-01 to connect to it. We could also try spiped, which is what Redis recommends when you want to connect to it remotely.
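For the record, a rough sketch of what a spiped setup could look like here (the key path is a placeholder, and the 16379 forwarding port and addresses are taken from the existing stunnel setup; nothing like this is actually deployed):

# on crm-int-01: decrypt incoming connections and hand them to the local redis
spiped -d -s '[138.201.212.235]:16379' -t '[127.0.0.1]:6379' -k /etc/spiped/redis.key
# on crm-ext-01: encrypt connections made to the local port and forward them to crm-int-01
spiped -e -s '[127.0.0.1]:6379' -t '[138.201.212.235]:16379' -k /etc/spiped/redis.key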

comment:2 Changed 4 months ago by anarcat

so last time this happened, I *did* look at the (stunnel? or redis? both? i forgot) logs but didn't see anything fishy. one thing that *did* happen, it seems, is that:

it looked like the stunnel process on crm-int-01 was gone. I restarted both stunnel processes and things seem to be back in order now.

So maybe the simplest fix would be to tweak the systemd .service file for stunnel to forcibly restart the process when it exits, regardless of the exit status. I think that by default systemd doesn't restart services after a crash, so this might fix the problem for us.
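For what it's worth, a sketch of what such a systemd override could look like (assuming the unit is called stunnel4.service; the actual unit name on these hosts may differ):

# hypothetical drop-in at /etc/systemd/system/stunnel4.service.d/restart.conf
[Service]
Restart=always
RestartSec=5s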

What I suspect happened is that we rebooted boxes for security upgrades recently. Maybe that threw the stunnels out of whack... I don't know. We have more security reboots to perform soon, so I'll hold off on deploying a fix here to see if the reboot causes the problem.

spiped looks interesting (thanks! didn't know about that one before!) but I'm not sure it's the right solution now because it only introduces another "unknown" with similar properties to stunnel. if I were to fix this another way, I would create an IPsec tunnel between the two machines, something we already have code to deploy automatically (as opposed to spiped, which isn't used anywhere in TPA yet).

comment:3 Changed 4 months ago by anarcat

FWIW, I've rebooted everything today -- including the two CRM-* hosts -- and everything still seems to be up. Now that might be because the two hosts were rebooted at around the same time and they just happened to reset correctly, but that's one data point.

I'll look into this again next week, especially at your recommendations (thanks!).

comment:4 Changed 4 months ago by anarcat

This change is now online:

Notice: /Stage[main]/Roles::Civicrm_ext_2018/Stunnel4::Client[civicrm-redis]/Stunnel4::Generic[civicrm-redis]/File[/etc/stunnel/puppet-civicrm-redis.conf]/content: 
--- /etc/stunnel/puppet-civicrm-redis.conf	2018-12-04 21:50:07.091358516 +0000
+++ /tmp/puppet-file20190625-4997-sxgs5o	2019-06-25 19:46:34.976126772 +0000
@@ -16,7 +16,7 @@
 CAfile = /etc/stunnel/puppet-civicrm-redis-peer.pem
 
 ; Some debugging stuff useful for troubleshooting
-debug = notice
+debug = info
 ; don't use a file, use syslog
 ; output = /var/log/stunnel4/stunnel.log
 


so we might get more info if stunnel crashes again. Do let us know (here) when/if that happens again.

comment:5 Changed 4 months ago by anarcat

Status: new → needs_information

comment:6 Changed 3 months ago by peterh

It happened again. The first notification I got was at 2019-07-08 19:12 PDT. I made ticket https://trac.torproject.org/projects/tor/ticket/31119 to get it restarted.

comment:7 Changed 3 months ago by anarcat

details from #31119 debugging...

daemon.log on crm-ext-01:

Jul  9 16:21:58 crm-ext-01/crm-ext-01 stunnel[25094]: LOG5[17173]: Service [civicrm-redis-server] accepted connection from 127.0.0.1:55102
Jul  9 16:21:58 crm-ext-01/crm-ext-01 stunnel[25094]: LOG6[17173]: failover: round-robin, starting at entry #1
Jul  9 16:21:58 crm-ext-01/crm-ext-01 stunnel[25094]: LOG6[17173]: s_connect: connecting 138.201.212.235:16379
Jul  9 16:21:58 crm-ext-01/crm-ext-01 stunnel[25094]: LOG5[17173]: s_connect: connected 138.201.212.235:16379
Jul  9 16:21:58 crm-ext-01/crm-ext-01 stunnel[25094]: LOG5[17173]: Service [civicrm-redis-server] connected remote server from 138.201.212.236:54124
Jul  9 16:21:58 crm-ext-01/crm-ext-01 stunnel[25094]: LOG6[17173]: SNI: sending servername: crm-int-01.torproject.org
Jul  9 16:21:58 crm-ext-01/crm-ext-01 stunnel[25094]: LOG6[17173]: Peer certificate required
Jul  9 16:21:59 crm-ext-01/crm-ext-01 stunnel[25094]: LOG5[17174]: Service [civicrm-redis-server] accepted connection from 127.0.0.1:55110
Jul  9 16:21:59 crm-ext-01/crm-ext-01 stunnel[25094]: LOG6[17174]: failover: round-robin, starting at entry #0
Jul  9 16:21:59 crm-ext-01/crm-ext-01 stunnel[25094]: LOG6[17174]: s_connect: connecting 2a01:4f8:172:39ca:0:dad3:11:1:16379
Jul  9 16:21:59 crm-ext-01/crm-ext-01 stunnel[25094]: LOG5[17174]: s_connect: connected 2a01:4f8:172:39ca:0:dad3:11:1:16379
Jul  9 16:21:59 crm-ext-01/crm-ext-01 stunnel[25094]: LOG5[17174]: Service [civicrm-redis-server] connected remote server from 2a01:4f8:172:39ca:0:dad3:12:1:33678
Jul  9 16:21:59 crm-ext-01/crm-ext-01 stunnel[25094]: LOG6[17174]: SNI: sending servername: crm-int-01.torproject.org
Jul  9 16:21:59 crm-ext-01/crm-ext-01 stunnel[25094]: LOG6[17174]: Peer certificate required
Jul  9 16:22:13 crm-ext-01/crm-ext-01 stunnel[25094]: LOG6[17165]: ssl_start: s_poll_wait: TIMEOUTbusy exceeded: sending reset
Jul  9 16:22:13 crm-ext-01/crm-ext-01 stunnel[25094]: LOG5[17165]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
Jul  9 16:22:41 crm-ext-01/crm-ext-01 stunnel[25094]: LOG6[17166]: ssl_start: s_poll_wait: TIMEOUTbusy exceeded: sending reset
Jul  9 16:22:41 crm-ext-01/crm-ext-01 stunnel[25094]: LOG5[17166]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
Jul  9 16:22:58 crm-ext-01/crm-ext-01 stunnel[25094]: LOG6[17167]: ssl_start: s_poll_wait: TIMEOUTbusy exceeded: sending reset
Jul  9 16:22:58 crm-ext-01/crm-ext-01 stunnel[25094]: LOG5[17167]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
Jul  9 16:23:11 crm-ext-01/crm-ext-01 stunnel[25094]: LOG6[17168]: ssl_start: s_poll_wait: TIMEOUTbusy exceeded: sending reset
Jul  9 16:23:11 crm-ext-01/crm-ext-01 stunnel[25094]: LOG5[17168]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket

and from crm-int-01's perspective:

Jul  9 16:21:58 crm-int-01/crm-int-01 stunnel[25139]: LOG5[1138]: Service [civicrm-redis-server] accepted connection from ::ffff:138.201.212.236:54124
Jul  9 16:21:59 crm-int-01/crm-int-01 stunnel[25139]: LOG5[1139]: Service [civicrm-redis-server] accepted connection from 2a01:4f8:172:39ca:0:dad3:12:1:33678

AKA "all is well here".

redis is up on int-01:

root@crm-int-01:~# echo PING | nc localhost 6379
+PONG

but hangs on ext-01:

root@crm-ext-01:~# echo PING | nc localhost 6379
[nothing]

restarting stunnel on ext-01 does nothing. restarting stunnel on int-01 fixes it. therefore, there might be something wrong on the int-01 side, some fd leak maybe?
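One quick way to check the fd-leak theory on crm-int-01 the next time this happens (a sketch; assumes the process shows up under the name "stunnel", as in the syslog tag above):

pid=$(pgrep -o stunnel)
ls /proc/$pid/fd | wc -l              # number of open file descriptors
grep 'open files' /proc/$pid/limits   # the limit to compare against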

next debugging step is to restart int-01 only.

comment:8 Changed 3 months ago by peterh

You know, I like the idea of trying IPsec if that's not too hard for you guys to set up. Because it's lower level, it probably won't have connection issues if there are temporary problems with Redis.

I also think you should up the debugging level for stunnel on crm-int-01. I think you did it on crm-ext-01 last time, but now we think it's crm-int-01 that has the problem.

comment:9 Changed 3 months ago by anarcat

logging: good idea, done!

ipsec: will consider it if we can't find another option; it's a little more complicated to set up. :)

comment:10 Changed 3 months ago by peterh

I've been emailing with Sarah about this issue and we're pretty worried about the impact these outages have on newsletter subscriptions, because we don't have a good way to recover those. There's also a cost to entering all the missed donations manually. So I think trying IPsec might be worth it.

comment:11 Changed 3 months ago by anarcat

we're adding another check to our infra to detect this better, hopefully we'll be able to repair this more quickly the next time it breaks, as a stopgap while we figure out the best way forward. sorry for the delays, we know how important this is! :)

comment:12 Changed 3 months ago by anarcat

why can't we simply run two redis instances? do they absolutely have to be interconnected?

comment:13 Changed 3 months ago by peterh

It's because we're using Redis as a message broker as well as a key/value store. I was surprised the first time I heard of this, too, but tons of people use Redis to pass messages between apps. It has a list data structure that turns out to be really good for message queues. It's nice because it's really lightweight compared to other message brokers, which makes it good for this kind of small application.
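To illustrate the pattern (the queue name and payload here are made up, not what the donation site actually uses):

# producer side: push a message onto a list acting as a queue
redis-cli LPUSH donation-queue '{"email":"donor@example.org","amount":10}'
# consumer side: block until a message arrives, popping from the other end
redis-cli BRPOP donation-queue 0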

It looks like maybe stunnel went down again between about 3AM and 5AM PDT today. I'm guessing your new detection thing means you got it back up faster? If so, awesome! Anything in the stunnel logs on crm-int-01?

comment:14 Changed 3 months ago by anarcat

so the two CRM instances talk to each other? i know how redis can be used as a message cache, i was just wondering why we were passing messages between the two instances if we wanted those to be separate in the first place.

and yes, weasel found out about a crash and quickly fixed it during the night (~PDT). we haven't found the cause of the crash this time around either and nothing significant showed up in the logs.

comment:15 Changed 3 months ago by peterh

The two instances talk to each other through Redis. crm-int-01 runs CiviCRM on top of Drupal, which stores all of Tor's donation information. Normally you set up CiviCRM on a public web server so that you can use its built-in public pages to accept donations, join mailing lists, etc. We don't really trust CiviCRM's security, so we didn't want it on a public server, but we still need a way to get donation information into the database. Having the public donation site pass donation information to the internal site through Redis massively shrinks the normal CiviCRM security footprint.

I think if we didn't gather any information from this crash, we either need to come up with some new way to gather information or try one of the other options like IPsec or spiped.

comment:16 Changed 3 months ago by anarcat

thanks for the update, i'll see what we can do for the next step...

comment:17 Changed 3 months ago by anarcat

an ipsec tunnel is now present between the two hosts. any time one host reaches the other, traffic is encrypted over the wire. so you can now connect directly to the Redis host safely, without going through stunnel. this will require a configuration change on your side, of course, but that should be easy enough to perform.

note that if the tunnel fails (i.e. it doesn't start at all or is stopped by a malicious actor), redis would talk to the other host in cleartext. we thought of some options to work around that problem, like creating an RFC 1918 IP address just for this purpose, but I figured we'd try this first and see if it works.

we should have monitoring on the tunnels to make sure they don't go down, so at least failures should get noticed (i'll double-check that).

thank you for your patience...

(we're in the middle of a mailing now, so i'd recommend waiting a little bit before making those changes... ;) i figured backend changes like this weren't a problem because they are inactive as long as you don't apply the changes on your side...)

comment:18 Changed 3 months ago by peterh

Thanks for jumping on this so quickly. I'm not super familiar with ipsec, but I suspect that in order to get this to work we'll need redis on crm-int-01 to listen on 0.0.0.0 or its inet6 address instead of localhost? Currently it's only bound to localhost. I definitely can't telnet to crm-int-01 port 6379 from crm-ext-01 right now.

Maybe we'll also want to set up an iptables rule or something so that only crm-ext-01 can connect to that port?

comment:19 Changed 3 months ago by anarcat

as it turns out, ipsec tunnels are not directly monitored, so we do need some stronger guarantees here.

i added the following record into DNS:

crm-int-01-priv.torproject.org.	3600 IN	A	172.30.136.1
crm-ext-01-priv.torproject.org. 3600 IN A       172.30.136.2

which point to the two CRM instances, but on an internal network whose traffic shouldn't travel outside of the VPN.

you should therefore point the Redis connection on crm-ext-01 at crm-int-01-priv.torproject.org (or plain crm-int-01-priv) instead of localhost.

and yes, i had to reconfigure redis to listen on the new network.

we DEFINITELY do not want to make redis listen on all interfaces (0.0.0.0). even if we have a firewall in place, that could possibly be "very bad" if the firewall fails.
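For reference, the relevant redis.conf lines on crm-int-01 would look roughly like this (a sketch; the exact config is managed through Puppet):

# listen on localhost plus the IPsec-only address -- never on 0.0.0.0
bind 127.0.0.1 172.30.136.1
protected-mode yes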

the tunnel works, in my tests:

crm-ext-01# echo PING | nc -s 172.30.136.2 -v -w 1 172.30.136.1 6379 
Connection to 172.30.136.1 6379 port [tcp/*] succeeded!
+PONG

note that i had to pick a specific source address (-s 172.30.136.2) so that the firewall rules match (because we allow only traffic from the tunnel). unfortunately, "outgoing" connections go out with a source IP of 138.201.212.236 by default, so you have to tweak that. no idea if it would be possible to do so in Drupal/CiviCRM/PHP or whatever is going on up there. :)
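If the application can't pick a source address itself, one OS-level workaround might be to set the preferred source for that destination in the routing table (just a sketch, not something deployed here; the interface name is a guess):

ip route replace 172.30.136.1/32 dev eth0 src 172.30.136.2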

as a worst case scenario (if the above fails for you), we might consent to punching a hole for 138.201.212.236 as well, but i'd rather avoid such hackery for now.

we'll also research if there are better ways to do firewalling with ipsec, if you can't figure out the source IP stuff...

thank you for your patience!

comment:20 Changed 3 months ago by anarcat

Owner: changed from tpa to peterh
Status: needs_information → assigned

last update: i figured out how to craft a firewall rule that restricts traffic to that originating from the IPsec network, so you should be all good to go, without having to hack around your source IP.
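The rule presumably looks something like this (a hedged guess using iptables' IPsec policy match; the real rules live in TPA's Puppet and may differ):

# accept redis connections only when they arrived through an IPsec policy from the -priv address
iptables -A INPUT -p tcp --dport 6379 -m policy --dir in --pol ipsec -s 172.30.136.2 -j ACCEPT
iptables -A INPUT -p tcp --dport 6379 -j DROP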

you're now all clear to go and use the crm-int-01-priv.torproject.org redis endpoint on crm-ext-01. when you have done the switch, let me know and i'll tear down the old stunnel configuration.

comment:21 Changed 3 months ago by peterh

Awesome! I switched staging and the live site to use crm-int-01-priv.torproject.org. Right after you tear down stunnel, could you go to https://donate.torproject.org/subscribe and subscribe to the newsletter? That will send you an email, and when you click the link it will subscribe you. If you get to the Subscription Confirmed (/subscribed) page, then all the Redis stuff is working.

If it's not working you'll need to put the stunnel back up.

comment:22 Changed 3 months ago by anarcat

awesome, thanks for the integration test, i'll take a look.

comment:23 Changed 3 months ago by anarcat

Owner: changed from peterh to anarcat

comment:24 Changed 3 months ago by anarcat

Resolution: fixed
Status: assigned → closed

i ripped out the stunnel stuff, and things still seem to be working. so I guess we're good here - if the problem returns, don't hesitate to (re)open this bug report (or a fresh new one).
