On 6/12, we started getting a lot of error messages from donate.torproject.org. Antoine took a look and determined that stunnel was the problem; restarting it seems to have fixed things.
When stunnel is down, donations don't get queued into Redis, so they don't get recorded in the CRM and receipts don't get sent. Mailing list signups don't get recorded either. It's very time-consuming to get that data into the CRM from the error emails.
We'd like to see if we can figure out what happened to cause the stunnel to stop working and then make it more reliable.
I have seen very small outages in the past.
Here's what I'm thinking we should try:
Could someone turn up the logging level for stunnel on crm-ext-01 to info (6), and, if it's not already separated out into a different syslog file, separate it out so it's easy to debug?
Then the next time I get an error message about it, we can look through the logs and maybe that will give us some idea of what the problem is.
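For what it's worth, I think the stunnel side of that is just a couple of lines in its config; something like this, maybe (the config file name and log path are guesses on my part):

```
; in the stunnel config on crm-ext-01 (global section)
; "debug = info" is the same as "debug = 6"
debug = info
; log to a dedicated file instead of the shared syslog
output = /var/log/stunnel4/redis.log
```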
My gut feeling on this is that stunnel probably isn't the problem; it's probably reacting to something that's happening in Redis. One idea I have is that maybe we could get rid of stunnel? As far as I know, Redis doesn't have any encryption built in, so I'd rather not just open it up on crm-int-01 to allow crm-ext-01 to connect to it. We could also try spiped, which is what Redis recommends when you want to connect to it remotely.
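From what I can tell from its docs, the setup would look roughly like this (the port, key path, and hostname here are just illustrative):

```
# on crm-int-01: accept encrypted connections and forward them to the local Redis
spiped -d -s '[0.0.0.0]:6380' -t '[127.0.0.1]:6379' -k /etc/spiped/redis.key

# on crm-ext-01: accept local cleartext connections and forward them, encrypted, to crm-int-01
spiped -e -s '[127.0.0.1]:6379' -t 'crm-int-01.torproject.org:6380' -k /etc/spiped/redis.key
```

Both ends share the same key file, so only something holding the key can talk to Redis at all.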
so last time this happened, I did look at the logs (stunnel? or redis? both? i forgot) but didn't see anything fishy. one thing that did happen, it seems, is that the stunnel process on crm-int-01 was gone. I restarted both stunnel processes and things seem to be back in order now.
So maybe the simplest fix would be to tweak the systemd .service file for stunnel to forcibly restart the process when it exits, regardless of its exit status. I think that by default, systemd doesn't always restart services on crash, so this might fix the problem for us.
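A drop-in override along these lines should do it (the exact unit name depends on how stunnel is packaged on these hosts, so that part is a guess):

```
# /etc/systemd/system/stunnel4.service.d/restart.conf (unit name is a guess)
[Service]
# systemd defaults to Restart=no, i.e. a crashed service stays down
Restart=always
RestartSec=5s
```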
What I suspect happened is that we rebooted boxes for security upgrades recently. Maybe that threw the stunnels out of whack... I don't know. We have more security reboots to perform soon, so I'll hold off on deploying a fix here to see if the reboot causes the problem.
spiped looks interesting (thanks! didn't know about that one before!) but I'm not sure it's the right solution now, because it only introduces another "unknown" with similar properties to stunnel. if I were to fix this another way, I would create an IPsec tunnel between the two machines, something we already have code to deploy automatically (as opposed to spiped, which isn't used anywhere in TPA yet).
FWIW, I've rebooted everything today -- including the two CRM-* hosts -- and everything still seems to be up. Now that might be because the two hosts were rebooted at around the same time and they just happened to reset correctly, but that's one data point.
I'll look into this again next week, especially at your recommendations (thanks!).
restarting stunnel on ext-01 does nothing. restarting stunnel on int-01 fixes it. therefore, there might be something wrong on the int-01 side, some fd leak maybe?
You know, I like the idea of trying IPsec if that's not too hard for you guys to set up. Because it's lower level, it probably won't have connection issues if there are temporary problems with Redis.
I also think you should up the debugging level for stunnel on crm-int-01. I think you did it on crm-ext-01 last time, but now we think it's crm-int-01 that has the problem.
I've been emailing with Sarah about this issue, and we're pretty worried about the impact these outages have on newsletter subscriptions, because we don't have a good way to recover those. There's also a cost to entering all the missed donations manually. So I think trying IPsec might be worth it.
we're adding another check to our infra to detect this better, as a stopgap while we figure out the best way forward; hopefully we'll be able to repair this more quickly the next time it breaks. sorry for the delays, we know how important this is! :)
It's because we're using Redis as a message broker as well as a key/value store. I was surprised the first time I heard of this, too, but tons of people use Redis to pass messages between apps. It has a list data structure that turns out to be really good for message queues. It's nice because it's really lightweight compared to the other message brokers which makes it good for this kind of small application.
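The pattern is basically just this (the queue name and payload here are made up for illustration):

```
# one side pushes a message onto a list...
redis-cli LPUSH donation_queue '{"type": "donation", "amount": 10}'

# ...and the other side blocks until something shows up, then pops it off
redis-cli BRPOP donation_queue 0
```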
It looks like maybe stunnel went down again between about 3AM and 5AM PDT today. I'm guessing your new detection thing means you got it back up faster? If so, awesome! Anything in the stunnel logs on crm-int-01?
so the two CRM instances talk to each other? i know how redis can be used as a message cache, i was just wondering why we were passing messages between the two instances if we wanted those to be separate in the first place.
and yes, weasel found out about a crash and quickly fixed it during the night (~PDT). we haven't found the cause of the crash this time around either and nothing significant showed up in the logs.
The two instances talk to each other through Redis. crm-int-01 is running CiviCRM on top of Drupal, which stores all of Tor's donation information. Normally you set up CiviCRM on a public web server so that you can use its built-in public pages to accept donations, join mailing lists, etc. We don't really trust CiviCRM's security, so we didn't want it on a public server, but we still need a way to get the donation information into the database. Having the public donation site pass donation information to the internal site through Redis massively shrinks the usual CiviCRM security footprint.
I think if we didn't gather any information from this crash, we either need to come up with some new way to gather it or try one of the other options like IPsec or spiped.
an ipsec tunnel is now present between the two hosts. any time one host reaches the other, traffic is encrypted over the wire. so you can now connect directly to the Redis host safely, without going through stunnel. this will require a configuration change on your side, of course, but that should be easy enough to perform.
note that if the tunnel fails (i.e. it doesn't start at all, or is stopped by a malicious actor), redis will talk to the other host in cleartext. we thought of some options to work around that problem, like creating an RFC1918 IP address just for this purpose, but I figured I would first try this with you to see if it works.
we should have monitoring on the tunnels to make sure they don't go down, so at least failures should be monitored (i'll double-check that).
thank you for your patience...
(we're in the middle of a mailing now, so i'd recommend waiting a little bit before making those changes... ;) i figured backend changes like this weren't a problem because they are inactive as long as you don't apply the changes on your side...)
Thanks for jumping on this so quickly. I'm not super familiar with ipsec, but I suspect that in order to get this to work we'll need to have redis on crm-int-01 listen on 0.0.0.0 or the inet6 address instead of localhost? Currently it's only bound to localhost. I definitely can't telnet to crm-int-01 6379 from crm-ext-01 right now.
Maybe we'll also want to set up an iptables rule or something so that only crm-ext-01 can connect to that port?
as it turns out, ipsec tunnels are not directly monitored, so we do need to have some stronger guarantees here.
i added the following records into DNS:

```
crm-int-01-priv.torproject.org. 3600 IN A 172.30.136.1
crm-ext-01-priv.torproject.org. 3600 IN A 172.30.136.2
```
which point to the two CRM instances, but on an internal network that shouldn't travel outside of the VPN.
you should therefore point Redis on crm-ext-01 at crm-int-01-priv.torproject.org (or plain crm-int-01-priv) instead of localhost.
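you can double-check that those resolve with something like:

```
dig +short crm-int-01-priv.torproject.org   # should return 172.30.136.1
dig +short crm-ext-01-priv.torproject.org   # should return 172.30.136.2
```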
and yes, i had to reconfigure redis to listen on the new network.
we DEFINITELY do not want to make redis listen on all interfaces (0.0.0.0). even if we have a firewall in place, that could possibly be "very bad" if the firewall fails.
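for reference, the relevant change on crm-int-01 boils down to something like this in redis.conf (the exact path and any other addresses on that line are guesses):

```
# /etc/redis/redis.conf (path is a guess)
# listen on loopback plus the IPsec-only address -- never 0.0.0.0
bind 127.0.0.1 172.30.136.1
```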
the tunnel works, in my tests:
```
crm-ext-01# echo PING | nc -s 172.30.136.2 -v -w 1 172.30.136.1 6379
Connection to 172.30.136.1 6379 port [tcp/*] succeeded!
+PONG
```
note that i had to pick a specific source address (-s 172.30.136.2) so that the firewall rules match (because we allow only traffic from the tunnel). unfortunately, "outgoing" connections go out with a source IP of 138.201.212.236 by default, so you would have to tweak that. no idea if it would be possible to do so in Drupal/CiviCRM/PHP or whatever is going on up there. :)
as a worst case scenario (if the above fails for you), we might consent to punching a hole for 138.201.212.236 as well, but i'd rather avoid such hackery for now.
we'll also research if there are better ways to do firewalling with ipsec, if you can't figure out the source IP stuff...
last update: i figured out how to craft a firewall rule that restricts traffic to connections originating from the IPsec network, so you should be all good to go, without having to hack around your source IP.
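the trick is iptables' IPsec policy match; the idea is roughly this (the rules actually deployed may differ a bit):

```
# accept Redis traffic only if it arrived through an IPsec security association
iptables -A INPUT -p tcp --dport 6379 -m policy --dir in --pol ipsec -j ACCEPT
iptables -A INPUT -p tcp --dport 6379 -j DROP
```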
you're now all clear to go and use the crm-int-01-priv.torproject.org redis endpoint on crm-ext-01. when you have done the switch, let me know and i'll tear down the old stunnel configuration.
Trac: Owner: tpa to peterh Status: needs_information to assigned
Awesome! I switched staging and the live site to use crm-int-01-priv.torproject.org. Right after you tear down stunnel, could you go to https://donate.torproject.org/subscribe and subscribe to the newsletter? That will send you an email, and when you click on the link it will subscribe you. If you get to the Subscription Confirmed (/subscribed) page, that means all the Redis stuff is working.
If it's not working you'll need to put the stunnel back up.
i ripped out the stunnel stuff, and things still seem to be working. so I guess we're good here - if the problem returns, don't hesitate to (re)open this bug report (or a fresh new one).
Trac: Resolution: N/A to fixed Status: assigned to closed