dgoulet/gk/somebody, can you get us a set of currently running relays and contactinfos, maybe sorted like last time, so we can go through and contact these relays asap?
Nothing is broken on these relays, yet. But you're right, we should encourage them to upgrade. There are still ~650 relays on 0.2.9, and maybe 100 on 0.4.0.
fix a comment to explain what happened to 0.3.6.0-alpha-dev
add tests for current releases
work out if we want to reject 0.4.0 alphas and rcs. To do that, we need to check how many there are in the network.
Rebase on 0.4.1, because we might backport this branch to 0.4.1
Once we've made these changes, we should email all the affected operators, and then test this patch on moria1. After we've done both those things, we can merge.
Rebasing this branch over 0.4.1 doesn't make sense, as the tests are missing on that branch. However, I have a secondary PR for 0.4.1.
About blocking 0.4.1 alphas/rc, I would be against it unless there is a crippling bug on the 0.4.1 alphas/rc releases. For instance, 0.2.9.5-rc is allowed on the network because it "works" (may not be the most secure, but technically works), whereas 0.2.9.4 and below have a consensus bug.
Yes, we decided to keep 0.2.9.5-rc, when we made a decision about 0.2.9.
I don't know if two PRs will work for you, but here it is.
We normally do backports with multiple PRs. Because of the way that git does merges, the best way to get a good merge is to do the common changes in one commit, and the extra changes for later versions in other commits. You won't need to do that here: this patch is small enough that we can resolve any merge conflicts ourselves.
To move forward with this ticket, we need to:
find out how many relays are on unstable 0.4.0 versions
Changing title to reduce the odds of an accidental early merge.
Trac: Summary: Reject 0.2.9 and 0.4.0 in dirserv_rejects_tor_version() to Reject 0.2.9 and 0.4.0 in dirserv_rejects_tor_version() [DO NOT MERGE BEFORE 2020]
The merge should happen on or after 2 February 2020.
Trac: Summary: Reject 0.2.9 and 0.4.0 in dirserv_rejects_tor_version() [DO NOT MERGE BEFORE 2020] to Reject 0.2.9 and 0.4.0 in dirserv_rejects_tor_version() [DO NOT MERGE BEFORE FEB 2020]
0.4.4 is open now, and 0.2.9 and 0.4.0 are no longer recommended versions (#33130 (moved), about 2 weeks ago).
Do you think it's time to merge this change into master?
Trac: Summary: Reject 0.2.9 and 0.4.0 in dirserv_rejects_tor_version() [DO NOT MERGE BEFORE FEB 2020] to Reject 0.2.9 and 0.4.0 in dirserv_rejects_tor_version()
I'd like sign-off from arma and/or gk (in his role as network health lead) before going ahead with this: arma usually likes to double-check how much of the network we're about to lose.
We should look a bit at the data first and try to reach (again) affected operators.
nickm: assuming we want to have this in 0.4.4, what is the latest date we need to make a decision here (not taking into account that the new 0.4.4/older versions with a backported patch need to get released and deployed first)? (That is: how much time do we have left to think about the potential impact on relay bandwidth/diversity etc. and try different means to reach affected operators?)
Attached is the output of a new script run compared to what dgoulet did in comment:16 about two months ago. Relays are included if they are running 0.2.9.x or 0.4.0.x or 0.3.y (y != 5) AND have some kind of ContactInfo. (It seems the entries to consider went down from about 680 to around 400 compared to the previous run, which is great)
I am working on a next iteration of the output that removes all the duplicated ContactInfo entries so we get a better list of whom to contact and an understanding of how large that contact effort would be and where the low-hanging fruits were (depending on how we would like to proceed).
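The selection rule described above (0.2.9.x, 0.4.0.x, or 0.3.y with y != 5, AND a non-empty ContactInfo) could be sketched roughly as follows. This is not the script that produced the attachments. The relay records are assumed to be plain dicts with hypothetical "version" and "contact" keys, and the names is_obsolete and relays_to_contact are made up for illustration.

```python
import re

# Matches the leading "major.minor.micro" of a Tor version string such as
# "0.2.9.17" or "0.4.0.5-rc".
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)")


def is_obsolete(version):
    """Return True for 0.2.9.x, 0.4.0.x, and 0.3.y with y != 5."""
    match = _VERSION_RE.match(version)
    if match is None:
        # Unparseable version strings are left out of the contact list.
        return False
    series = tuple(int(part) for part in match.groups())
    if series in ((0, 2, 9), (0, 4, 0)):
        return True
    return series[:2] == (0, 3) and series[2] != 5


def relays_to_contact(relays):
    """Keep obsolete relays that have a non-empty ContactInfo.

    Each relay is assumed to be a dict with hypothetical "version" and
    "contact" keys; the real data source may be shaped differently.
    """
    return [
        relay for relay in relays
        if is_obsolete(relay["version"])
        and (relay.get("contact") or "").strip()
    ]
```

Matching on the first three version components keeps alphas and rcs of an obsolete series (for example 0.4.0.5-rc) in the list as well.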
Meanwhile, one thing to think about is whether we want to treat 0.2.9 and 0.4.0 differently, and what the criteria for treating them differently could be. That is, is there any argument for, say, keeping 0.4.0 around for a bit longer while rejecting 0.2.9 asap (or vice versa)?
The last release of 0.2.9 was 13 months ago. Even before then, we didn't backport all our fixes to 0.2.9. And there have been no new features in 0.2.9 for 3-4 years.
0.4.0's last release was a month or two ago, it's reasonably up-to-date.
If we patch any security issues, we won't patch 0.2.9 or 0.4.0. If we decide that a security fix is required, we might need to reject them straight after the release of that fix. We don't really control the timing of security fixes.
This ticket is a directory authority change. We don't support directory authorities older than the last two stable releases. That's 0.4.1 right now, and it will be 0.4.2 by the time 0.4.4 is in feature freeze.
Okay, attached is a new version that is down to essentially 276 entries (we know the situation of the smell relays and I got the DFRI folks to upgrade their relays meanwhile, in addition to an operator who was previously on 0.4.0.5 and is still there in my previous attachment), weighted by bw (I added the bw of all relays per ContactInfo in case there is more than one relay but did not bother to show all fingerprints involved).
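As a rough illustration of the per-ContactInfo weighting described above, the bandwidth of all relays sharing a ContactInfo can be summed and the operators sorted by that total. This is only a sketch under assumptions: the "contact" and "bandwidth" keys and the function name bandwidth_by_contact are hypothetical, and the attached file may have been produced differently.

```python
from collections import defaultdict


def bandwidth_by_contact(relays):
    """Sum bandwidth per ContactInfo and sort operators by total, descending.

    Each relay is assumed to be a dict with hypothetical "contact" and
    "bandwidth" keys.
    """
    totals = defaultdict(int)
    for relay in relays:
        totals[relay["contact"]] += relay["bandwidth"]
    # Largest operators first; ties are broken by contact string so the
    # ordering is stable between runs.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

Working from the top of such a list reaches the most bandwidth per email sent, which is one way to decide whom to contact first.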
We could now think about grepping for the fingerprints, taking a random number of entries from the top, and contacting those operators. I am waiting for ggus here so we can coordinate.
Trac: Cc: dgoulet, gk, neel, arma to dgoulet, gk, neel, arma, ggus
nickm: assuming we want to have this in 0.4.4, what is the latest date we need to make a decision here (not taking into account that the new 0.4.4/older versions with a backported patch need to get released and deployed first)? (That is: how much time do we have left to think about the potential impact on relay bandwidth/diversity etc. and try different means to reach affected operators?)
Our feature freeze date for 0.4.4 is May 15, but I would like to have these versions off the network sooner than that if we can.
I think we should aim to contact the affected relay operators soon, and measure what effect that has. If it helps, we can try doing it more -- but it may be that we don't see much effect, and the right thing to do is just to disable these versions.
Teor notes:
If we patch any security issues, we won't patch 0.2.9 or 0.4.0. If we decide that a security fix is required, we might need to reject them straight after the release of that fix. We don't really control the timing of security fixes.
Right, and the kind of security bug that we run into is important. If (heaven forbid) we find an RCE issue, or a memory exposure issue, we'll need everybody to upgrade asap, with no delays, and no excuses. If we run into a remote crash or CPU DoS issue, then we still want everybody to upgrade, since the issue would have potential to make traffic analysis easier, but it wouldn't be under as much time pressure as a critical-severity issue would be.
So, looking at historical migration curves for older deprecated releases, it seems like if we did anything that had a major effect in making people upgrade, the effect was fairly abrupt -- like, within the space of a week. So this would imply that we don't need to do a long experiment here: ideally, we send out a large first batch of emails, and see whether they have a measurable effect within a week or two. If they do, we can email everybody else and wait another week or two, then deploy this patch.
(I'd suggest maybe doing some kind of a randomized trial here, if you have the time and energy.)
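One way such a randomized trial could be set up, purely as a sketch: shuffle the deduplicated contacts with a fixed seed, email only the first group, and compare upgrade rates between the two groups after a week or two. The function names, the seed parameter, and the 50/50 split are assumptions for illustration, not something decided on this ticket.

```python
import random


def split_for_trial(contacts, seed, treatment_fraction=0.5):
    """Randomly split unique contacts into (emailed, not_emailed) groups."""
    unique = sorted(set(contacts))
    # A seeded generator makes the assignment reproducible, so the groups
    # can be reconstructed later when measuring the effect.
    rng = random.Random(seed)
    rng.shuffle(unique)
    cutoff = round(len(unique) * treatment_fraction)
    return unique[:cutoff], unique[cutoff:]


def upgrade_rate(group, upgraded):
    """Fraction of contacts in group that appear in the upgraded set."""
    if not group:
        return 0.0
    return sum(1 for contact in group if contact in upgraded) / len(group)
```

If the emailed group shows a clearly higher upgrade rate than the other group, that suggests the emails themselves are having an effect, rather than operators upgrading on their own.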
We need to check if this patch affects bridges. As far as I recall, the last patch affected bridges as well as relays.
It does. This is called around an authdir_mode() check and thus affects both Bridge and Directory authorities.
It is called basically when we load the fingerprints from the approved router list and when we consider descriptors to vote on. Both are looking at AuthoritativeDir.
Hi, yesterday I contacted organizations and relay operators from Brazil, Chile, Mexico, Costa Rica. I sent localized email in Spanish and Portuguese, some people replied and upgraded their relays.
GeKo, we could select relay operators from two or three countries, and contact them today (Friday) or Wednesday. I'll be offline next Monday and Tuesday, so I can't follow up if they have questions.
And here's Roger's email that we can use:
Hi,

You are running a Tor relay, which is great:
http://rougmnvswfsmd4dq.onion/rs.html#details/$fingerprint

But that Tor version is obsolete, and because of old bugs, we will soon
cut relays running those versions out of the network. Please consider
upgrading!

You can find Tor packages and instructions for your distro / OS here:
https://community.torproject.org/relay/setup/guard/

Ideally you will switch to keeping up with our stable releases, but if
you need a stable that is especially stable, the Tor 0.3.5 branch will
be maintained until Feb 2022:
https://trac.torproject.org/projects/tor/wiki/org/teams/NetworkTeam/CoreTorReleases#Current
and you can see the lifetimes of other Tor versions on that table too.

Let us know if we can do anything to make the process easier.

And lastly, I am cc'ing the new network health mailing list (which
has public archives), to help us stay synced:
https://trac.torproject.org/projects/tor/wiki/org/teams/NetworkHealthTeam

Thanks!
--
$name
Alright, I attached the file containing all the relay operators with relays running obsolete Tor versions we want to reject (that's from last week). Additionally, I attached another file showing the remaining operators on those versions, a bit more than a week after the email campaign.
We can see that the number of operators dropped by about a third, which is encouraging, even though I have not yet made a detailed analysis of whether the operators actually migrated to newer versions or shut their relays down or... Interestingly, though, the sum of the bandwidth shown for the relays in both files rose. I have not looked into why this happened.
Some explanations to the annotations I made on the file I used for sending the emails:
"[x]" means I sent an email successfully.
"[xf]" means I sent an email but I got an error back.
"[xfam]" means I sent an email and asked to set MyFamily properly while I was at it.
"[f due to $]" means I did not send an email due to $.
Additionally, you'll see that I changed my strategy for sending mails to the operators during the whole process: First, I sent the email and Cc'ed the network-health list without any Reply-To header. Then I did the same with the Reply-To header set to the mailing list. Then I only Cc'ed the list, with the proper Reply-To header, if the email address was not obfuscated. And then, due to complaints by folks subscribed to the list, I sent the last batch of mails without Cc'ing the network-health list.
I added another file showing the relays that were in the consensus I used for emailing folks but, as of earlier today, are no longer on 0.2.9.x or 0.4.0.x. The quick takeaway is that they dropped off my list because they are down (that is, they have not upgraded and relay search shows them as down, OR they are no longer visible on relay search at all) (26.5%), have upgraded to 0.3.5.x (24%), or have upgraded to 0.4.x (49.4%).
I have not looked at whether the folks now on LTS deliberately upgraded to LTS or whether they simply took the latest stable Tor version available on their platform.
Additionally, I found that a bunch of relays (10) were in the group of relays running 0.2.9.x or 0.4.0.x in today's consensus but were not present in the one from last week. I mailed 8/10 of them and asked for an upgrade (2/10 did not have a usable ContactInfo set).