A relay operator reports their relay submitted a descriptor with a 0 DirPort after a version upgrade, then 62 seconds later submitted a descriptor with the configured DirPort.
Unfortunately, this was right on the consensus deadline, so they had one consensus where their DirPort was 0.
The above are all relays that were actually on the 12/11 list and did not appear on the 01/12 list.
Also two examples where updateFallbackDirs.py reported a port change as the first encountered reason for
exclusion on 01/12 though these relays were excluded on 12/11 for a different reason:
It would be nice to backport any fix to 0.2.6 and 0.2.7, so this issue doesn't happen each time we release a new version of Tor. But we usually reserve backports for security or serious functionality issues, so I don't know if a backport will happen.
Looked briefly at local logs and the logic. Current design appears to always publish a descriptor with DirPort=0 during boot. run_scheduled_events() publishes a changed descriptor once every 60 seconds, and the typical boot sequence runs the DirPort reachability at slightly over 60 seconds from start.
:16:06 Tor[5806]: Self-testing indicates your ORPort is reachable from the outside. Excellent. Publishing server descriptor.:16:07 Tor[5806]: Bootstrapped 85%: Finishing handshake with first hop:16:07 Tor[5806]: Bootstrapped 90%: Establishing a Tor circuit:16:08 Tor[5806]: Tor has successfully opened a circuit. Looks like client functionality is working.:16:08 Tor[5806]: Bootstrapped 100%: Done:17:09 Tor[5806]: Self-testing indicates your DirPort is reachable from the outside. Excellent.
When this happens just before the consensus vote time at minute 50 of each hour, the race condition identified by Teor occurs. Reexamined half of the above events and in all cases problem descriptor was published at 46-49 minutes.
Perhaps consider_publishable_server() should be adjusted to delay the descriptor until the DirPort test is complete.
Looked briefly at local logs and the logic. Current design appears to always publish a descriptor with DirPort=0 during boot. run_scheduled_events() publishes a changed descriptor once every 60 seconds, and the typical boot sequence runs the DirPort reachability at slightly over 60 seconds from start.
...
When this happens just before the consensus vote time at minute 50 of each hour, the race condition identified by Teor occurs. Reexamined half of the above events and in all cases problem descriptor was published at 46-49 minutes.
Perhaps consider_publishable_server() should be adjusted to delay the descriptor until the DirPort test is complete.
I agree.
This sounds somewhat similar to #17782 (moved). When its address changes, tor doesn't test ORPort reachability. So it will go ahead and publish a descriptor with the wrong address, as long as an old address was reachable at some point.
We should make tor wait for ORPort and DirPort reachability every time:
it starts up,
the config changes address, ORPort, or DirPort.
There's a drawback here, which is that tor won't ever publish a descriptor if only the ORPort is reachable (perhaps due to a broken firewall config). I think we should have a timeout after which tor warns, then publishes the descriptor without the DirPort. Given the time ranges we're seeing, the timeout should be at least 20 minutes.
(Relays with no DirPort will still be used for directory requests once #12538 (moved) is merged.)
Jun 12 17:49:17 Tor: Parsing GEOIP IPv4 file ./geoip.Jun 12 17:49:17 Tor: Configured to measure statistics. Look for the *-stats files that will first be written to the data directory in 24 hours from now.Jun 12 17:49:17 Tor: Your Tor server's identity key fingerprint is 'Binnacle 4F0DB7E687FC7C0AE55C8F243DA8B0EB27FBF1F2'Jun 12 17:49:21 Tor: We now have enough directory information to build circuits.Jun 12 17:49:21 Tor: Bootstrapped 80%: Connecting to the Tor network.Jun 12 17:49:21 Tor: Self-testing indicates your ORPort is reachable from the outside. Excellent. Publishing server descriptor.Jun 12 17:49:22 Tor: Bootstrapped 85%: Finishing handshake with first hop.Jun 12 17:49:22 Tor: Bootstrapped 90%: Establishing a Tor circuit.Jun 12 17:49:23 Tor: Tor has successfully opened a circuit. Looks like client functionality is working.Jun 12 17:49:23 Tor: Bootstrapped 100%: Done.Jun 12 17:50:24 Tor: Self-testing indicates your DirPort is reachable from the outside. Excellent.Jun 12 17:51:25 Tor: Performing bandwidth self-test...done.
In this case DirPort=0 was published just before minute 50 and and DirPort=80 just after, but in the case of kitten both appear to have been pushed just before minute 50:
However I suspect the system clock on that relay's server is skewed and so the actual descriptor pushes were before and after the consensus vote event.
Searched the logs and found two incidents when the relay was started in minute 48 of the hour. Checked the consensus history and the DirPort=0 glitch did not appear either time.
So untill this issue is fixed the advice for anyone wishing to avoid having their relay knocked off the Fallback Directory list is to not start the daemon in minute 49 of the hour and probably avoid minute 48 for good measure. Highly advisable to run ntpd on the server and have accurate time.
However I suspect the system clock on that relay's server is skewed and so the actual descriptor pushes were before and after the consensus vote event.
I’m the owner of kitten1 and it’s NTP-sync’ed
I just check it clock, all seems good.
The commit merges cleanly into maint-0.2.7 and master.
The unit tests and make test-network[-all] pass on maint-0.2.6, maint-0.2.7, and master.
It also passes src/test/test-network.sh --sleep 600 on master with AssumeReachable 0 (chutney usually has AssumeReachable 1, which disables reachability checking).
The fix is a one-line change to decide_if_publishable_server() to restore the original intent of a1f1fa6ab in 0.1.0.1-rc. From that commit message:
Try to pull down a directory via Tor to see if our DirPort is working.Try to extend a circuit back to us to see if our ORPort is working.Only publish a descriptor if they're both reachable.
It then makes consequential changes to log messages and the function comment to make the new behaviour clear.
If we decide to backport this, we don't need to do a release for this fix. It just needs to be included in the next minor release in each series, to avoid relay version upgrades triggering this bug.
Trac: Keywords: 026-maybe-backport, 027-maybe-backport deleted, 026-backport, 027-backport added Status: new to needs_review
I vote against backport to 0.2.6 -- mainly because there will likely never be another 0.2.6 release, and if there is, approximately nobody will update to it.
I spoke to teor and he agreed that close is fine.
Trac: Resolution: N/Ato fixed Keywords: 026-backport deleted, N/Aadded Milestone: Tor: 0.2.6.x-final to Tor: 0.2.7.x-final Reviewer: N/AtoN/A Status: needs_review to closed