Relay submitted a descriptor with 0 DirPort due to a self-test race condition

changed milestone to %Tor: 0.2.7.x-final

added 027-backport TorCoreTeam201605 component::core tor/tor milestone::Tor: 0.2.7.x-final owner::teor priority::medium resolution::fixed severity::normal status::closed type::defect version::tor 0.2.6.10 labels

Complete list of impacted relays:

BF0FB582E37F738CD33C3651125F2772705BB8E8 12-28:17 quadhead
86E78DD3720C78DA8673182EF96C54B162CD660C 12-13:11 kitten1
6DE61A6F72C1E5418A66BFED80DFB63E4C77668F 12-19:11 eriador
39F096961ED2576975C866D450373A9913AFDC92 12-28:06 metaether
92CFD9565B24646CAC2D172D3DB503D69E777B8A 12-16:14 bakunin
6E7CB6E783C1B67B79D0EBBE7D48BC09BD441201 12-14:07 Paris
9E0CBB6958CE61DC631E8660963EB644BE92B256 12-12:04 freshman
D74ABE34845190E242EC74BA28B8C89B0A480D4B 12-14:07 Helen
14E3111C54BB532FB39AA3C1367D0970DA2937D5 12-18:17 idfTor2
78BBAC4B66067E8D25AA78852257C08D2E7EB357 12-18:09 grothendieck
EC639EDAA5121B47DBDF3D6B01A22E48A8CB6CC7 12-24:08 rainbowdash

The above are all relays that were actually on the 12/11 list and did not appear on the 01/12 list.

Also two examples where updateFallbackDirs.py reported a port change as the first encountered reason for exclusion on 01/12 though these relays were excluded on 12/11 for a different reason:

C43FA6474A9F071E9120DF63ED6EB8FDBA105234 12-25:03 ArachnideFR5
5665A3904C89E22E971305EE8C1997BCA4123C69 11-15:18 GermanCraft

Trac:
Cc: N/A to starlight@binnacle.cx

It would be nice to backport any fix to 0.2.6 and 0.2.7, so this issue doesn't happen each time we release a new version of Tor. But we usually reserve backports for security or serious functionality issues, so I don't know if a backport will happen.

Trac:
Version: N/A to Tor: 0.2.6.10
Keywords: N/A deleted, 026-maybe-backport, 027-maybe-backport added

Looked briefly at local logs and the logic. Current design appears to always publish a descriptor with DirPort=0 during boot. run_scheduled_events() publishes a changed descriptor once every 60 seconds, and the typical boot sequence runs the DirPort reachability at slightly over 60 seconds from start.

:16:06 Tor[5806]: Self-testing indicates your ORPort is reachable from the outside. Excellent. Publishing server descriptor.
:16:07 Tor[5806]: Bootstrapped 85%: Finishing handshake with first hop
:16:07 Tor[5806]: Bootstrapped 90%: Establishing a Tor circuit
:16:08 Tor[5806]: Tor has successfully opened a circuit. Looks like client functionality is working.
:16:08 Tor[5806]: Bootstrapped 100%: Done
:17:09 Tor[5806]: Self-testing indicates your DirPort is reachable from the outside. Excellent.

When this happens just before the consensus vote time at minute 50 of each hour, the race condition identified by Teor occurs. Reexamined half of the above events and in all cases problem descriptor was published at 46-49 minutes.

Perhaps consider_publishable_server() should be adjusted to delay the descriptor until the DirPort test is complete.

Replying to starlight:

Looked briefly at local logs and the logic. Current design appears to always publish a descriptor with DirPort=0 during boot. run_scheduled_events() publishes a changed descriptor once every 60 seconds, and the typical boot sequence runs the DirPort reachability at slightly over 60 seconds from start. ... When this happens just before the consensus vote time at minute 50 of each hour, the race condition identified by Teor occurs. Reexamined half of the above events and in all cases problem descriptor was published at 46-49 minutes.

Perhaps consider_publishable_server() should be adjusted to delay the descriptor until the DirPort test is complete.

I agree.

This sounds somewhat similar to #17782 (moved). When its address changes, tor doesn't test ORPort reachability. So it will go ahead and publish a descriptor with the wrong address, as long as an old address was reachable at some point.

We should make tor wait for ORPort and DirPort reachability every time:

it starts up,
the config changes address, ORPort, or DirPort.

There's a drawback here, which is that tor won't ever publish a descriptor if only the ORPort is reachable (perhaps due to a broken firewall config). I think we should have a timeout after which tor warns, then publishes the descriptor without the DirPort. Given the time ranges we're seeing, the timeout should be at least 20 minutes.

(Relays with no DirPort will still be used for directory requests once #12538 (moved) is merged.)

Just remembered noticing earlier a strange last address-change date/time for my relay. On looking I see it was affected by this glitch six months ago:

last_changed_address_or_port:2015-06-12 18:00:00

Have log entries for it:

06-12-17-cons:r Binnacle <fp> aWzbTbPNq8OoPHFtMnxMTVOHDuY 15:11:13 108.53.208.157 443 80
06-12-18-cons:r Binnacle <fp> dIaQvTy5UjVYtdRVa2xPDSwhpAw 17:49:21 108.53.208.157 443 0
06-12-19-cons:r Binnacle <fp> RoqngB0ium+unPem09nCwnzBqI4 18:32:44 108.53.208.157 443 80

Jun 12 17:49:17 Tor: Parsing GEOIP IPv4 file ./geoip.
Jun 12 17:49:17 Tor: Configured to measure statistics. Look for the *-stats files that will first be written to the data directory in 24 hours from now.
Jun 12 17:49:17 Tor: Your Tor server's identity key fingerprint is 'Binnacle 4F0DB7E687FC7C0AE55C8F243DA8B0EB27FBF1F2'
Jun 12 17:49:21 Tor: We now have enough directory information to build circuits.
Jun 12 17:49:21 Tor: Bootstrapped 80%: Connecting to the Tor network.
Jun 12 17:49:21 Tor: Self-testing indicates your ORPort is reachable from the outside. Excellent. Publishing server descriptor.
Jun 12 17:49:22 Tor: Bootstrapped 85%: Finishing handshake with first hop.
Jun 12 17:49:22 Tor: Bootstrapped 90%: Establishing a Tor circuit.
Jun 12 17:49:23 Tor: Tor has successfully opened a circuit. Looks like client functionality is working.
Jun 12 17:49:23 Tor: Bootstrapped 100%: Done.
Jun 12 17:50:24 Tor: Self-testing indicates your DirPort is reachable from the outside. Excellent.
Jun 12 17:51:25 Tor: Performing bandwidth self-test...done.

In this case DirPort=0 was published just before minute 50 and and DirPort=80 just after, but in the case of kitten both appear to have been pushed just before minute 50:

12-13-09-cons:r kitten1 <fp> vG0ZWLi31UXoqz4H2H/hweSvxqo 04:44:19 62.210.124.124 9001 9030
12-13-10-cons:r kitten1 <fp> vG0ZWLi31UXoqz4H2H/hweSvxqo 04:44:19 62.210.124.124 9001 9030
12-13-11-cons:r kitten1 <fp> yUnPgGOwhq5QfUEe1Eg0zKCU+w0 10:48:46 62.210.124.124 9001 0
12-13-12-cons:r kitten1 <fp> JrZF6+VnnYZuQHRllufg53dbjH8 10:49:48 62.210.124.124 9001 9030
12-13-13-cons:r kitten1 <fp> JrZF6+VnnYZuQHRllufg53dbjH8 10:49:48 62.210.124.124 9001 9030

However I suspect the system clock on that relay's server is skewed and so the actual descriptor pushes were before and after the consensus vote event.

Searched the logs and found two incidents when the relay was started in minute 48 of the hour. Checked the consensus history and the DirPort=0 glitch did not appear either time.

So untill this issue is fixed the advice for anyone wishing to avoid having their relay knocked off the Fallback Directory list is to not start the daemon in minute 49 of the hour and probably avoid minute 48 for good measure. Highly advisable to run ntpd on the server and have accurate time.

Replying to starlight:

However I suspect the system clock on that relay's server is skewed and so the actual descriptor pushes were before and after the consensus vote event.

I’m the owner of kitten1 and it’s NTP-sync’ed I just check it clock, all seems good.

Trac:
Username: aeris

Please see my branch bug18050 based on maint-0.2.6 on https://github.com/teor2345/tor.git

The commit merges cleanly into maint-0.2.7 and master. The unit tests and make test-network[-all] pass on maint-0.2.6, maint-0.2.7, and master. It also passes src/test/test-network.sh --sleep 600 on master with AssumeReachable 0 (chutney usually has AssumeReachable 1, which disables reachability checking).

The fix is a one-line change to decide_if_publishable_server() to restore the original intent of a1f1fa6ab in 0.1.0.1-rc. From that commit message:

Try to pull down a directory via Tor to see if our DirPort is working.
Try to extend a circuit back to us to see if our ORPort is working.
Only publish a descriptor if they're both reachable.

It then makes consequential changes to log messages and the function comment to make the new behaviour clear.

If we decide to backport this, we don't need to do a release for this fix. It just needs to be included in the next minor release in each series, to avoid relay version upgrades triggering this bug.

Trac:
Keywords: 026-maybe-backport, 027-maybe-backport deleted, 026-backport, 027-backport added
Status: new to needs_review

Applied to 0.2.7 and later; considering for 0.2.6 backport.

Trac:
Milestone: Tor: 0.2.8.x-final to Tor: 0.2.6.x-final

I vote against backport to 0.2.6 -- mainly because there will likely never be another 0.2.6 release, and if there is, approximately nobody will update to it.

I spoke to teor and he agreed that close is fine.

Trac:
Resolution: N/A to fixed
Keywords: 026-backport deleted, N/A added
Milestone: Tor: 0.2.6.x-final to Tor: 0.2.7.x-final
Reviewer: N/A to N/A
Status: needs_review to closed

Trac:
Status: closed to reopened
Resolution: fixed to N/A

Trac:
Owner: N/A to teor
Status: reopened to assigned

Trac:
Status: assigned to closed
Resolution: N/A to fixed
Keywords: 027-backport deleted, 027-backport TorCoreTeam201605 added

closed

mentioned in issue #18086 (moved)

mentioned in issue #18616 (moved)

mentioned in issue #18623 (moved)

mentioned in issue #18812 (moved)

Relay submitted a descriptor with 0 DirPort due to a self-test race condition

Child items ...

Activity