Opened 6 months ago

Last modified 5 weeks ago

#25685 new defect

Tor relays publish a new descriptor but authorities drop it because they think it's only cosmetically different, and then the relay waits 18 more hours to publish, thus falling out of the consensus

Reported by: arma Owned by:
Priority: Medium Milestone: Tor: 0.3.6.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: 034-roadmap-proposed
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

We have a design flaw, or at least an impedance mismatch, in our descriptor publishing algorithm.

Relays publish a new descriptor when they think something has sufficiently changed (e.g. bandwidth, IP address, or exit policy) or when 18 hours have passed.
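
To make the relay-side rule concrete, here is a minimal sketch of it. Only the 18-hour interval comes from this ticket. The struct, the function names, and the factor-of-two bandwidth threshold are hypothetical and are not Tor's actual code.

```c
#include <stdbool.h>
#include <time.h>

/* The 18-hour forced republish interval described above. */
#define REPUBLISH_INTERVAL_SECS (18 * 60 * 60)

/* Hypothetical summary of what a relay compares; not Tor's real struct. */
typedef struct {
  unsigned long bandwidth;
  unsigned int ipv4_addr;
  unsigned int exit_policy_hash;
} relay_state_t;

/* Relay-side "is it different enough" check. The factor-of-two
 * bandwidth threshold is an assumption made for illustration. */
bool
relay_change_is_significant(const relay_state_t *published,
                            const relay_state_t *cur)
{
  if (published->ipv4_addr != cur->ipv4_addr)
    return true;
  if (published->exit_policy_hash != cur->exit_policy_hash)
    return true;
  if (cur->bandwidth > published->bandwidth * 2 ||
      cur->bandwidth * 2 < published->bandwidth)
    return true;
  return false;
}

/* Publish on a significant change, or once the 18-hour timer expires. */
bool
relay_should_publish(const relay_state_t *published,
                     const relay_state_t *cur,
                     time_t last_published, time_t now)
{
  if (relay_change_is_significant(published, cur))
    return true;
  return now - last_published >= REPUBLISH_INTERVAL_SECS;
}
```

Note that last_published here records when the relay uploaded, not whether any authority kept the result.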

Directory authorities accept the new descriptor when *they* think it has sufficiently changed. If they think it hasn't, they quietly drop it:

    log_info(LD_DIRSERV,
             "Not replacing descriptor from %s (source: %s); "
             "differences are cosmetic.",
             router_describe(ri), source);

The trouble comes when things get out of sync: the relay thinks it published recently so it is still early in its 18-hour timer, but the authorities discarded that descriptor. Then when the "current" descriptor (the older one the authorities still hold) becomes 24 hours old, it gets discarded, and the relay falls out of the consensus.
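
The size of the resulting gap can be worked out with simple arithmetic. The sketch below assumes the dropped upload is the relay's only upload in that period, that the next timed upload is accepted right away, and that the bandaid described below is not in play. Times are in whole hours and the function name is hypothetical.

```c
/* Both values are taken from the description above. */
#define REPUBLISH_INTERVAL_HOURS 18
#define MAX_DESCRIPTOR_AGE_HOURS 24

/* accepted_at: hour at which the authorities last accepted a descriptor.
 * dropped_at:  hour at which the relay uploaded a descriptor that the
 *              authorities dropped as cosmetic (this reset the relay's timer).
 * Returns how many hours the relay has no usable descriptor. */
int
hours_out_of_consensus(int accepted_at, int dropped_at)
{
  int expires_at = accepted_at + MAX_DESCRIPTOR_AGE_HOURS;
  int next_publish_at = dropped_at + REPUBLISH_INTERVAL_HOURS;
  int gap = next_publish_at - expires_at;
  return gap > 0 ? gap : 0;
}
```

For example, a descriptor accepted at hour 0 followed by a dropped upload at hour 17 gives hours_out_of_consensus(0, 17) == 11: the old descriptor expires at hour 24 and the next timed upload is at hour 35. Under these assumptions, any dropped upload more than 6 hours after the accepted one leaves a gap.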

I don't have stats on how frequently this out-of-sync actually happens, but it's enough to have tickets filed about it (#23638) and it's enough to have confused/sad posts from relay operators about it every month:
https://lists.torproject.org/pipermail/tor-dev/2018-March/013030.html
https://lists.torproject.org/pipermail/tor-relays/2018-March/014764.html

We deployed a bandaid in 0.2.3.4-alpha (commit 1f4b694, #3327) that makes relays look in the consensus and publish a new descriptor more aggressively if they find they're not listed. That hack is apparently needed quite often: in #21642 I said "So 426 of our ~7300 relays stayed in the consensus in the last 12.5 hours because of this hack."
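
A rough sketch of the kind of check the bandaid performs is below. It is not the code from commit 1f4b694: the function name, the enum, the 90-minute retry threshold, and the use of the 18-hour interval as the "quite old" cutoff are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <time.h>

#define REPUBLISH_INTERVAL_SECS (18 * 60 * 60)
/* Assumed retry delay when the relay is not listed at all. */
#define RETRY_WHEN_UNLISTED_SECS (90 * 60)

typedef enum {
  NO_REPUBLISH,
  REPUBLISH_TIMER,        /* ordinary 18-hour timer expired */
  REPUBLISH_NOT_LISTED,   /* relay is missing from the consensus */
  REPUBLISH_LISTED_STALE  /* consensus lists a quite old version */
} republish_reason_t;

/* listed:              is this relay in the latest consensus?
 * listed_published_on: publication time of the listed descriptor
 *                      (ignored when listed is false).
 * last_upload:         when the relay last uploaded a descriptor. */
republish_reason_t
check_republish(bool listed, time_t listed_published_on,
                time_t last_upload, time_t now)
{
  if (!listed && now - last_upload >= RETRY_WHEN_UNLISTED_SECS)
    return REPUBLISH_NOT_LISTED;
  if (listed && now - listed_published_on >= REPUBLISH_INTERVAL_SECS)
    return REPUBLISH_LISTED_STALE;
  if (now - last_upload >= REPUBLISH_INTERVAL_SECS)
    return REPUBLISH_TIMER;
  return NO_REPUBLISH;
}
```

The point of keying on the consensus rather than on the relay's own timer is that the consensus reflects what the authorities actually kept. Whether that closes every hole depends on the new upload then being accepted rather than dropped as cosmetic again.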

But I think we haven't actually explored whether the bandaid helps all of the relays stay in the consensus all of the time, or if there are still "holes" in it that mean some relays fall out sometimes. The reports above make me think that yes there are still holes.

Potential ways forward:

  • Match up the descriptor upload timings, as seen by a dir auth, with the appearance of relays in the consensus. See how many of the relays publishing for reason "version listed in consensus is quite old" are missing any hours in the consensus.
  • If there are some that fall out of the consensus entirely, think about ways to make the republish more aggressive and earlier, or if it is already more aggressive and earlier, figure out why it isn't sticking.
  • Think about ways to make our relay-side decisions about "is it different enough" synchronize better with our dirauth-side decisions. Now that we're doing hourly consensus documents, can the dir auths be more lenient about similar-ish descriptors, because there's only one "winner" of a descriptor each hour? This poor synchronization is part of why we couldn't implement proposal 275 when we wanted to.
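
For the first item, the per-relay measurement could look something like the sketch below. It assumes we have already extracted, for one relay, a flag per hourly consensus saying whether the relay was listed. The type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
  size_t missing_hours; /* consensuses in which the relay was absent */
  size_t longest_gap;   /* longest run of consecutive absent hours */
} presence_stats_t;

/* listed[i] is true if the relay appeared in the i-th hourly consensus. */
presence_stats_t
count_missing_hours(const bool *listed, size_t n_hours)
{
  presence_stats_t stats = {0, 0};
  size_t current_gap = 0;
  for (size_t i = 0; i < n_hours; i++) {
    if (listed[i]) {
      current_gap = 0;
      continue;
    }
    stats.missing_hours++;
    current_gap++;
    if (current_gap > stats.longest_gap)
      stats.longest_gap = current_gap;
  }
  return stats;
}
```

Running this only over relays whose uploads carried the reason "version listed in consensus is quite old" would show whether the bandaid fires in time or only after the relay has already dropped out.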

Child Tickets

Ticket | Status | Owner | Summary | Component
#23638 | closed |  | moria1, running 0.3.2.1-alpha-dev, stops getting voted about | Core Tor/Tor

Change History (5)

comment:1 Changed 6 months ago by teor

One way to resolve this issue is to make dirauths just believe relays when they republish their descriptors.

We might want to rate-limit republishing to N descriptors per hour, but I'm not sure what that gets us, because we've already received and parsed the descriptor by the time we reject it.

comment:2 in reply to:  description Changed 6 months ago by arma

Replying to arma:

Think about ways to make our relay-side decisions about "is it different enough" synchronize better with our dirauth-side decisions.

One way in which it will be hard to make them completely synchronized (motivated by the bug report in #23638) is that if the relay restarts, when it comes back it has no memory of what it might have published in its previous run. That would argue for "make dir auths more lenient" as a primary direction to pursue.

comment:3 in reply to:  1 Changed 6 months ago by arma

Replying to teor:

We might want to rate-limit republishing to N descriptors per hour, but I'm not sure what that gets us, because we've already received and parsed the descriptor by the time we reject it.

Well, for one it gets us "the dir auths don't fill up their disk with cached descriptors": because of bug #25686 I'm seeing 1.65M buggy descriptor publishes per two weeks, which works out to more than one per second on average.
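
As a sketch of what "N descriptors per hour" could mean on the dirauth side, here is a simple fixed-window counter kept per relay. The names and the value of N are placeholders. As teor notes, this runs after the descriptor has been received and parsed, so it would only limit what gets stored.

```c
#include <stdbool.h>
#include <time.h>

/* The "N" from the discussion; 4 is an arbitrary placeholder. */
#define MAX_UPLOADS_PER_HOUR 4
#define WINDOW_SECS 3600

/* One of these would be kept per relay identity (hypothetical). */
typedef struct {
  time_t window_start;
  unsigned count;
} upload_limiter_t;

/* Return true if this upload may be stored, false if it exceeds
 * the per-hour budget for this relay. */
bool
upload_allowed(upload_limiter_t *lim, time_t now)
{
  if (now - lim->window_start >= WINDOW_SECS) {
    /* Start a fresh one-hour window. */
    lim->window_start = now;
    lim->count = 0;
  }
  if (lim->count >= MAX_UPLOADS_PER_HOUR)
    return false;
  lim->count++;
  return true;
}
```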

comment:4 Changed 4 months ago by 49ax56xr36

Perhaps keep track of the time of the last validated descriptor upload attempt when its contents are discarded as trivial? Use that time when determining whether to age-out a relay? During the six hours when changes are allowed the "valid attempt" timer is cleared and time from the descriptor applied.
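
One possible reading of this suggestion, as a sketch: the authority remembers when it last dropped a valid upload as cosmetic, and uses the later of that time and the stored descriptor's time when deciding whether the relay has aged out. The struct and function names are hypothetical, and the six-hour window mentioned above is not modelled.

```c
#include <stdbool.h>
#include <time.h>

/* The 24-hour descriptor lifetime from the ticket description. */
#define MAX_DESCRIPTOR_AGE_SECS (24 * 60 * 60)

/* Hypothetical per-relay record on the authority. */
typedef struct {
  time_t descriptor_published;  /* time of the descriptor we kept */
  time_t last_cosmetic_attempt; /* 0 if none since the last accept */
} auth_relay_record_t;

/* Record a validated upload; is_cosmetic says whether we dropped it. */
void
note_upload(auth_relay_record_t *rec, time_t upload_time, bool is_cosmetic)
{
  if (is_cosmetic) {
    rec->last_cosmetic_attempt = upload_time;
  } else {
    /* Accepted: clear the attempt timer, use the descriptor's time. */
    rec->descriptor_published = upload_time;
    rec->last_cosmetic_attempt = 0;
  }
}

/* Age out based on the most recent sign of life from the relay. */
bool
relay_is_expired(const auth_relay_record_t *rec, time_t now)
{
  time_t freshest = rec->descriptor_published;
  if (rec->last_cosmetic_attempt > freshest)
    freshest = rec->last_cosmetic_attempt;
  return now - freshest > MAX_DESCRIPTOR_AGE_SECS;
}
```

With a dropped upload at hour 17, the relay would stay listed until hour 41 instead of hour 24, which covers the relay's next timed upload at hour 35.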

comment:5 Changed 5 weeks ago by nickm

Milestone: Tor: 0.3.6.x-final