Opened 3 years ago

Last modified 19 months ago

#16696 new defect

BWauth no-consensus fallback logic may need revision

Reported by: starlight Owned by:
Priority: High Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-dirauth bwauth measurement
Cc: aagbsn, karsten Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

At present both 'longclaw' and 'maatuska' have
dropped out of the BW consensus ('longclaw'
is restarting with new version, not sure
about 'maatuska').

This has caused the BW consensus logic to revert
to using relay self-measurement for BW weightings
due to fewer than three BW authorities participating.

The 10000 cap placed on self-measured values
is causing serious demotion of super-fast relays
and a corresponding promotion of slower relays
in the consensus weighting.

This may result in network
imbalance issues. Some adjustment
of the logic seems in order.
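As a rough illustration of the fallback described above (not Tor's actual C implementation; the function name, the quorum constant, and the low-median choice are assumptions for the sketch), the rule "use bwauth measurements only when at least three authorities report, otherwise self-measurement capped at 10000" might look like:

```python
# Hypothetical sketch of the no-consensus fallback. Names and the
# low-median tiebreak are assumptions, not Tor's real code.

MIN_MEASURING_AUTHS = 3   # quorum required to use bwauth measurements
SELF_MEASURE_CAP = 10000  # cap applied to relay self-reported bandwidth

def consensus_weight(self_measured, bwauth_values):
    """Return the weight used in the consensus for one relay.

    self_measured  -- bandwidth the relay reports about itself
    bwauth_values  -- list of Measured values from voting BWauths
    """
    if len(bwauth_values) >= MIN_MEASURING_AUTHS:
        # Enough BWauths: take the low-median of their measurements.
        ordered = sorted(bwauth_values)
        return ordered[(len(ordered) - 1) // 2]
    # Fallback: self-measurement, capped at 10000.
    return min(self_measured, SELF_MEASURE_CAP)
```

Under this sketch a super-fast relay self-reporting 50000 is demoted to 10000 during the fallback, which is the demotion/promotion effect described above.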

Unsure of the component, so I left it unset.

Child Tickets

Attachments (9)

alf.png (4.0 KB) - added by starlight 3 years ago.
ArachnideFR5.png (4.0 KB) - added by starlight 3 years ago.
becks.png (3.7 KB) - added by starlight 3 years ago.
BeSeeingYou.png (3.4 KB) - added by starlight 3 years ago.
HaveHeart.png (4.0 KB) - added by starlight 3 years ago.
IPredator.png (4.0 KB) - added by starlight 3 years ago.
redjohn.png (5.2 KB) - added by starlight 3 years ago.
Binnacle.pdf (68.3 KB) - added by starlight 3 years ago.
EmbraceTheChaos.pdf (67.3 KB) - added by starlight 3 years ago.


Change History (26)

comment:1 Changed 3 years ago by starlight

Reviewed the initial effects of the fallback-to-self-measure
state which lasted from about 16:00 to 21:00 GMT on 7/30.

It does not look dire, and it even appears possible
that the middle-fast relays have sufficient excess
capacity to absorb the load redirected from
super-fast relays. Middle-fast relays tend
to be loaded to only 20-30% in the present
environment.

Attached several Blutmagie graphs for a sample
of super-fast relays (.PNG files). The relays
exhibiting the largest drop-off were saturated
and might even have been gaming the BWauths for
weightings in excess of their true capacity.

Two attached .SVGs show middle-speed relay weightings.
Interestingly, the weight increase from the event was
less than the earlier eccentric values assigned by
a prior iteration of TorFlow. I operate Binnacle
and it easily handled the earlier elevated weight.

Three possibilities come to mind:

1) change nothing as the current behavior, while not perfect, might be best

2) permit two or even one BWauth to fulfill a consensus quorum

3) increase the self-measure bandwidth cap, perhaps to somewhere between 15k and 30k

Changed 3 years ago by starlight

Attachment: alf.png added

Changed 3 years ago by starlight

Attachment: ArachnideFR5.png added

Changed 3 years ago by starlight

Attachment: becks.png added

Changed 3 years ago by starlight

Attachment: BeSeeingYou.png added

Changed 3 years ago by starlight

Attachment: HaveHeart.png added

Changed 3 years ago by starlight

Attachment: IPredator.png added

Changed 3 years ago by starlight

Attachment: redjohn.png added

Changed 3 years ago by starlight

Attachment: Binnacle.pdf added

Changed 3 years ago by starlight

Attachment: EmbraceTheChaos.pdf added

comment:2 Changed 3 years ago by starlight

.SVGs would not upload, so I printed them to PDFs with Tor Browser and uploaded those.

For now it is easiest to look at them directly with Atlas.

comment:3 Changed 3 years ago by mikeperry

Cc: aagbsn karsten added

Can we extract the exact consensuses where the bw auths were down, and create an overlay on top of the torperf graphs for 50k, 1MB, and 5MB at https://metrics.torproject.org/torperf.html and https://metrics.torproject.org/torperf-failures.html?

Karsten did this some years ago the last time the bw auths failed, and we found that performance worsened by 4-5X when the bw auths were down for any significant duration. I'm curious if that's still the case. Unfortunately, I think we're currently seeing a lot of on-off flapping though, and not significant periods where they remain consistently down.

comment:4 Changed 3 years ago by starlight

Per Arma, no problems this time (after 2.5+ days):

https://lists.torproject.org/pipermail/tor-relays/2015-August/007537.html

I opine in the same thread that Tor runs better this way
with the current relay environment. This is probably
the result of the dropping cost and increasing
availability of network capacity, improvements in the
relay code, and side-stepping known issues with the
BWauths. It appears that fewer than 50 high-capacity
relays have lost consensus weight, with the rest
taking it up easily.

An unscientific sample of six out of 234 previously
Unmeasured=1 exit relays shows them now properly
utilized. An additional 250 non-exits are back in
the game as well.

https://lists.torproject.org/pipermail/tor-relays/2015-August/007539.html

comment:5 Changed 3 years ago by mikeperry

starlight - are we certain that every single consensus period in the past 2.5+ days has had no bw auth measured lines? In that same thread Tom Ritter's comments make it seem like maatuska has been flapping on and off periodically since July 30th, but I have not looked at the consensus histories.

comment:6 Changed 3 years ago by starlight

I watched it the entire time, and there has been
no quorum whatsoever since maatuska quit at

2015-08-02-19-00-00-vote-49015F. . .

moria1 and gabelmoo continue to post measurements,
but the consensus algorithm ignores them and
uses self-measured values capped at 10000 when
fewer than three BWauths report.

maatuska quit exactly three days after its
restart (frozen measurements), in response to
the referenced comments, and I assume that
it failed due to a bug in the code that brings
up the bulk of the new measurements from
a clean start.

I wanted to see what would happen, so I just
watched and stayed mum.

comment:7 Changed 3 years ago by starlight

An additional important element is guard-node
inertia. It takes 90 days for a consensus
weight change to fully propagate to guard
assignments, and the last time this happened
guards were not nearly so sticky.

The rebalance mainly impacts exit and
middle relays. But very-fast exit relays
are assigned zero guard-weights and behave
like pure exits. I suspect that even if
the guard assignments all rotated it would
still work fine at present.

There are two-to-three times more relays, all
with lots more bandwidth, and relay code better
able to handle abuse-by-bot, plus 500+
wasted relays coming back online. Perfect
balance is presently not as critical
as it was previously.

The PID controller algorithm certainly
should be kept against future high-load
scenarios (and perhaps with reduced
emphasis on high-bandwidth relays?),
but the current bugs in TorFlow and the
shortage of BWauths should be
corrected soon, especially to aid
the morale of the operators of those
500 unused relays.

======

As for the no-BW-consensus fallback algorithm,
my thought is it's not broken and doesn't need
fixing.

comment:8 Changed 3 years ago by starlight

Missed one:

Better core-network routing performance, capacity
and reachability.

We can thank Netflix et al, the march of technology
and perhaps the FCC for improved overall network
performance.

comment:9 Changed 3 years ago by ln5

FWIW, maatuska was stuck with consensus data from Jul 30 21:34 (CEST) due to misconfiguration.
This operational issue was fixed approximately an hour ago.

comment:10 Changed 3 years ago by starlight

How long until 'longclaw', with its up-to-date
TorFlow and from-scratch measurements, comes back?

comment:11 in reply to:  3 Changed 3 years ago by karsten

Replying to mikeperry:

> Can we extract the exact consensuses where the bw auths were down, and create an overlay on top of the torperf graphs for 50k, 1MB, and 5MB at https://metrics.torproject.org/torperf.html and https://metrics.torproject.org/torperf-failures.html?
>
> Karsten did this some years ago the last time the bw auths failed, and we found that performance worsened by 4-5X when the bw auths were down for any significant duration. I'm curious if that's still the case. Unfortunately, I think we're currently seeing a lot of on-off flapping though, and not significant periods where they remain consistently down.

It wouldn't be difficult to find out when there were fewer than 3 votes containing Measured values. The tarballs are all available on CollecTor. Mike, others, do you still care about this, and if yes, what time frame do you have in mind for these data (past 3 months, all of 2015, more)? aagbsn, if people care, would you want to pick this up, or should I do this?
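The counting step karsten describes could be sketched roughly as follows, assuming the vote documents have been extracted from the CollecTor tarballs into a flat directory whose filenames begin with the 19-character consensus timestamp (that layout, and all function names here, are assumptions for illustration, not a CollecTor or metrics-team tool):

```python
# Hypothetical sketch: flag consensus periods where fewer than three
# votes contained any "Measured=" entries. Directory layout and names
# are assumptions, not an actual CollecTor tool.
import os
from collections import defaultdict

def measured_votes_per_period(vote_dir):
    """Map each consensus period (taken from the assumed
    'YYYY-MM-DD-HH-00-00-vote-...' filename prefix) to the number
    of votes in that period containing Measured values."""
    counts = defaultdict(int)
    for name in os.listdir(vote_dir):
        period = name[:19]   # assumed timestamp prefix
        counts[period] += 0  # register periods even with zero Measured votes
        with open(os.path.join(vote_dir, name)) as f:
            if any("Measured=" in line for line in f):
                counts[period] += 1
    return counts

def periods_without_quorum(vote_dir, quorum=3):
    """Return the sorted consensus periods lacking a bwauth quorum."""
    counts = measured_votes_per_period(vote_dir)
    return sorted(p for p, n in counts.items() if n < quorum)
```

Running this over, say, the past three months of votes would yield the list of fallback periods to overlay on the torperf graphs.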

comment:12 Changed 3 years ago by rl1987

Component: - Select a component → Tor

comment:13 Changed 3 years ago by nickm

Milestone: Tor: 0.2.???

comment:14 Changed 2 years ago by teor

Milestone: Tor: 0.2.??? → Tor: 0.3.???

Milestone renamed

comment:15 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.??? → Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:16 Changed 20 months ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:17 Changed 19 months ago by nickm

Keywords: tor-dirauth bwauth measurement added
Severity: Normal