In the current bandwidth authority code, when a fetch attempt fails, it will still be counted as a circuit that went through all of the nodes -- even if those nodes weren't responsible for the failure.
This has the potential of resulting in a relay not being measured sufficiently, or at all: the code will consider failures from unstable nodes to be relevant for nodes that are perfectly stable.
In slices where exits and entries aren't well-distributed (like, all of them) this can result in some nodes not being measured at all, and losing their consensus weight. This seems to affect exits a lot more than it does other relay types: people on tor-relays@ have mentioned that removing their exit policies gets their consensus weight back, and I have been able to reproduce this.
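To make the failure mode concrete, here is a minimal sketch of the accounting described above. It is not the bandwidth authority code: the relay names, the `(path, succeeded)` tuples, and the `attribute_failures` helper are all hypothetical, and the only assumption is that every failed circuit counts against each relay in its path.

```python
from collections import Counter


def attribute_failures(circuits):
    """Count one failure per relay for every failed circuit.

    Hypothetical model: the relay that actually caused the failure is
    unknown, so every relay in the path is blamed equally.
    """
    failures = Counter()
    for path, succeeded in circuits:
        if not succeeded:
            failures.update(path)
    return failures


# A stable exit that keeps getting paired with unstable entries.
circuits = [
    (("flaky_entry_1", "stable_exit"), False),
    (("flaky_entry_2", "stable_exit"), False),
    (("good_entry", "stable_exit"), True),
]
failures = attribute_failures(circuits)
```

In this toy data the stable exit ends up with more recorded failures than either unstable entry, even though it caused none of them.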
The problem does affect exits more than middle relays, and many operators reported that switching from exit to middle relay helped, but there have also been cases where even changing the ExitPolicy to reject *:* didn't bring the consensus weight back.
How exactly does a bwauth connect to an exit and try to measure it? What can happen along the way to make the bwauth think the exit is misbehaving?
In sbws, when a relay is going to be measured, a second relay is selected at random from those with double, or at least equal, the bandwidth of the relay being measured, so the measurement is unlikely to fail because of the other relay.
The next time the relay is measured, it will probably be paired with a different relay.
However, the fastest relays can only be measured with slower relays, and with a small set of candidates. There's a scaling process after this, but maybe it's a good thing that they get restricted anyway.
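A rough sketch of that helper selection, to make the discussion easier to follow. This is not the sbws code: the dict fields `nick` and `bw`, the function name `pick_helper`, and the reading of "double or equal" as "try at least double first, then fall back to at least equal" are all assumptions.

```python
import random


def pick_helper(target, relays, rng=random):
    """Pick a random helper relay that is at least as fast as the target.

    Assumed reading of the rule: prefer helpers with at least double the
    target's bandwidth, then fall back to at least equal bandwidth.
    Returns None when no other relay is fast enough.
    """
    others = [r for r in relays if r["nick"] != target["nick"]]
    for factor in (2, 1):
        candidates = [r for r in others if r["bw"] >= factor * target["bw"]]
        if candidates:
            return rng.choice(candidates)
    return None


relays = [
    {"nick": "slow", "bw": 100},
    {"nick": "medium", "bw": 150},
    {"nick": "fast", "bw": 1000},
]
helper_for_slow = pick_helper(relays[0], relays, random.Random(0))
helper_for_fastest = pick_helper(relays[2], relays, random.Random(0))
```

For the fastest relay no candidate qualifies, which is the case where the scanner has to fall back to slower relays. That fallback is not modelled here.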
In version 1.0.2, sbws even prioritized relays with a higher number of failures, but it was observed that it would then continuously retry unstable relays that would probably fail again.
This has been removed in the latest version, which only prioritizes relays based on how long ago they were last measured.
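As an illustration only, prioritizing purely by measurement age could look like the sketch below. The function name `order_by_staleness`, the `last_measured` mapping, and the choice to put never-measured relays first are assumptions, not the sbws implementation.

```python
def order_by_staleness(relays, last_measured):
    """Return relay names ordered from least to most recently measured.

    `last_measured` maps a relay name to the timestamp of its last
    measurement. Relays missing from it are treated as never measured
    and sort first (an assumption of this sketch).
    """
    return sorted(relays, key=lambda name: last_measured.get(name, float("-inf")))


queue = order_by_staleness(["a", "b", "c"], {"a": 300.0, "b": 100.0})
```

Note that the failure count plays no role in the ordering, so an unstable relay is not retried more often than a stable one.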
Regarding exit policies: the policy only determines whether the relay being measured is the first or the second hop, and the only check is whether the policy allows exiting to port 443.
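Sketched in code, the hop ordering could be expressed as follows. The `allows_443` flag and the `order_hops` function are hypothetical stand-ins for the real policy check, and the sketch assumes a two-hop circuit in which exactly one relay has to exit to port 443.

```python
def order_hops(relay, helper):
    """Return (first_hop, second_hop) for a two-hop measurement circuit.

    If the relay being measured can exit to port 443 it goes last;
    otherwise it goes first and the helper has to act as the exit.
    """
    if relay["allows_443"]:
        return (helper["nick"], relay["nick"])
    if not helper["allows_443"]:
        raise ValueError("helper must allow exiting to port 443")
    return (relay["nick"], helper["nick"])


exit_relay = {"nick": "exit", "allows_443": True}
middle_relay = {"nick": "middle", "allows_443": False}
path_for_exit = order_hops(exit_relay, middle_relay)
path_for_middle = order_hops(middle_relay, exit_relay)
```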
A reason why an exit might always fail to be measured is when it retrieves the data from a CDN, the local resolver returns an IPv6 address, and the exit can exit to an IPv6 address. Maybe this is something to be monitored, but it wouldn't happen once #28463 (moved) is implemented.
I think this ticket can be closed, but it'd be great to get opinions on whether the sbws design solves this.