Currently, a relay needs to be up for at least 24 hours to be considered an HSDir. I think we should change this to 24 hours and 30 minutes, or to 25 hours, because this will give the directory authorities a little more time to notice that a relay has disappeared before voting HSDir for it. This helps because a lot of our relays are on connections that disconnect exactly once every 24 hours (which is the reason for the 24-hour interval in the first place), and it might help to ensure better reachability of relays performing HSDir duties.
Does this change need a proposal or is discussion here fine?
Sounds like a fine plan. Let me know when you have a patch.
Ideally we would get some feedback from our analysis project (what I'm starting to call the part of metrics that looks at data and tells us answers about what's going on) about how quickly nodes that have the HSDir flag disappear, to prove that this change is needed (and to prove that the new number we've picked is a good one). Or we could just guess a good number and switch to it.
#2649 does need to go into the next 0.2.2.x-alpha.
...the one about changing the required uptime for being HSDir?
nickm: Yes.
rransom: ok. why?
(or say why on the ticket)
nickm: The set of routers with the HSDir flag is not currently stable enough, and the only way to test a new criterion for the HSDir flag is to actually use it on the live network, and I assume that requires that it be put into a 0.2.2.x-alpha release so the non-developer-operated DAs will use it.
rransom: fair enough; changed the milestone
See bug2649 ( ssh://mob@repo.or.cz/srv/git/tor/rransom.git bug2649 ) for a branch that (a) allows DAs to not vote on the HSDir flag (and makes them default to not voting on it), and (b) increases the default minimum uptime from 24 hours to 25 hours.
The reason to make DAs default to not voting on the HSDir flag is that I doubt that 25 hours is high enough to keep the HSDir set stable, and I don't know what minimum uptime is high enough, so we will need to experiment for a while, and that is much easier if the non-developer-operated DAs don't need to be updated and/or reconfigured for every test of a new value.
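To make the two behaviours of the branch concrete, here is a minimal sketch of the decision an authority would make for each relay. It is not the code in the branch: `AuthorityConfig`, `vote_on_hsdir`, `min_hsdir_uptime` and `votes_hsdir` are hypothetical names invented for this illustration, and the only values taken from the ticket are the 24-hour and 25-hour thresholds.

```python
from dataclasses import dataclass

HOUR = 3600  # seconds


@dataclass
class AuthorityConfig:
    # Hypothetical field names, used only in this sketch.
    vote_on_hsdir: bool                 # does this authority vote on the flag at all?
    min_hsdir_uptime: int = 25 * HOUR   # the 25-hour value proposed in this ticket


def votes_hsdir(config, relay_uptime, is_running=True):
    """Return True or False for the HSDir flag, or None if the authority abstains."""
    if not config.vote_on_hsdir:
        return None  # the authority does not vote on the flag at all
    return is_running and relay_uptime >= config.min_hsdir_uptime


# A relay whose connection resets exactly every 24 hours peaks at 24h of uptime.
old_rule = AuthorityConfig(vote_on_hsdir=True, min_hsdir_uptime=24 * HOUR)
new_rule = AuthorityConfig(vote_on_hsdir=True)
print(votes_hsdir(old_rule, 24 * HOUR), votes_hsdir(new_rule, 24 * HOUR))
```

Under these assumptions, a relay that is disconnected every 24 hours can still qualify under the old threshold just before it drops off, but never qualifies under the 25-hour threshold.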
Do we actually have good statistics on hsdir stability? If so, where? Without those already in place we shouldn't start experimenting. Also, we should figure out what super-increased stability entails - one goal of the design is that the position in the ring shifts slowly.
The first patch sets a bad default imo. If we want to do it at all (I'd prefer we don't) then it should be enabled by default, not disabled. We shouldn't keep adding options so that a subset of the dirauths gets to decide something and the others don't.
I agree with Sebastian. It's okay to ask authority ops to disable the thing for now for testing, but it's kind of iffy IMO to disable it by default. I doubt whether all of the more responsive relay ops would even remember to turn this on.
Other than that, this looks fine... except that set_routerstatus_from_routerinfo is becoming one of those functions with multiple flag arguments. That's starting to get error-prone. If we need to add any more flags, we should change it to take an unsigned bitfield.
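As a rough illustration of the bitfield idea (in Python rather than C, and not a proposed patch), the separate boolean arguments could be folded into one flags value. The `VoteOptions` members and the `enabled_options` helper below are hypothetical names chosen for this sketch; they are not the actual parameters of set_routerstatus_from_routerinfo.

```python
from enum import IntFlag


class VoteOptions(IntFlag):
    # Hypothetical option names; each one occupies its own bit.
    NAMING = 1
    LIST_BAD_EXITS = 2
    VOTE_ON_HSDIR = 4


def enabled_options(flags):
    """List the names of the options set in a combined flags value."""
    return [opt.name for opt in VoteOptions if opt in flags]


# Callers name the options they want instead of passing positional booleans.
opts = VoteOptions.LIST_BAD_EXITS | VoteOptions.VOTE_ON_HSDIR
print(enabled_options(opts))
```

The point of the design is that a call site reads as a list of named options, so adding another option does not change the function's argument list or make it easy to swap two booleans by accident.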
See hsdir-set-instability-graph.pdf for a graph of (size of set symmetric difference between HSDir sets in consensus N and consensus N+i)/(size of HSDir set in consensus N), for i = 1..4. (I'm counting relays entering the HSDir set as well as relays leaving the set because both events make an HSDir relay unavailable to hidden services and clients.)
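The ratio plotted in the graph can be sketched as follows. This is a minimal illustration of the formula as stated above, not the task-2649 scripts themselves; the function names and the toy relay identifiers are made up for the example, and each consensus is assumed to be already reduced to the set of relays carrying the HSDir flag.

```python
def instability_ratio(hsdirs_n, hsdirs_later):
    """|symmetric difference of the two HSDir sets| / |HSDir set in consensus N|."""
    base = set(hsdirs_n)
    if not base:
        raise ValueError("consensus N has no HSDir relays")
    # The symmetric difference counts relays that left the set and relays that joined it.
    return len(base ^ set(hsdirs_later)) / len(base)


def ratios_for_lags(hsdir_sets, n, max_lag=4):
    """Ratios between consensus n and consensus n+i, for i = 1..max_lag."""
    return {
        i: instability_ratio(hsdir_sets[n], hsdir_sets[n + i])
        for i in range(1, max_lag + 1)
        if n + i < len(hsdir_sets)
    }


# Toy data: three consecutive consensuses with made-up relay names.
hsdir_sets = [
    {"A", "B", "C", "D"},
    {"A", "B", "C", "E"},  # D left, E joined
    {"A", "B", "E", "F"},  # C left as well, F joined
]
print(ratios_for_lags(hsdir_sets, 0))
```

Because joins and departures both count, the ratio can exceed 1 when the later set differs from consensus N by more relays than consensus N contained.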
The ratio shown in the graph is an estimate of the probability that an HS will be unable to deliver a single copy of its descriptor to a client due to the HS, its client, and the HSDir relay responsible for that copy having different consensuses; my understanding is that clients (both HS clients and HS servers) routinely have consensuses out of date by two or three hours, and sometimes four hours. We can probably assume that the disruptions are uniformly distributed around the HSDir ring, in which case the probability that an HS will be entirely unavailable to a client for a given hour due to HSDir-set instability is roughly one-sixth the three-hour probability shown in the graph for that hour.
The scripts used to generate this graph are currently in task-2649 ( git://git.torproject.org/rransom/metrics-tasks.git task-2649 ).