Seems to me that the ways to deal with the edge case you describe above are:
a) Eventually clients try authorities to fetch mds if all else fails (bad for the health of dirauths). I think that's what you suggested basically.
Yes, we should implement this, if the other fixes don't resolve the md issue.
It's only bad for the authorities if a lot of clients do it all the time.
b) We remove dirauths from the fallback list (less traffic on dirauths. any drawback?)
You can't avoid this issue by stopping clients contacting authorities. Because there are other ways that a client can have a consensus with some microdescs that are not on its guards.
And we already weight dirauths low on the fallback list, so not many clients contact them.
Removing authorities from the fallback list would break clients that disable fallbacks, and clients on non-standard networks. Also, it would break clients if too many fallbacks go down on the public network.
c) We make dirservers fetch new consensuses/mds much faster than the current 30-minute delay (bad for the health of dirauths).
You are right, this is bad because it requires every relay to do this 100% of the time, which is impossible, as well as being bad for the network.
Load on dirauths is extremely light right now IMO. Having clients contact dirauths would be bad, but having relays contact dirauths early and a bit more aggressively (and maybe having dirauths ultra-aggressively try every other dirauth) doesn't sound like the end of the world to me. I am currently seeing less than 3MB/s (averaged over 30 second intervals) peak outgoing bandwidth on my dirauth which is basically negligible.
Load on dirauths is extremely light right now IMO. Having clients contact dirauths would be bad, but having relays contact dirauths early and a bit more aggressively (and maybe having dirauths ultra-aggressively try every other dirauth) doesn't sound like the end of the world to me. I am currently seeing less than 3MB/s (averaged over 30 second intervals) peak outgoing bandwidth on my dirauth which is basically negligible.
I am not sure we can have relays fetch fast enough to make sure this bug never happens on clients. That would concentrate load every hour between about hh:00 and hh:01.
Why don't we have clients remember where they got the consensus, and try it for any missing microdescs before trying an authority?
I am thinking the right design for this would be a kind of staged distribution, where relays get to fetch a new consensus before it's valid (but don't use it yet). This might be quite tricky to implement with the current system though :/
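To make the staged-distribution idea a bit more concrete, here is a minimal Python sketch of the timing logic: relays would fetch the next consensus during a short window before its valid-after time, but keep using the current one until the new one becomes valid. The window length, function names, and arguments are all invented for illustration; this is not how Tor's consensus code is actually structured.

    # Sketch of staged distribution: pre-fetch early, use only once valid.
    # PREFETCH_WINDOW and these helpers are assumptions, not Tor internals.
    from datetime import timedelta

    PREFETCH_WINDOW = timedelta(minutes=10)  # assumed pre-publication window

    def should_prefetch_next_consensus(now, next_valid_after, already_fetched):
        """True if a relay should download the staged consensus now."""
        if already_fetched:
            return False
        return next_valid_after - PREFETCH_WINDOW <= now < next_valid_after

    def consensus_to_use(now, current, staged, staged_valid_after):
        """Keep using the current consensus until the staged one is valid."""
        if staged is not None and now >= staged_valid_after:
            return staged
        return current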
Seems to me that the ways to deal with the edge case you describe above are:
a) Eventually clients try authorities to fetch mds if all else fails (bad for the health of dirauths). I think that's what you suggested basically.
Yes, we should implement this, if the other fixes don't resolve the md issue.
It's only bad for the authorities if a lot of clients do it all the time.
True. But we have lots of clients, so I think before doing this we might want to calculate the probability of this happening, to try to understand how many clients will end up doing this behavior.
b) We remove dirauths from the fallback list (less traffic on dirauths. any drawback?)
You can't avoid this issue by stopping clients contacting authorities. Because there are other ways that a client can have a consensus with some microdescs that are not on its guards.
True. But it's less likely if dirauths are not in the picture, since basically your edge case is guaranteed to happen every time a client randomly picks a dirauth early in the hour (e.g. between hh:00 and hh:05).
And we already weight dirauths low on the fallback list, so not many clients contact them.
Removing authorities from the fallback list would break clients that disable fallbacks, and clients on non-standard networks. Also, it would break clients if too many fallbacks go down on the public network.
Hmm, I don't understand these points exactly. Can you expand? Why would clients break worse than currently if we remove dirauths from fallbacks? We can add a few more relays in the fallbacks to compensate.
I think we should implement an authority md fetch for clients that run out of microdesc attempts. And I think they can easily handle the load of a few mds, because they are handling a similar consensus load from clients and relays already.
I also don't think removing authorities from the fallback list will help much, because bootstrapping clients try authorities anyway.
Seems to me that the ways to deal with the edge case you describe above are:
a) Eventually clients try authorities to fetch mds if all else fails (bad for the health of dirauths). I think that's what you suggested basically.
Yes, we should implement this, if the other fixes don't resolve the md issue.
It's only bad for the authorities if a lot of clients do it all the time.
True. But we have lots of clients, so I think before doing this we might want to calculate the probability of this happening, to try to understand how many clients will end up doing this behavior.
Yes, I think we should estimate how often it will happen. We can afford to have a few thousand clients download a few mds per hour (0.1% of 2 million clients per hour). Because we have a few thousand relays download two consensus flavours and all the new mds from the authorities, and they are handling this load fine.
b) We remove dirauths from the fallback list (less traffic on dirauths. any drawback?)
You can't avoid this issue by stopping clients contacting authorities. Because there are other ways that a client can have a consensus with some microdescs that are not on its guards.
True. But it's less likely if dirauths are not in the picture, since basically your edge case is guaranteed to happen every time a client randomly picks a dirauth early in the hour (e.g. between hh:00 and hh:05).
Yes. Directory mirrors download at random between hh:00 and hh:30, so missing microdescriptors are guaranteed to happen for 50% of clients that bootstrap off authorities (9/(150*10 + 9) ~= 0.6% of clients bootstrap off authorities) at hh:15. Assuming that clients bootstrap at random throughout the hour, this is 0.6% * 0.25 = 0.15% of bootstrapping clients per hour (the 50% average miss chance applies during the first half of the hour, so it averages to 0.25 over the whole hour). So we can afford to have all these clients try an authority for their mds, because the number of bootstrapping clients is much lower than the number of running clients. (We could afford to have 0.15% of all clients do this, not just 0.15% of the bootstrapping ones.)
The actual figure is slightly higher than this, because after trying 3 fallbacks/authorities, 0.3.2 and later clients try an authority directly. When 10% of fallbacks are down, 0.1% of clients try an authority for this reason. But the authorities are already handling this consensus fetch traffic fine, so an extra few mds won't be an issue.
(For 0.3.1 and earlier clients, 100% try an authority and a fallback straight away when bootstrapping, and they pick whichever wins. So we might want to think a bit harder about backporting #17750 (moved) and #23347 (moved), if we also want to backport an authority md fetch to earlier versions.)
We could easily reduce the 0.3.2 client authority fetch to 0.115% (0.015% + 0.1%) by weighting the fallbacks at 100 rather than 10. But that doesn't remove the 0.1% that try an authority after 3 fallbacks. So I'm not sure re-weighting (or removing) would have the impact you want.
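To make these estimates easy to re-check if the fallback count, weights, or downtime assumptions change, here is a small Python snippet that reproduces the arithmetic above. All the inputs are the assumptions stated in this comment (150 fallbacks weighted 10, 9 authorities, 10% of fallbacks down), not measured values.

    n_fallbacks = 150
    n_authorities = 9
    fallback_weight = 10   # current fallback weighting relative to authorities
    authority_weight = 1

    # Fraction of bootstrapping clients that pick an authority first.
    p_auth_first = (n_authorities * authority_weight) / (
        n_fallbacks * fallback_weight + n_authorities * authority_weight)

    # Mirrors fetch the new consensus at random between hh:00 and hh:30, so an
    # authority-bootstrapping client misses guard mds 50% of the time during
    # that half hour; averaged over the whole hour that is a factor of 0.25.
    p_missing_md = 0.5 * 0.5

    print(f"bootstrap off an authority: {p_auth_first:.2%}")                 # ~0.60%
    print(f"hit the missing-md case:    {p_auth_first * p_missing_md:.3%}")  # ~0.15%

    # 0.3.2+ clients also try an authority after 3 failed fallbacks.
    p_fallback_down = 0.10
    print(f"3-failure authority fetch:  {p_fallback_down ** 3:.2%}")         # 0.10%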
And we already weight dirauths low on the fallback list, so not many clients contact them.
Removing authorities from the fallback list would break clients that disable fallbacks, and clients on non-standard networks.
These clients would have nothing in the fallback list to bootstrap off, because they don't use the hard-coded public fallbacks. We can avoid this by only removing the authorities when using public fallbacks, but that makes the code hard to test in chutney.
Also, it would break clients if too many fallbacks go down on the public network.
Hmm, I don't understand these points exactly. Can you expand? Why would clients break worse than currently if we remove dirauths from fallbacks? We can add a few more relays in the fallbacks to compensate.
The idea of having authorities in the fallback list is that clients will use them if a large number of the fallbacks break for some reason (for example, a bad bug on mirrors). I am not sure if this actually works, but let's not break it until we are sure:
- removing them will help this issue, and
- removing them won't create any other issues.
I don't think removing authorities from the fallback list would help this issue, because 0.1% of bootstrapping clients will still try an authority when they fail 3 fallbacks.
I think we should implement an authority md fetch for clients that run out of microdesc attempts. And I think they can easily handle the load of a few mds, because they are handling a similar consensus load from clients and relays already.
I also don't think removing authorities from the fallback list will help much, because bootstrapping clients try authorities anyway.
I'm continuing the discussion here, although it's worth mentioning that teor also added some more calculations in #24113 (moved).
I think I can get behind doing an authority md fetch for clients that have failed too many microdesc attempts. To further reduce the load on dirauths, perhaps we should do this only if we are missing descriptors for some of our primary guards (i.e. only if we are missing very crucial mds), since clients can/should usually tolerate missing a few random mds.
If we agree on the general concept here, I will come up with an implementation plan early next week.
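As a strawman for that implementation plan, here is a rough Python sketch of the condition being proposed: only go to an authority once a missing primary-guard md has already failed several attempts from our directory guards. The threshold, function name, and data structures are placeholders, not Tor's actual code.

    MAX_GUARD_MD_FAILURES = 3  # assumed per-md retry threshold, for illustration

    def should_try_authority_for_mds(missing_mds, primary_guard_mds, failures):
        """True once a missing primary-guard md has failed too many attempts.

        missing_mds: set of md hashes we still don't have
        primary_guard_mds: md hashes belonging to our primary guards
        failures: dict mapping md hash to failed download attempts
        """
        missing_primary = [m for m in missing_mds if m in primary_guard_mds]
        if not missing_primary:
            return False  # a few missing non-critical mds are tolerable
        return any(failures.get(m, 0) >= MAX_GUARD_MD_FAILURES
                   for m in missing_primary)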
I think we should implement an authority md fetch for clients that run out of microdesc attempts. And I think they can easily handle the load of a few mds, because they are handling a similar consensus load from clients and relays already.
I also don't think removing authorities from the fallback list will help much, because bootstrapping clients try authorities anyway.
I'm continuing the discussion here, although it's worth mentioning that teor also added some more calculations in #24113 (moved).
I think I can get behind doing an authority md fetch for clients that have failed too many microdesc attempts. To further reduce the load on dirauths, perhaps we should do this only if we are missing descriptors for some of our primary guards (i.e. only if we are missing very crucial mds), since clients can/should usually tolerate missing a few random mds.
I think asking an authority is a good idea.
Is it also worth asking a fallback first?
This might be another way to reduce load on the authorities.
And I think it would really help some clients if we do it, because some networks block authority addresses.
If we only ask an authority or fallback when we are missing a guard microdesc, this leaks our guards to the authority or fallback.
I think that is probably ok. Because these queries are mixed in with a bunch of other client queries.
(Authorities see about as many client queries as they see relay queries.)
But here's what we can do to make the leak less obvious:
- ask for all the missing microdescs, not just the primary guard ones
  - this has a very low impact, because we are already doing a request - we should definitely do it.
- ask all the time, not just when we are missing primary guards
  - this has a higher impact, but I think we can easily afford to do it if we want to,
  - but I agree with you - I don't think we need to do it, so let's not bother right now.
Some detailed questions about the md request:
- What if we are missing more microdescs than fit in a single request?
- How do we make sure our primary guards are in that request?
- What order do we usually use for md hashes in requests?
- When we make multiple requests, do we usually split mds between them at random?
- Do we usually sort the hashes to destroy ordering information?
(I can imagine myself writing a request that starts with the guard md hashes, and not realising I was leaking them.)
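Here is one possible answer to those questions, sketched in Python: request all the missing mds (not just the guards'), make sure the primary guard hashes land in the first batch, and sort the hashes inside each batch so their order carries no information. The per-request limit and the function name are made up for illustration; I haven't checked what the current code actually does.

    import random

    MAX_MDS_PER_REQUEST = 92  # assumed per-request limit, illustration only

    def build_md_requests(missing_md_hashes, primary_guard_md_hashes):
        """Split missing md hashes into order-scrubbed request batches,
        keeping the primary guard mds in the first batch."""
        guard_mds = [h for h in missing_md_hashes if h in primary_guard_md_hashes]
        other_mds = [h for h in missing_md_hashes if h not in primary_guard_md_hashes]
        random.shuffle(other_mds)          # random split across batches
        hashes = guard_mds + other_mds     # guards always land in batch 0
        batches = []
        for i in range(0, len(hashes), MAX_MDS_PER_REQUEST):
            # Sorting inside each batch destroys any ordering information.
            batches.append(sorted(hashes[i:i + MAX_MDS_PER_REQUEST]))
        return batches

Putting the guard hashes in the first batch covers the second question, and the shuffle plus per-batch sort covers the last three.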
Some more detailed design questions, after reviewing #23817 (moved):
- what should we do when we are using bridges, or all the authorities and fallbacks are excluded by an EntryNodes setting?
- should we fetch mds from a fallback or an authority over a 3-hop path?
  - is this what bridge clients do already, or do they give up when they can't get something from their bridge(s)?
- if we are willing to fetch missing microdescs over a 3-hop path, can we make should_set_md_dirserver_restriction() always return 1?
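To make the trade-off behind these questions concrete, here is one hypothetical decision sketch in Python (the names and the policy are assumptions, not Tor's behaviour): clients using bridges, or whose EntryNodes setting excludes every fallback and authority from direct use, would fetch over a 3-hop path instead of connecting directly.

    def plan_md_retry(using_bridges, entry_excludes_dirs, fallbacks, authorities):
        """Return (directory, use_3_hop_path), or None to keep waiting.

        Hypothetical sketch only: arguments and policy are assumptions.
        """
        candidates = fallbacks + authorities
        if not candidates:
            return None
        if using_bridges or entry_excludes_dirs:
            # Don't connect directly: fetch anonymously over a 3-hop path so
            # the directory never sees the client's address.
            return (candidates[0], True)
        return (candidates[0], False)  # a direct one-hop directory fetch is fine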
Trac: Summary changed from "When our directory guards don't have each others' microdescs, we should mark some dead" to "When our directory guards don't have each others' microdescs, we should try an authority or fallback"
My laptop's TBB 7.5 just got hit by this, I think. It is using the default guard context, no bridges. I am not sure if it's this or #21969 (moved). Basically, it is running but useless (it can't load any web page, and other apps using the SocksPort opened by this Tor instance are also disconnected). It has been in this state for more than 4 hours now. I will leave it to see if it recovers by itself. Tried New Identity, but it does not fix it. Funnily enough, the heartbeat still counts open circuits:
2/24/2018 11:05:35 AM.600 [NOTICE] Heartbeat: Tor's uptime is 2 days 17:59 hours, with 3 circuits open. I've sent 37.58 MB and received 116.72 MB.
2/24/2018 11:05:35 AM.600 [NOTICE] Average packaged cell fullness: 29.597%. TLS write overhead: 5%
2/24/2018 15:15:44 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:16:14 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:18:30 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:18:50 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:19:06 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:20:30 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:20:50 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:21:06 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:22:30 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:22:50 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:23:06 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:26:30 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:28:30 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:30:30 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:32:32 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 16:34:32 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 17:05:35 PM.600 [NOTICE] Heartbeat: Tor's uptime is 2 days 23:59 hours, with 7 circuits open. I've sent 40.25 MB and received 122.63 MB.
2/24/2018 17:05:35 PM.600 [NOTICE] Average packaged cell fullness: 28.339%. TLS write overhead: 5%
2/24/2018 17:12:13 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
2/24/2018 17:16:52 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
2/24/2018 17:38:32 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
2/24/2018 17:41:02 PM.700 [NOTICE] New control connection opened from 127.0.0.1.
2/24/2018 17:43:08 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
2/24/2018 17:46:02 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 17:53:47 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 17:55:00 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:587. Giving up. (waiting for circuit)
2/24/2018 17:55:47 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 17:57:00 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:587. Giving up. (waiting for circuit)
2/24/2018 17:57:47 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 17:59:00 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:587. Giving up. (waiting for circuit)
2/24/2018 17:59:56 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 18:01:56 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 18:03:56 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
2/24/2018 18:20:34 PM.800 [NOTICE] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for 1/2 of our primary entry guards (total microdescriptors: 6051/6077).
2/24/2018 18:20:34 PM.800 [NOTICE] I learned some more directory information, but not enough to build a circuit: We're missing descriptors for 1/2 of our primary entry guards (total microdescriptors: 6051/6077).
2/24/2018 18:20:35 PM.200 [NOTICE] We now have enough directory information to build circuits.
2/24/2018 19:03:18 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
When Tor can't download microdescriptors (#21969 (moved)), maybe it should try authorities or fallbacks (#23863 (moved)), before it runs out of microdesc retries (#24113 (moved)). But even after Tor has the microdescs it needs, it sometimes doesn't start building circuits again. Instead, it stays in the "waiting for circuit" state (#25347 (moved)).