Opened 2 years ago

Last modified 4 months ago

#23863 new enhancement

When our directory guards don't have each others' microdescs, we should try an authority or fallback

Reported by: teor
Owned by:
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version: Tor: 0.3.0.6
Severity: Normal
Keywords: tor-guard, tor-bridge, tor-client, 034-triage-20180328, 034-removed-20180328, needs-proposal
Cc: catalyst, isis, bmeson, mrphs, starlight@…
Actual Points:
Parent ID: #21969
Points: 1
Reviewer:
Sponsor:

Description

If our directory guards don't have each others' microdescriptors, we should mark some of them dead.

But should we mark the one that won't give us the microdescriptor dead?
Or should we mark the one that we can't get a microdescriptor for dead?

Child Tickets

Ticket | Type | Status | Owner | Summary
#24991 | defect | closed | | relay frequently claiming "missing descriptors for 1/2 of our primary entry guards"

Change History (21)

comment:1 Changed 2 years ago by teor

Here's what we could do:

  1. Try some directory mirrors
  2. Try a fallback
  3. Try an authority
  4. If we still don't have mds for one or more primary guards, mark them dead until the next consensus

This deals with the scenario where:

  1. Authorities make new consensus with new mds (hh:00)
  2. Client bootstraps and downloads consensus from authorities (either at random because they are part of the fallback list, or due to options)
  3. Client chooses directory guards
  4. Client tries directory guards for new mds
  5. Directory guards are waiting for a random time between hh:00 and hh:30 to fetch new consensus and new mds. See https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n3240
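
Roughly, the escalation could look like this (a minimal sketch with made-up names, not actual tor code):

    /* Sketch of the proposed escalation order for fetching missing mds.
     * All names are hypothetical; this is not tor's directory code. */
    #include <stdio.h>

    typedef enum {
      TRY_DIR_MIRRORS = 0,  /* 1. try some directory mirrors */
      TRY_FALLBACK,         /* 2. try a fallback */
      TRY_AUTHORITY,        /* 3. try an authority */
      MARK_GUARDS_DEAD,     /* 4. give up until the next consensus */
    } md_fetch_step_t;

    static md_fetch_step_t
    next_md_fetch_step(md_fetch_step_t current)
    {
      if (current >= MARK_GUARDS_DEAD)
        return MARK_GUARDS_DEAD;
      return (md_fetch_step_t)(current + 1);
    }

    int
    main(void)
    {
      md_fetch_step_t step = TRY_DIR_MIRRORS;
      while (step != MARK_GUARDS_DEAD) {
        printf("step %d failed, escalating\n", (int)step + 1);
        step = next_md_fetch_step(step);
      }
      printf("still missing primary guard mds: mark them dead until the next consensus\n");
      return 0;
    }
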
Last edited 2 years ago by teor

comment:2 in reply to:  1 ; Changed 2 years ago by asn

Replying to teor:

Here's what we could do:

  1. Try some directory mirrors
  2. Try a fallback
  3. Try an authority
  4. If we still don't have mds for one or more primary guards, mark them dead until the next consensus

This deals with the scenario where:

  1. Authorities make new consensus with new mds (hh:00)
  2. Client bootstraps and downloads consensus from authorities (either at random because they are part of the fallback list, or due to options)
  3. Client chooses directory guards
  4. Client tries directory guards for new mds
  5. Directory guards are waiting for a random time between hh:00 and hh:30 to fetch new consensus and new mds. See https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n3240

Seems to me that the ways to deal with the edge case you describe above are:

a) Eventually clients try authorities to fetch mds if all else fails (bad for the health of dirauths). I think that's what you suggested basically.

b) We remove dirauths from the fallback list (less traffic on dirauths. any drawback?)

c) We make dirservers fetch new consensuses/mds much faster than 30mins delay (bad for health of dirauths).

Last edited 2 years ago by asn

comment:3 in reply to:  2 ; Changed 2 years ago by teor

Replying to asn:

Replying to teor:

Here's what we could do:

  1. Try some directory mirrors
  2. Try a fallback
  3. Try an authority
  4. If we still don't have mds for one or more primary guards, mark them dead until the next consensus

This deals with the scenario where:

  1. Authorities make new consensus with new mds (hh:00)
  2. Client bootstraps and downloads consensus from authorities (either at random because they are part of the fallback list, or due to options)
  3. Client chooses directory guards
  4. Client tries directory guards for new mds
  5. Directory guards are waiting for a random time between hh:00 and hh:30 to fetch new consensus and new mds. See https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n3240

Seems to me that the ways to deal with the edge case you describe above are:

a) Eventually clients try authorities to fetch mds if all else fails (bad for the health of dirauths). I think that's what you suggested basically.

Yes, we should implement this, if the other fixes don't resolve the md issue.
It's only bad for the authorities if a lot of clients do it all the time.

b) We remove dirauths from the fallback list (less traffic on dirauths. any drawback?)

You can't avoid this issue by stopping clients from contacting authorities, because there are other ways for a client to end up with a consensus whose microdescs its guards don't have yet.

And we already weight dirauths low on the fallback list, so not many clients contact them.

Removing authorities from the fallback list would break clients that disable fallbacks, and clients on non-standard networks. Also, it would break clients if too many fallbacks go down on the public network.

c) We make dirservers fetch new consensuses/mds much faster than 30mins delay (bad for health of dirauths).

You are right, this is bad because it requires every relay to do this 100% of the time, which is impossible, as well as being bad for the network.

comment:4 Changed 2 years ago by Sebastian

Load on dirauths is extremely light right now IMO. Having clients contact dirauths would be bad, but having relays contact dirauths early and a bit more aggressively (and maybe having dirauths ultra-aggressively try every other dirauth) doesn't sound like the end of the world to me. I am currently seeing less than 3MB/s (averaged over 30 second intervals) peak outgoing bandwidth on my dirauth which is basically negligible.

comment:5 in reply to:  4 Changed 2 years ago by teor

Replying to Sebastian:

Load on dirauths is extremely light right now IMO. Having clients contact dirauths would be bad, but having relays contact dirauths early and a bit more aggressively (and maybe having dirauths ultra-aggressively try every other dirauth) doesn't sound like the end of the world to me. I am currently seeing less than 3MB/s (averaged over 30 second intervals) peak outgoing bandwidth on my dirauth which is basically negligible.

I am not sure we can have relays try fast enough to make sure this bug never happens on clients. That would cause issues every hour from about hh:00 to hh:01.

Why don't we have clients remember where they got the consensus, and try it for any missing microdescs before trying an authority?
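
Something like this (a sketch with invented names, not actual tor code):

    /* Sketch: remember which directory server the consensus came from, and
     * prefer it when fetching mds that our guards don't have yet. */
    #include <stdio.h>
    #include <string.h>

    struct dir_source {
      char nickname[64];
      char address[64];
    };

    static struct dir_source consensus_source; /* set when the consensus arrives */

    static void
    remember_consensus_source(const char *nickname, const char *address)
    {
      strncpy(consensus_source.nickname, nickname, sizeof(consensus_source.nickname) - 1);
      strncpy(consensus_source.address, address, sizeof(consensus_source.address) - 1);
    }

    static const struct dir_source *
    preferred_md_source_when_guards_fail(void)
    {
      /* Whoever served us the consensus should also have its mds. */
      return &consensus_source;
    }

    int
    main(void)
    {
      remember_consensus_source("someFallback", "192.0.2.1:9030");
      printf("retry mds from %s (%s) before an authority\n",
             preferred_md_source_when_guards_fail()->nickname,
             preferred_md_source_when_guards_fail()->address);
      return 0;
    }
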

comment:6 Changed 2 years ago by Sebastian

I am thinking the right design for this would be a kind of staged distribution, where relays get to fetch a new consensus before it's valid (but don't use it yet). This might be quite tricky to implement with the current system though :/
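
Conceptually something like this (a rough sketch of the staging idea with invented names; the real dir code would be more involved):

    /* Rough sketch of staged distribution: a relay may already hold the
     * next consensus, but must not use or serve it before valid-after. */
    #include <stdio.h>
    #include <time.h>

    struct consensus {
      time_t valid_after;
      time_t valid_until;
    };

    static const struct consensus *
    pick_usable_consensus(const struct consensus *current,
                          const struct consensus *staged,
                          time_t now)
    {
      if (staged && now >= staged->valid_after)
        return staged;   /* the pre-fetched consensus has become valid */
      return current;    /* keep using the old one until then */
    }

    int
    main(void)
    {
      time_t now = time(NULL);
      struct consensus current = { now - 3600, now + 1800 };
      struct consensus staged  = { now + 600,  now + 5400 }; /* not valid yet */
      const struct consensus *use = pick_usable_consensus(&current, &staged, now);
      printf("using consensus with valid-after %ld\n", (long)use->valid_after);
      return 0;
    }
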

comment:7 in reply to:  3 ; Changed 2 years ago by asn

Replying to teor:

Replying to asn:

Seems to me that the ways to deal with the edge case you describe above are:

a) Eventually clients try authorities to fetch mds if all else fails (bad for the health of dirauths). I think that's what you suggested basically.

Yes, we should implement this, if the other fixes don't resolve the md issue.
It's only bad for the authorities if a lot of clients do it all the time.

True. But we have lots of clients, so I think before doing this we might want to calculate the probability of this happening, to try to understand how many clients will end up doing this behavior.

b) We remove dirauths from the fallback list (less traffic on dirauths. any drawback?)

You can't avoid this issue by stopping clients contacting authorities. Because there are other ways that a client can have a consensus with some microdescs that are not on its guards.

True. But it's less likely if dirauths are not in the picture, since basically your edge case is guaranteed to happen every time a client randomly picks a dirauth early in the hour (e.g. between hh:00 and hh:05).

And we already weight dirauths low on the fallback list, so not many clients contact them.

Removing authorities from the fallback list would break clients that disable fallbacks, and clients on non-standard networks. Also, it would break clients if too many fallbacks go down on the public network.

Hmm, I don't understand these points exactly. Can you expand? Why would clients break worse than currently if we remove dirauths from fallbacks? We can add a few more relays in the fallbacks to compensate.

comment:8 in reply to:  7 ; Changed 2 years ago by teor

I think we should implement an authority md fetch for clients that run out of microdesc attempts. And I think the authorities can easily handle the load of a few mds, because they are already handling a similar consensus load from clients and relays.

I also don't think removing authorities from the fallback list will help much, because bootstrapping clients try authorities anyway.

See below for details.

Replying to asn:

Replying to teor:

Replying to asn:

Seems to me that the ways to deal with the edge case you describe above are:

a) Eventually clients try authorities to fetch mds if all else fails (bad for the health of dirauths). I think that's what you suggested basically.

Yes, we should implement this, if the other fixes don't resolve the md issue.
It's only bad for the authorities if a lot of clients do it all the time.

True. But we have lots of clients, so I think before doing this we might want to calculate the probability of this happening, to try to understand how many clients will end up doing this behavior.

Yes, I think we should estimate how often it will happen. We can afford to have a few thousand clients download a few mds per hour (0.1% of 2 million clients per hour), because we already have a few thousand relays downloading two consensus flavours and all the new mds from the authorities, and the authorities are handling that load fine.

b) We remove dirauths from the fallback list (less traffic on dirauths. any drawback?)

You can't avoid this issue by stopping clients contacting authorities. Because there are other ways that a client can have a consensus with some microdescs that are not on its guards.

True. But it's less likely if dirauths are not in the picture, since basically your edge case is guaranteed to happen every time a client randomly picks a dirauth early in the hour (e.g. between hh:00 and hh:05).

Yes. Directory mirrors download at random between hh:00 and hh:30, so at hh:15 missing microdescriptors are guaranteed for 50% of the clients that bootstrap off authorities (9/(150*10 + 9) ~= 0.6% of clients bootstrap off authorities). Averaging that 50% over the whole hour, and assuming that clients bootstrap at random throughout the hour, this is 0.6% * 0.25 = 0.15% of bootstrapping clients per hour. So we can afford to have all these clients try an authority for their mds, because the number of bootstrapping clients is much lower than the number of running clients. (We could afford to have 0.15% of all clients do this, not just 0.15% of the bootstrapping ones.)

The actual figure is slightly higher than this, because after trying 3 fallbacks/authorities, 0.3.2 and later clients try an authority directly. When 10% of fallbacks are down, 0.1% of clients try an authority for this reason. But the authorities are already handling this consensus fetch traffic fine, so an extra few mds won't be an issue.

(For 0.3.1 and earlier clients, 100% try an authority and a fallback straight away when bootstrapping, and they pick whichever wins. So we might want to think a bit harder about backporting #17750 and #23347, if we also want to backport an authority md fetch to earlier versions.)

We could easily reduce the 0.3.2 client authority fetch to 0.115% (0.015% + 0.1%) by weighting the fallbacks at 100 rather than 10. But that doesn't remove the 0.1% that try an authority after 3 fallbacks. So I'm not sure re-weighting (or removing) would have the impact you want.
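
Spelling out that arithmetic (the 9 authorities, ~150 fallbacks and weight of 10 are the figures assumed above; the 0.25 is the 50%-at-hh:15 figure averaged over the whole hour):

    /* Worked version of the estimates above. The inputs are the assumed
     * figures from this comment, not measured values. */
    #include <stdio.h>

    int
    main(void)
    {
      double authorities = 9.0;
      double fallbacks = 150.0;
      double weight = 10.0;

      /* Fraction of bootstrapping clients that pick an authority. */
      double pick_auth = authorities / (fallbacks * weight + authorities);

      /* Mirrors fetch at a random time in [hh:00, hh:30], so a client that
       * bootstrapped from an authority has, averaged over the hour, about a
       * 25% chance of picking guards that are still missing the new mds. */
      double missing_factor = 0.25;

      printf("bootstrap off an authority: %.2f%%\n", pick_auth * 100.0);
      printf("affected per hour: %.2f%%\n", pick_auth * missing_factor * 100.0);

      /* Re-weighting fallbacks to 100 instead of 10: */
      double pick_auth_100 = authorities / (fallbacks * 100.0 + authorities);
      printf("with weight 100: %.3f%%\n", pick_auth_100 * missing_factor * 100.0);
      return 0;
    }
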

And we already weight dirauths low on the fallback list, so not many clients contact them.

Removing authorities from the fallback list would break clients that disable fallbacks, and clients on non-standard networks.

These clients would have nothing in the fallback list to bootstrap off, because they don't use the hard-coded public fallbacks. We can avoid this by only removing the authorities when using public fallbacks, but that makes the code hard to test in chutney.

Also, it would break clients if too many fallbacks go down on the public network.

Hmm, I don't understand these points exactly. Can you expand? Why would clients break worse than currently if we remove dirauths from fallbacks? We can add a few more relays in the fallbacks to compensate.

The idea of having authorities in the fallback list is that clients will use them if a large number of the fallbacks break for some reason (for example, a bad bug on mirrors). I am not sure if this actually works, but let's not break it until we are sure:

  • removing them will help this issue, and
  • removing them won't create any other issues.

I don't think removing authorities from the fallback list would help this issue, because 0.1% of bootstrapping clients will still try an authority when they fail 3 fallbacks.

comment:9 in reply to:  8 ; Changed 2 years ago by asn

Replying to teor:

I think we should implement an authority md fetch for clients that run out of microdesc attempts. And I think they can easily handle the load of a few mds, because they are handling a similar consensus load from clients and relays already.

I also don't think removing authorities from the fallback list will help much, because bootstrapping clients try authorities anyway.

I'm continuing the discussion here, although it's worth mentioning that teor also added some more calculations in #24113.

I think I can get behind doing an authority md fetch for clients that have failed too many microdesc attempts. To further reduce the load on dirauths, perhaps we should do this only if we are missing descriptors for some of our primary guards (i.e. only if we are missing very crucial mds), since clients can/should usually tolerate missing a few random mds.
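
Something along these lines, conceptually (hypothetical names, just to pin down the condition):

    /* Hypothetical predicate: only fall back to a dirauth md fetch when the
     * normal attempts are exhausted AND a primary guard's md is missing. */
    #include <stdbool.h>
    #include <stdio.h>

    struct md_fetch_status {
      int failed_attempts;        /* md download attempts so far */
      int max_attempts;           /* give-up threshold for mirrors/fallbacks */
      int missing_primary_guards; /* primary guards whose md we lack */
    };

    static bool
    should_try_authority_for_mds(const struct md_fetch_status *st)
    {
      if (st->failed_attempts < st->max_attempts)
        return false;                        /* keep trying mirrors/fallbacks */
      return st->missing_primary_guards > 0; /* only crucial mds justify the load */
    }

    int
    main(void)
    {
      struct md_fetch_status st = { 3, 3, 1 };
      printf("ask an authority: %s\n",
             should_try_authority_for_mds(&st) ? "yes" : "no");
      return 0;
    }
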

If we agree on the general concept here, I will come up with an implementation plan early next week.

comment:10 in reply to:  9 Changed 2 years ago by teor

Replying to asn:

Replying to teor:

I think we should implement an authority md fetch for clients that run out of microdesc attempts. And I think they can easily handle the load of a few mds, because they are handling a similar consensus load from clients and relays already.

I also don't think removing authorities from the fallback list will help much, because bootstrapping clients try authorities anyway.

I'm continuing the discussion here altho it's worth mentioning that teor also added some more calculations in #24113.

I think I can get behind doing an authority md fetch for clients that have failed too many microdesc attempts. To further reduce the load on dirauths, perhaps we should do this only if we are missing descriptors for some of our primary guards (i.e. only if we are missing very crucial mds), since clients can/should usually tolerate missing a few random mds.

I think asking an authority is a good idea.
Is it also worth asking a fallback first?
This might be another way to reduce load on the authorities.
And I think it would really help some clients if we do it, because some networks block authority addresses.

If we only ask an authority or fallback when we are missing a guard microdesc, this leaks our guards to the authority or fallback.
I think that is probably OK, because these queries are mixed in with a bunch of other client queries.
(Authorities see about as many client queries as they see relay queries.)

But here's what we can do to make the leak less obvious:

  • ask for all the missing microdescs, not just the primary guard ones
    • this has a very low impact, because we are already doing a request - we should definitely do it.
  • ask all the time, not just when we are missing primary guards
    • this has a higher impact, but I think we can easily afford to do it if we want to,
    • but I agree with you - I don't think we need to do it, so let's not bother right now.

Some detailed questions about the md request:

What if we are missing more microdescs than fit in a single request?
How do we make sure our primary guards are in that request?

What order do we usually use for md hashes in requests?
When we make multiple requests, do we usually split mds between them at random?
Do we usually sort the hashes to destroy ordering information?

(I can imagine myself writing a request that starts with the guard md hashes, and not realising I was leaking them.)
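
For example, something like this would avoid the "guards first" ordering: put every missing digest into one list and sort it before building the request. (A toy sketch, not the actual directory request code; the digests are fake placeholders.)

    /* Toy sketch: collect every missing md digest (guard and non-guard) and
     * sort the list before building the request, so the primary guard
     * digests are not identifiable by their position. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int
    cmp_digest(const void *a, const void *b)
    {
      return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    int
    main(void)
    {
      const char *missing[] = {
        "ZGlnZXN0MQ", "ZGlnZXN0Mg",  /* pretend these are our primary guards' mds */
        "YW5vdGhlcg", "b25lbW9yZQ",  /* other missing mds */
      };
      size_t n = sizeof(missing) / sizeof(missing[0]);

      /* Sorting (or shuffling) destroys the "guards first" ordering. */
      qsort(missing, n, sizeof(missing[0]), cmp_digest);

      printf("GET /tor/micro/d/");
      for (size_t i = 0; i < n; i++)
        printf("%s%s", i ? "-" : "", missing[i]);
      printf("\n");
      return 0;
    }
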

comment:11 Changed 2 years ago by teor

Summary: "When our directory guards don't have each others' microdescs, we should mark some dead" → "When our directory guards don't have each others' microdescs, we should try an authority or fallback"

Some more detailed design questions, after reviewing #23817:

  • what should we do when we are using bridges, or all the authorities and fallbacks are excluded by an EntryNodes setting?
    • should we fetch mds from a fallback or an authority over a 3-hop path?
    • is this what bridge clients do already, or do they give up when they can't get something from their bridge(s)?
  • if we are willing to fetch missing microdescs over a 3-hop path, can we make should_set_md_dirserver_restriction() always return 1?
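
To make that last question concrete, the decision might look roughly like this (a sketch around the questions above; names and structure are made up, and it is not a proposed patch to should_set_md_dirserver_restriction()):

    /* Hypothetical routing decision for a missing-md fetch. Illustrative
     * only; the real logic lives in the guard/dirserver restriction code. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef enum {
      MD_FETCH_FROM_GUARD,       /* normal case: a directory guard has it */
      MD_FETCH_DIRECT_FALLBACK,  /* one-hop to a fallback or authority */
      MD_FETCH_ANONYMOUS_3HOP,   /* 3-hop circuit, for bridge/EntryNodes users */
    } md_fetch_route_t;

    static md_fetch_route_t
    choose_md_fetch_route(bool guards_have_md, bool using_bridges,
                          bool dirservers_excluded_by_entrynodes)
    {
      if (guards_have_md)
        return MD_FETCH_FROM_GUARD;
      if (using_bridges || dirservers_excluded_by_entrynodes) {
        /* A direct fetch would bypass the user's path restrictions (and some
         * networks block authority addresses), so use a 3-hop path. */
        return MD_FETCH_ANONYMOUS_3HOP;
      }
      return MD_FETCH_DIRECT_FALLBACK;
    }

    int
    main(void)
    {
      printf("bridge client route: %d\n",
             (int)choose_md_fetch_route(false, true, false));
      return 0;
    }
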

comment:12 Changed 22 months ago by asn

Milestone: Tor: 0.3.2.x-final → Tor: 0.3.3.x-final

comment:13 Changed 21 months ago by nickm

Milestone: Tor: 0.3.3.x-final → Tor: 0.3.4.x-final
Type: defect → enhancement

Label a bunch of (arguable and definite) enhancements as enhancements for 0.3.4.

comment:14 Changed 20 months ago by s7r

My laptop's TBB 7.5 just got hit by this, I think. It is using the default guard context, no bridges. I am not sure if it's this or #21969. Basically, it is running but useless (it can't load any web page, and other apps using the SocksPort opened by this Tor instance are also disconnected). It has been in this state for more than 4 hours now. I will leave it to see if it recovers by itself. Tried new identity, but that does not fix it. Funny that the heartbeat still counts open circuits:

2/24/2018 11:05:35 AM.600 [NOTICE] Heartbeat: Tor's uptime is 2 days 17:59 hours, with 3 circuits open. I've sent 37.58 MB and received 116.72 MB. 
2/24/2018 11:05:35 AM.600 [NOTICE] Average packaged cell fullness: 29.597%. TLS write overhead: 5% 
2/24/2018 15:15:44 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:16:14 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:18:30 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:18:50 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:19:06 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:20:30 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:20:50 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:21:06 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:22:30 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:22:50 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:23:06 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:26:30 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:28:30 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:30:30 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:32:32 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 16:34:32 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 17:05:35 PM.600 [NOTICE] Heartbeat: Tor's uptime is 2 days 23:59 hours, with 7 circuits open. I've sent 40.25 MB and received 122.63 MB. 
2/24/2018 17:05:35 PM.600 [NOTICE] Average packaged cell fullness: 28.339%. TLS write overhead: 5% 
2/24/2018 17:12:13 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit) 
2/24/2018 17:16:52 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit) 
2/24/2018 17:38:32 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit) 
2/24/2018 17:41:02 PM.700 [NOTICE] New control connection opened from 127.0.0.1. 
2/24/2018 17:43:08 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit) 
2/24/2018 17:46:02 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 17:53:47 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 17:55:00 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:587. Giving up. (waiting for circuit) 
2/24/2018 17:55:47 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 17:57:00 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:587. Giving up. (waiting for circuit) 
2/24/2018 17:57:47 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 17:59:00 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:587. Giving up. (waiting for circuit) 
2/24/2018 17:59:56 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 18:01:56 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 18:03:56 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit) 
2/24/2018 18:20:34 PM.800 [NOTICE] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for 1/2 of our primary entry guards (total microdescriptors: 6051/6077). 
2/24/2018 18:20:34 PM.800 [NOTICE] I learned some more directory information, but not enough to build a circuit: We're missing descriptors for 1/2 of our primary entry guards (total microdescriptors: 6051/6077). 
2/24/2018 18:20:35 PM.200 [NOTICE] We now have enough directory information to build circuits. 
2/24/2018 19:03:18 PM.600 [NOTICE] Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
Last edited 20 months ago by s7r

comment:15 Changed 20 months ago by teor

Keywords: 033-maybe-must added
Milestone: Tor: 0.3.4.x-final → Tor: 0.3.3.x-final

Here are all the related tickets:

When Tor can't download microdescriptors (#21969), maybe it should try authorities or fallbacks (#23863), before it runs out of microdesc retries (#24113). But even after Tor has the microdescs it needs, it sometimes doesn't start building circuits again. Instead, it is in state "waiting for circuit" (#25347).

comment:16 Changed 20 months ago by teor

Keywords: 033-maybe-must removed
Milestone: Tor: 0.3.3.x-final → Tor: 0.3.4.x-final

I think we can fix this issue by fixing #25347 in 0.3.3 and backporting.

comment:17 Changed 19 months ago by nickm

Keywords: 034-triage-20180328 added

comment:18 Changed 19 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:19 Changed 19 months ago by nickm

Milestone: Tor: 0.3.4.x-final → Tor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

comment:20 Changed 17 months ago by starlight

Cc: starlight@… added

comment:21 Changed 4 months ago by teor

Keywords: needs-proposal added

#16844 and #21969 have conflicting goals, so we need to write a proposal that balances these goals. See #30817.
