This issue occurs because a directory guard that is missing microdescriptors we need is still considered live. We only mark a guard dead when we fail to connect to it. (When all guards are marked dead, we use a fallback or authority to fetch microdescriptors).
IIUC a directory guard can be naturally missing microdescriptors because they have not yet propagated sufficiently through the network, so marking it as a dead guard at that point is probably overkill (i.e. we might end up marking many (directory) guards as dead for no reason). Perhaps you mean marking it as dead only for the purpose of fetching that specific microdescriptor, and just skipping to the next directory guard for fetching that md? I think some sort of logic like that would be useful, especially when considering important mds like the primary guard mds (#21969 (moved)).
Perhaps what I'm suggesting above is your #23863 (moved)? Seems like these two tickets are connected.
Here is a draft of a suggested technical solution, which involves a cache tracking failed md downloads:
Introduce an md_failure_cache which tracks failed or missing md downloads per relay. That is, a cache with entries of the form: "We failed to fetch this md from relays X, Y, Z".
engineering XXX: how to figure out missing mds from received batches
When we want to fetch a batch of mds from relay X, check md_failure_cache to see if any of the batched mds is included in the cache for relay X. If that's the case, add relay X to the excluded list of nodes for dirservers, so that our directory subsystem skips relay X and moves to the next one in line.
engineering XXX: is there an exclude list for dirserver downloads?
When we succeed in fetching an md, remove its entries from the cache (so that we retry in the future from relay X which is probably our top primary guard).
Finally, to address #23863 (moved), when we detect that we are missing mds from our primary guards, order an immediate download of those missing mds, so that the previous steps get done.
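As a stand-alone illustration of the draft above, here is a minimal sketch of such a failure cache. All names are hypothetical and the containers are plain fixed-size C arrays for self-containment; the real tor code would use its digest256map_t/smartlist_t containers and base64 md digests.

```c
#include <string.h>

#define MAX_MDS 64
#define MAX_RELAYS 8

/* One failure-cache entry: "we failed to fetch md <md_hash> from <relays>". */
typedef struct {
  char md_hash[64];
  char relays[MAX_RELAYS][64];
  int n_relays;
} md_failure_entry_t;

static md_failure_entry_t md_failure_cache[MAX_MDS];
static int n_failures = 0;

static md_failure_entry_t *find_entry(const char *md_hash) {
  for (int i = 0; i < n_failures; i++)
    if (!strcmp(md_failure_cache[i].md_hash, md_hash))
      return &md_failure_cache[i];
  return NULL;
}

/* Record a failed md download from a given relay. */
void md_failure_note(const char *md_hash, const char *relay) {
  md_failure_entry_t *e = find_entry(md_hash);
  if (!e) {
    e = &md_failure_cache[n_failures++];
    strcpy(e->md_hash, md_hash);
    e->n_relays = 0;
  }
  strcpy(e->relays[e->n_relays++], relay);
}

/* Before fetching a batch: should we skip this relay for this md? */
int md_failure_should_skip(const char *md_hash, const char *relay) {
  const md_failure_entry_t *e = find_entry(md_hash);
  if (!e)
    return 0;
  for (int i = 0; i < e->n_relays; i++)
    if (!strcmp(e->relays[i], relay))
      return 1;
  return 0;
}

/* On a successful fetch, drop the md's entry so that we may retry
 * relay X (probably our top primary guard) in the future. */
void md_failure_on_success(const char *md_hash) {
  md_failure_entry_t *e = find_entry(md_hash);
  if (e)
    *e = md_failure_cache[--n_failures];
}
```

A caller implementing the batching step would consult md_failure_should_skip() for each md in the batch before choosing the dirserver, and call md_failure_on_success() when the md finally arrives.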
I'm inlining some logs demonstrating this issue for completeness.
Tor failed to fetch the md with hash AGkG6zefrl7Uvkqo+nfCI1fDbxjvSyzsNS5TiWa0JeU.
Tor's dirguards are:
79.137.112.5
54.36.38.63
86.105.212.130
Here are some logs demonstrating the issue:
Oct 25 16:10:10.000 [warn] Received answer to microdescriptor request (status 200, body size 254796) from server '79.137.112.5:443'
Oct 25 16:10:10.000 [warn] Initially requested 500 mds, got 495, missed 5
Oct 25 16:10:10.000 [warn] AGkG6zefrl7Uvkqo+nfCI1fDbxjvSyzsNS5TiWa0JeU failed 1 time(s); I'll try again in 1 seconds.
...
Oct 25 16:10:11.000 [warn] Received answer to microdescriptor request (status 404, body size 0) from server '79.137.112.5:443'
Oct 25 16:10:11.000 [warn] AGkG6zefrl7Uvkqo+nfCI1fDbxjvSyzsNS5TiWa0JeU failed 2 time(s); I'll try again in 3 seconds.
...
Oct 25 16:10:15.000 [warn] Received answer to microdescriptor request (status 404, body size 0) from server '79.137.112.5:443'
Oct 25 16:10:15.000 [warn] AGkG6zefrl7Uvkqo+nfCI1fDbxjvSyzsNS5TiWa0JeU failed 3 time(s); I'll try again in 9 seconds.
...
Oct 25 16:11:15.000 [warn] Received answer to microdescriptor request (status 200, body size 5182) from server '54.36.38.63:443'
Oct 25 16:11:15.000 [warn] AGkG6zefrl7Uvkqo+nfCI1fDbxjvSyzsNS5TiWa0JeU failed 4 time(s); I'll try again in 15 seconds.
...
Oct 25 16:12:15.000 [warn] Received answer to microdescriptor request (status 404, body size 0) from server '54.36.38.63:443'
Oct 25 16:12:15.000 [warn] AGkG6zefrl7Uvkqo+nfCI1fDbxjvSyzsNS5TiWa0JeU failed 5 time(s); I'll try again in 38 seconds.
...
Oct 25 16:13:15.000 [warn] Sent request to directory server '86.105.212.130:443': (purpose: 19, request size: 1123, payload size: 0)
Oct 25 16:13:16.000 [warn] Received answer to microdescriptor request (status 200, body size 10713) from server '86.105.212.130:443'
We basically see Tor failing to fetch the same md from 79.137.112.5:443 three times within 2 minutes, then failing to fetch it from 54.36.38.63:443 another two times within 2 minutes, until it tries 86.105.212.130, which succeeds (that dircache is probably more up-to-date than the other two).
If a fail cache was in place, we would have gone to 86.105.212.130 on our third try, instead of our sixth.
Here is an implementation plan of the failure cache idea from comment:4.
First of all, the interface of the failure cache:
We introduce a digest256map_t *md_fetch_fail_cache which maps the 256-bit md hash to a smartlist of dirguards through which we failed to fetch the md.
Now the code logic:
We populate md_fetch_fail_cache with dirguards in dir_microdesc_download_failed(). We remove them from the failure cache in microdescs_add_to_cache() when we successfully add an md to the cache.
We add another entry_guard_restriction_t restriction type in guards_choose_dirguard(). We currently have one restriction type which is designed to restrict guard nodes based on the exit node choice and its family. We want another type which uses a smartlist and restricts dirguards based on whether we have failed to fetch an md from that dirguard. We are gonna use this in step 3.
In directory_get_from_dirserver() we query the md failure cache and pass any results to directory_pick_generic_dirserver() and then to guards_choose_dirguard() which uses the new restriction type to block previously failed dirguards from being selected.
How does this sound to you?
There are two more steps we might want to do:
When we find that we are missing descs for our primary guards, we order an immediate download of the missing descs so that the above algorithm takes place.
If we fail to fetch the mds from all of our primary guards, we retry using fallback directories instead of trying deeper down our guard list. Teor suggested this, but it seems to be far from trivial to implement given the interface of our guard subsystem. If it provides big benefit we should consider it.
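To make the restriction/selection flow in the plan above concrete, here is a heavily simplified, stand-alone sketch. The types and plain-string guard identities are hypothetical; the real entry_guard_restriction_t and guards_choose_dirguard() live in tor's entrynodes.c and operate on guard state, not strings.

```c
#include <stddef.h>
#include <string.h>

/* Simplified restriction: the smartlist of dirguard identities we must not
 * pick because an md fetch through them failed. */
typedef struct {
  const char **excluded;
  int n_excluded;
} dirguard_restriction_t;

static int restriction_blocks(const dirguard_restriction_t *rst,
                              const char *guard) {
  for (int i = 0; rst && i < rst->n_excluded; i++)
    if (!strcmp(rst->excluded[i], guard))
      return 1;
  return 0;
}

/* Pick the first dirguard the restriction does not block; stands in for
 * guards_choose_dirguard() extended with the new restriction type. */
const char *choose_dirguard(const char **guards, int n_guards,
                            const dirguard_restriction_t *rst) {
  for (int i = 0; i < n_guards; i++)
    if (!restriction_blocks(rst, guards[i]))
      return guards[i];
  return NULL; /* all dirguards blocked: caller must fall back (cf. #23863) */
}
```

With the dirguards from the logs above, a restriction listing 79.137.112.5 and 54.36.38.63 would send the request straight to 86.105.212.130, which is exactly the "third try instead of sixth" improvement described earlier.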
Here is an implementation plan of the failure cache idea from comment:4.
First of all, the interface of the failure cache:
We introduce a digest256map_t *md_fetch_fail_cache which maps the 256-bit md hash to a smartlist of dirguards through which we failed to fetch the md.
Now the code logic:
We populate md_fetch_fail_cache with dirguards in dir_microdesc_download_failed(). We remove them from the failure cache in microdescs_add_to_cache() when we successfully add an md to the cache.
Successfully add that md to the cache?
Or any md from that dirguard?
I think this is ok, as long as we ask for mds in large enough batches.
We add another entry_guard_restriction_t restriction type in guards_choose_dirguard(). We currently have one restriction type which is designed to restrict guard nodes based on the exit node choice and its family. We want another type which uses a smartlist and restricts dirguards based on whether we have failed to fetch an md from that dirguard. We are gonna use this in step 3.
In directory_get_from_dirserver() we query the md failure cache and pass any results to directory_pick_generic_dirserver() and then to guards_choose_dirguard() which uses the new restriction type to block previously failed dirguards from being selected.
Do we block dirguards that have failed to deliver an md from downloads of that md?
Or do we block dirguards that have failed to deliver any mds from downloads of any md?
How does this sound to you?
There are two more steps we might want to do:
When we find that we are missing descs for our primary guards, we order an immediate download of the missing descs so that the above algorithm takes place.
This seems ok, though we will end up leaking our primary guards to our dirguards.
To reduce the exposure, we should try to trigger this as soon as we get the consensus, and batch these in with other md downloads. (That is, if we have any mds waiting, we should retry all of them, not just the one or two primary guard mds.)
If we fail to fetch the mds from all of our primary guards, we retry using fallback directories instead of trying deeper down our guard list. Teor suggested this, but it seems to be far from trivial to implement given the interface of our guard subsystem. If it provides big benefit we should consider it.
This is #23863 (moved), we get this behaviour automatically if we mark the guards as down (or change the definition of a "down" guard in directory_pick_generic_dirserver() so it checks the md failure cache).
Here is an implementation plan of the failure cache idea from comment:4.
First of all, the interface of the failure cache:
We introduce a digest256map_t *md_fetch_fail_cache which maps the 256-bit md hash to a smartlist of dirguards through which we failed to fetch the md.
Now the code logic:
We populate md_fetch_fail_cache with dirguards in dir_microdesc_download_failed(). We remove them from the failure cache in microdescs_add_to_cache() when we successfully add an md to the cache.
Successfully add that md to the cache?
Or any md from that dirguard?
I meant, we remove that md from the md_fetch_fail_cache if we manage to fetch that md from any dir.
I think this is ok, as long as we ask for mds in large enough batches.
We add another entry_guard_restriction_t restriction type in guards_choose_dirguard(). We currently have one restriction type which is designed to restrict guard nodes based on the exit node choice and its family. We want another type which uses a smartlist and restricts dirguards based on whether we have failed to fetch an md from that dirguard. We are gonna use this in step 3.
In directory_get_from_dirserver() we query the md failure cache and pass any results to directory_pick_generic_dirserver() and then to guards_choose_dirguard() which uses the new restriction type to block previously failed dirguards from being selected.
Do we block dirguards that have failed to deliver an md from downloads of that md?
Or do we block dirguards that have failed to deliver any mds from downloads of any md?
Yes, that's a good question that I forgot to address in this proposal.
I guess my design above was suggesting that we block dirguards "that have failed to deliver any mds from downloads of any md", until those mds get fetched from another dirserver and get removed from the failure cache.
That kinda penalizes dirservers that don't have a totally up-to-date md cache, which perhaps kinda makes sense. But perhaps before we become so strict, we should check whether we can improve the dirserver logic of fetching new mds so that they are almost always up-to-date if possible? Or we can do this last part after we implement the failure cache proposal?
Here is an implementation plan of the failure cache idea from comment:4.
First of all, the interface of the failure cache:
We introduce a digest256map_t *md_fetch_fail_cache which maps the 256-bit md hash to a smartlist of dirguards through which we failed to fetch the md.
Now the code logic:
We populate md_fetch_fail_cache with dirguards in dir_microdesc_download_failed(). We remove them from the failure cache in microdescs_add_to_cache() when we successfully add an md to the cache.
Successfully add that md to the cache?
Or any md from that dirguard?
I meant, we remove that md from the md_fetch_fail_cache if we manage to fetch that md from any dir.
I think this is ok, as long as we ask for mds in large enough batches.
We add another entry_guard_restriction_t restriction type in guards_choose_dirguard(). We currently have one restriction type which is designed to restrict guard nodes based on the exit node choice and its family. We want another type which uses a smartlist and restricts dirguards based on whether we have failed to fetch an md from that dirguard. We are gonna use this in step 3.
In directory_get_from_dirserver() we query the md failure cache and pass any results to directory_pick_generic_dirserver() and then to guards_choose_dirguard() which uses the new restriction type to block previously failed dirguards from being selected.
Do we block dirguards that have failed to deliver an md from downloads of that md?
Or do we block dirguards that have failed to deliver any mds from downloads of any md?
Yes, that's a good question that I forgot to address in this proposal.
I guess my design above was suggesting that we block dirguards "that have failed to deliver any mds from downloads of any md", until those mds get fetched from another dirserver and get removed from the failure cache.
I think this is the behaviour we want: trying each dir server for each specific md will mean that we time out, because there are more mds than there are dir servers.
There are two cases when this will cause us to fall back to the authorities:
we download a new consensus from the authorities, and they are the only ones with some of the mds in it
for some reason, an md referenced in the consensus is not mirrored correctly by relays or authorities (this would be a serious bug in tor)
To avoid this happening when it isn't necessary, we should expire failure cache entries after a random time. Maybe it should be the time when we expect dir servers to fetch a new consensus and new mds. I think this is 1-2 hours, but check dir-spec for the exact details. Or we could expire md failure caches each time we get a new consensus. That would be easier.
That kinda penalizes dirservers that don't have a totally up-to-date md cache, which perhaps kinda makes sense. But perhaps before we become so strict, we should check whether we can improve the dirserver logic of fetching new mds so that they are almost always up-to-date if possible? Or we can do this last part after we implement the failure cache proposal?
No, it's not possible to improve this behaviour without major changes to tor. A directory server can't fetch new mds until it fetches a new consensus that references them. And these fetches are done at random times to avoid flooding authorities with queries.
Here is an implementation plan of the failure cache idea from comment:4.
First of all, the interface of the failure cache:
We introduce a digest256map_t *md_fetch_fail_cache which maps the 256-bit md hash to a smartlist of dirguards through which we failed to fetch the md.
Now the code logic:
We populate md_fetch_fail_cache with dirguards in dir_microdesc_download_failed(). We remove them from the failure cache in microdescs_add_to_cache() when we successfully add an md to the cache.
Successfully add that md to the cache?
Or any md from that dirguard?
I meant, we remove that md from the md_fetch_fail_cache if we manage to fetch that md from any dir.
I think this is ok, as long as we ask for mds in large enough batches.
We add another entry_guard_restriction_t restriction type in guards_choose_dirguard(). We currently have one restriction type which is designed to restrict guard nodes based on the exit node choice and its family. We want another type which uses a smartlist and restricts dirguards based on whether we have failed to fetch an md from that dirguard. We are gonna use this in step 3.
In directory_get_from_dirserver() we query the md failure cache and pass any results to directory_pick_generic_dirserver() and then to guards_choose_dirguard() which uses the new restriction type to block previously failed dirguards from being selected.
Do we block dirguards that have failed to deliver an md from downloads of that md?
Or do we block dirguards that have failed to deliver any mds from downloads of any md?
Yes, that's a good question that I forgot to address in this proposal.
I guess my design above was suggesting that we block dirguards "that have failed to deliver any mds from downloads of any md", until those mds get fetched from another dirserver and get removed from the failure cache.
I think this is the behaviour we want: trying each dir server for each specific md will mean that we time out, because there are more mds than there are dir servers.
There are two cases when this will cause us to fall back to the authorities:
we download a new consensus from the authorities, and they are the only ones with some of the mds in it
for some reason, an md referenced in the consensus is not mirrored correctly by relays or authorities (this would be a serious bug in tor)
To avoid this happening when it isn't necessary, we should expire failure cache entries after a random time. Maybe it should be the time when we expect dir servers to fetch a new consensus and new mds. I think this is 1-2 hours, but check dir-spec for the exact details. Or we could expire md failure caches each time we get a new consensus. That would be easier.
ACK. Based on our discussion above, I think the implementation can be simplified since we decided to go with the "block dirguards that have failed to deliver any mds from downloads of any md" approach:
Hence, instead of keeping track of which mds we failed to download from which relays, we can just keep track of the relays that have outdated md information without caring about which mds they failed to serve us. This simplifies things a lot from an engineering perspective since we don't need to map mds to relays. We can just keep an outdated_dirserver_list smartlist with the list of outdated dirservers and ignore those when we try to fetch mds.
This is also consistent with my empirical experience, where most dirservers will usually provide us with all the mds we ask them for, except for a few dirservers which would refuse to give us 10 mds or so for a while. So we can just ban those dirservers for a while using the above logic.
And I agree with you, we should wipe the outdated_dirservers_list every time we fetch a new consensus. Should we also clear it at any other point? Maybe we should not let it grow to a huge size (cap it?).
I started implementing the original implementation plan today, before I realized the above simplification, so I will have to change some code to do this new design. I estimate that I will have an initial branch here early next week including unittests.
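As a stand-alone sketch of this simplified design: the names below loosely follow the relay_note_outdated_dirserver()/relay_is_outdated_dirserver() functions discussed later in the thread, but the code itself is a hypothetical simplification (plain C arrays instead of tor's smartlist_t), with the cap and clear-on-new-consensus behaviour proposed above.

```c
#include <stdio.h>
#include <string.h>

#define MAX_OUTDATED 30  /* proposed cap; growing past it suggests a bug */

static char outdated_dirservers[MAX_OUTDATED][64];
static int n_outdated = 0;

int relay_is_outdated_dirserver(const char *dir_id) {
  for (int i = 0; i < n_outdated; i++)
    if (!strcmp(outdated_dirservers[i], dir_id))
      return 1;
  return 0;
}

/* Remember a dirserver that failed to serve us mds; warn and wipe the
 * list if it somehow grows past the cap. */
void relay_note_outdated_dirserver(const char *dir_id) {
  if (relay_is_outdated_dirserver(dir_id))
    return;
  if (n_outdated >= MAX_OUTDATED) {
    fprintf(stderr, "BUG: outdated dirserver list grew past %d entries\n",
            MAX_OUTDATED);
    n_outdated = 0;
  }
  strcpy(outdated_dirservers[n_outdated++], dir_id);
}

/* Called whenever we fetch a new consensus: every dirserver gets a
 * clean slate. */
void clear_outdated_dirservers(void) {
  n_outdated = 0;
}
```

The dirserver-selection code would then skip any candidate for which relay_is_outdated_dirserver() returns true, with no per-md bookkeeping needed.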
And I agree with you, we should wipe the outdated_dirservers_list every time we fetch a new consensus. Should we also clear it at any other point? Maybe we should not let it grow to a huge size (cap it?).
If we are clearing it on every consensus, and doing exponential backoffs (average 2.5x) on every failure, then the list can't grow much beyond log2(3*60*60) ≈ 14 entries. (After 3 hours, we should either stop trying because our consensus is not live, or we should get another consensus and clear the list. Or, if we keep trying with reasonably live consensuses, the limit will be log2(24*60*60) ≈ 17 entries.) So let's log a BUG() warning when the list has > 30 entries, because something is wrong with our logic if that happens.
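The arithmetic behind that bound can be sanity-checked with a few lines, assuming the worst case for list growth where every retry merely doubles the previous wait (the backoff multiplier is random, averaging ~2.5x):

```c
/* Count how many retries fit in a window of `seconds`, if the first wait
 * is 1s and each subsequent wait at least doubles.  After k retries the
 * elapsed time is 2^k - 1 seconds, so the count is about log2(seconds). */
int retries_in(int seconds) {
  int wait = 1, n = 0;
  for (int t = 0; t < seconds; t += wait, wait *= 2)
    n++;
  return n;
}
/* retries_in(3*60*60) == 14 (live consensus window)
 * retries_in(24*60*60) == 17 (reasonably live consensus window)
 * Both comfortably under the proposed BUG() threshold of 30. */
```

This only bounds retries of a single md, which is why the thread later debates whether the list can still overgrow when many different mds fail at once.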
I'm not sure if we should clear the list when it gets too big: it seems like that could trigger slow zombie behaviour, if clients keep on retrying a microdescriptor that doesn't exist. So we have to do something different at that point. And we have to set the point early enough that clients get all the microdescs they need within a minute or so.
So, when is the list too big?
(Or when have we waited too long?)
Should we mark guards down if they depend on a microdesc we can't get?
Should we give up until the next consensus?
Should we try a fallback or an authority?
Should we do something different if we're waiting for an important (primary guard) microdesc?
Maybe we need another ticket to answer these questions.
Because I think some of them can wait until we see how this change performs.
It's a failure cache that works in the most basic way: if a dirserver fails to serve an md, we blacklist it until we get a new consensus, and skip it. Is this too harsh?
We don't do this for bridges because we might have only one dirserv basically.
We log a warn if the list grows more than 30 elements long because of comment:12.
The code is pretty clean I think and includes a unittest.
The changes file doesn't say which version this is a bugfix on. (Does "make check" pass?)
Should this be a backport candidate for any earlier versions?
It looks like the first two commits on this branch are actually part of #23862 (moved), and already merged into 0.3.2?
Minor: The relay_digest argument to dir_microdesc_download_failed() would be better named "dir_id" or something, since it's the identity of the failing directory server?
Digestmap_t would be a better type to use for outdated_dirserver_list, but we can fix that later if need be.
Should entries on outdated_dirserver_list expire after some amount of time, in addition to just "whenever we get the next consensus?"
I wonder if we can think of a better name for this than "outdated dirserver" -- it's not outdated itself -- it just failed to give us microdescriptors for some consensus we had.
Do we check to make sure that the consensus is fairly recent before marking a dirserver as "outdated" in this way?
Maybe we should make sure that we never mark an authority as outdated?
Adding teor in cc -- asn asked if you could review this patch too. I'd like to merge it in the alpha coming out in a couple of days, if possible, though I understand you might not be able to review on that schedule.
The changes file doesn't say which version this is a bugfix on. (Does "make check" pass?)
Some release early in the 0.3.0 series, probably 0.3.0.1-alpha.
Should this be a backport candidate for any earlier versions?
0.3.0 and 0.3.1.
It looks like the first two commits on this branch are actually part of #23862 (moved), and already merged into 0.3.2?
Minor: The relay_digest argument to dir_microdesc_download_failed() would be better named "dir_id" or something, since it's the identity of the failing directory server?
Digestmap_t would be a better type to use for outdated_dirserver_list, but we can fix that later if need be.
Should entries on outdated_dirserver_list expire after some amount of time, in addition to just "whenever we get the next consensus?"
This is potentially dangerous, because it could create slow zombies.
If we do this, 1 hour (consensus interval), 3 hours (live consensus), or 24 hours (reasonably live consensus) seem like reasonable expiry times, depending on when we stop fetching microdescs.
I wonder if we can think of a better name for this than "outdated dirserver" -- it's not outdated itself -- it just failed to give us microdescriptors for some consensus we had.
It's a dirserver that has a consensus that we don't.
Do we check to make sure that the consensus is fairly recent before marking a dirserver as "outdated" in this way?
We should check if our consensus is recent, and try to get a newer one.
Maybe we should make sure that we never mark an authority as outdated?
Yes, I think this is #23863 (moved): rather than marking an authority outdated, we should mark the relay with the missing microdesc as dead.
Until then, we should log a warning with the authority and specific microdesc id rather than marking the authority as dead.
(Is it possible for an authority to be out of sync and not serve the latest microdescs? What should happen is that it should adopt the latest majority consensus, download its microdescs, and serve them.)
Adding teor in cc -- asn asked if you could review this patch too. I'd like to merge it in the alpha coming out in a couple of days, if possible, though I understand you might not be able to review on that schedule.
Referencing a trac comment is fragile, let's summarise the exponential backoff reasoning instead:
(see #23817 comment:12 for teor's argument)
Something like "since we use exponential backoff for directory fetches, clients should not fetch microdescs more than log2(3*3600) ≈ 14 times per consensus. The backoff is random between last+1 and last*4, so we allow some extra fetches.".
Would it help to have the log contain the missing microdesc IDs, or at least the first N of them?
relay_is_outdated_dirserver should use smartlist_contains_string for consistency with relay_note_outdated_dirserver.
OK, I pushed a branch bug23817_v2 with the following changes:
It's now rebased on latest master (I will also prepare an 032 branch)
Renames relay_digest to dir_id in directory.c.
Adds a log message for failed mds.
Does not add dirauths as outdated dirservers.
Cleans up the outdated list if it grows above 30 elements.
A few notes:
I'm not checking if we have a recent consensus when we blacklist dirservers, as Nick suggested, which might actually be a problem. I think teor's suggestion of "check if our consensus is recent and try to get a newer one" is a whole project of its own that warrants its own ticket (or not?). What should we do here? Maybe we should not register outdated dirservers if we don't have a live consensus?
teor, I'm not sure if I agree with your exponential backoff + list size argument, since IIUC the exponential backoff applies only when we fail to fetch the same md multiple times, whereas the outdated list can overgrow just by failing many fetches for different mds. In any case, I opted for resetting the list after 30 elements, which is a bit arbitrary but perhaps can save us from any surprising issues (?).
I did not reset the list based on elapsed time because of teor's argument about clients fetching consensuses every 1 hour or so. Let me know if you don't like this.
I renamed relay_digest to dir_id in directory.c. You want me to do the same in the microdesc.c fail cache?