#24113 closed defect (fixed)

We stop trying to download an md after 8 failed tries

Reported by: asn Owned by:
Priority: Medium Milestone: Tor: 0.3.4.x-final
Component: Core Tor/Tor Version: Tor: 0.3.0.6
Severity: Normal Keywords: tor-guard, tor-bridge, tor-client
Cc: catalyst, isis, bmeson, mrphs, cypherpunks Actual Points:
Parent ID: #23814 Points:
Reviewer: Sponsor:

Description

The config var TestingMicrodescMaxDownloadTries specifies the number of times we are willing to try to download an md before we give up. It's set to 8 on the real network and 80 on testnets.
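For reference, a minimal torrc fragment showing how a testing network would set this knob (a sketch only; Testing* options generally require TestingTorNetwork to be enabled, and this option was later removed by #23814):

```
# Testing-network torrc fragment (sketch): raise the md retry cap to the
# testnet default of 80. Testing* options require TestingTorNetwork.
TestingTorNetwork 1
TestingMicrodescMaxDownloadTries 80
```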

This interacts badly with #21969: if we fail to fetch a primary guard md more than 8 times, we will give up on it and refuse to bootstrap.

Child Tickets

Change History (15)

comment:1 Changed 18 months ago by cypherpunks

Cc: cypherpunks added

comment:2 Changed 18 months ago by asn

Perhaps this restriction should not apply if the failed md belongs to one of the primary guards? This is a similar issue to #23985, where we treat all mds as equally important.

Perhaps we should introduce an is_this_md_important() function which checks whether an md belongs to a primary guard, and treat it differently if so?

comment:3 Changed 18 months ago by teor

In the past, we have added a new option, like TestingImportantMicrodescMaxDownloadTries.

comment:4 Changed 17 months ago by asn

Actually, do we even want this feature? Isn't exponential backoff what we want here, and aren't we already doing it? What's the point of doing both exponential backoff and completely blocking fetches after 8 tries?

Furthermore, this limit seems to also apply to dirservers, and we don't want dirservers to abandon md downloads, since their md caches would then stay incomplete.

Do we have a reason for keeping this feature? Perhaps there is a reason to have it in testing networks (hence the Testing prefix in the name)? I'm trigger-happy here.

Last edited 17 months ago by asn

comment:5 Changed 17 months ago by teor

We can only remove the retry limit if we are sure exponential backoff works.
(There is still a global retry limit of 255 for every individual directory document.)

Why not work out the actual number of retries we need, and increase it to that?

(I would guess that we shouldn't retry a single md more than 16 or 20 times, but that's just a guess.)

Or, let's set the limit lower (say 6?) and try an authority when we reach it, then stop.

comment:6 in reply to:  5 ; Changed 17 months ago by asn

Replying to teor:

> We can only remove the retry limit if we are sure exponential backoff works.
> (There is still a global retry limit of 255 for every individual directory document.)

Hm. Are you saying that there is a chance that exponential backoff doesn't work?

> Why not work out the actual number of retries we need, and increase it to that?
>
> (I would guess that we shouldn't retry a single md more than 16 or 20 times, but that's just a guess.)

Is there actually a number of retries that guarantees to give us mds? I don't think so, especially when you consider edge-cases like comment:1:ticket:23863.

comment:7 in reply to:  6 Changed 17 months ago by teor

Replying to asn:

> Replying to teor:
>
>> We can only remove the retry limit if we are sure exponential backoff works.
>> (There is still a global retry limit of 255 for every individual directory document.)
>
> Hm. Are you saying that there is a chance that exponential backoff doesn't work?

Yes, we have regular bugs in this subsystem.

And yes, if application activity keeps on resetting the exponential backoff on md fetches, we will reach the download limit.

> Why not work out the actual number of retries we need, and increase it to that?
>
> (I would guess that we shouldn't retry a single md more than 16 or 20 times, but that's just a guess.)
>
> Is there actually a number of retries that guarantees to give us mds? I don't think so, especially when you consider edge-cases like comment:1:ticket:23863.

Yes. Almost all clients will get them after 12 tries, but it will take them hours.

At hh:00, no tries will ever work.
At hh:01, 30 tries.
At hh:02, 15 tries.
At hh:03, 10 tries.
At hh:06, 5 tries.
At hh:10, 3 tries.
At hh:15, 2 tries.
From hh:30 to hh:59, 1 try.

Exponential backoff means that clients will make their 8 attempts after this many seconds on average, where each attempt time follows from the last as next = (last + 1 + (1 + 4*last)) / 2:
0, 1, 3, 8, 21, 53, 133, 333 (8 tries), 833, 2083

For a client that started at hh:00, this means they try in minute:
0, 0, 0, 0, 0, 1, 2, 6 (8 tries), 14, 35

This gives them about a 30% chance of fetching a new md after 8 tries and 6 minutes in the worst case scenario. Or a near-100% chance of fetching a new md after 10 tries and 35 minutes in the worst case scenario. (The analysis gets complex after this, because times over 60 minutes wrap around to the new consensus.)

Of course, only 1.5% of clients experience the worst case scenario: clients that bootstrap later in the hour have a much better experience.
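As a cross-check on the figures above, a short Python sketch of the stated approximation (using integer arithmetic, starting from 0) reproduces both the per-try seconds and the minute marks; the rounding of seconds to minutes is my reconstruction of how the second list was derived:

```python
def backoff_attempt_times(tries):
    """Average time in seconds of each download attempt, using the
    approximation from this comment: next = (last + 1 + (1 + 4*last)) // 2."""
    times = [0]
    while len(times) < tries:
        last = times[-1]
        times.append((last + 1 + (1 + 4 * last)) // 2)
    return times

secs = backoff_attempt_times(10)
print(secs)                           # [0, 1, 3, 8, 21, 53, 133, 333, 833, 2083]
print([round(s / 60) for s in secs])  # minute marks: [0, 0, 0, 0, 0, 1, 2, 6, 14, 35]
```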

So here are your options:

If you want all clients to get guard mds after 30 seconds, you should make them try an authority on the 4th try. (Remember, the 5th try is an average of 21 seconds.)

If you want all clients to get all their mds after 1 minute, you should make them try an authority on the 5th try.

If you want all clients to get all their mds after 30 minutes, you should make them try 8 or 9 times, and then make the remainder try an authority. (Which is a strange coincidence, because I bet someone guessed the default of 8 tries when they were designing the old fixed-delay schedules.)

(Not trying an authority is either slow, or it is unreliable, or both.)

Edit: fix some minor mistakes in the figures, and explain how many clients are hit with worst-case

Last edited 17 months ago by teor

comment:8 Changed 17 months ago by teor

After reviewing #23817, I have a question:

  • should we make the number of guards in should_set_md_dirserver_restriction() less than the number of microdesc fetches?

comment:9 Changed 16 months ago by asn

Milestone: Tor: 0.3.2.x-final → Tor: 0.3.3.x-final

comment:10 Changed 14 months ago by asn

Milestone: Tor: 0.3.3.x-final → Tor: 0.3.4.x-final

comment:11 Changed 14 months ago by teor

Keywords: 033-maybe-must added
Milestone: Tor: 0.3.4.x-final → Tor: 0.3.3.x-final

We are potentially seeing this issue in 0.3.2.9, see the logs on #21969.

When Tor can't download microdescriptors (#21969), maybe it should try authorities or fallbacks (#23863), before it runs out of microdesc retries (#24113). But even after Tor has the microdescs it needs, it sometimes doesn't start building circuits again. Instead, it is in state "waiting for circuit" (#25347).

comment:12 Changed 14 months ago by teor

Keywords: 033-maybe-must removed
Milestone: Tor: 0.3.3.x-final → Tor: 0.3.4.x-final

I think we just need to fix #25347 in 0.3.3.

comment:13 Changed 14 months ago by teor

Status: new → needs_information

Merging #23814 will remove MaxDownloadTries and obsolete this ticket.

comment:14 Changed 14 months ago by teor

Parent ID: #21969 → #23814
Status: needs_information → new

comment:15 Changed 14 months ago by nickm

Resolution: fixed
Status: new → closed

#23814 is now merged; this is now obsolete.
