Opened 5 years ago

Closed 4 years ago

Last modified 3 years ago

#6752 closed enhancement (implemented)

TestingTorNetwork doesn't lower the dir fetch retry schedules

Reported by: arma Owned by:
Priority: High Milestone: Tor: 0.2.5.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-client small-feature
Cc: robgjansen, karsten, cwacek Actual Points:
Parent ID: #7172 Points:
Reviewer: Sponsor:

Description

https://trac.torproject.org/projects/tor/ticket/6341#comment:34 shows a lot of socks timeouts from Tor clients in a Testing Tor Network. Apparently these clients didn't get enough directory info to establish circuits, so they just fail all their application requests. The issue is apparently exacerbated by #3196, where we demanded that more descriptors be present before considering ourselves bootstrapped.

Perhaps the real problem here is that we keep the normal dir fetch retry schedules even when TestingTorNetwork is set? It looks like TestingTorNetwork makes a new consensus every 5 minutes, but client_dl_schedule is "0, 0, 60, 60*5, 60*10, INT_MAX".

Should we lower the retry schedules?

Has it been the case this whole time that clients in testing networks typically don't have all the descriptors they'd want?

Child Tickets

Change History (48)

comment:1 follow-up: Changed 5 years ago by Sebastian

puppetor had a hupuntilup function or something to work around this. Maybe that's a quick fix for shadow, too

comment:2 in reply to: ↑ 1 Changed 5 years ago by arma

Replying to Sebastian:

puppetor had a hupuntilup function or something to work around this. Maybe that's a quick fix for shadow, too

Hupping Tor doesn't change its dir fetching plans these days. So I bet that hack doesn't work in puppetor anymore either.

comment:3 Changed 5 years ago by arma

rransom suggested "0, 0, 15, 30, 60, 75, INT_MAX" as the replacement dir fetching schedule, noting "The last fetch should be at 3 minutes with that schedule (pretending for a moment that dir fetch attempts succeed or fail instantaneously)."
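Applied to the client schedule in directory.c, that suggestion would look roughly like this (a sketch only; the array is the same client_dl_schedule quoted in the diff in comment:12 below):

/** Sketch: rransom's suggested client download schedule (not merged code). */
static const int client_dl_schedule[] = {
  0, 0, 15, 30, 60, 75, INT_MAX
};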

comment:4 Changed 5 years ago by arma

I wonder if we'd be happier with a hack that, rather than specifying alternate schedules, just divides all the interval values by 12 if TestingTorNetwork is on (since 60/5 = 12).
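Sketched out, that hack would be something like the following (illustrative only; scaled_interval is a made-up helper, not existing Tor code):

/* Illustrative only: scale a schedule entry down when TestingTorNetwork is
 * set, since the normal 60-minute consensus interval shrinks to 5 minutes
 * in a testing network (60/5 = 12). */
static int
scaled_interval(const or_options_t *options, int interval)
{
  if (options->TestingTorNetwork && interval != INT_MAX)
    return interval / 12;
  return interval;
}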

comment:5 Changed 5 years ago by arma

routerlist.c:#define MAX_CLIENT_INTERVAL_WITHOUT_REQUEST (10*60)

This one is another huge problem. It means that any client using TestingTorNetwork but where directory_fetches_dir_info_early(options) is false has to win its directory fetches on the first try or it'll be ten more minutes (two consensus periods) until it tries again.

comment:6 Changed 5 years ago by nickm

I'm fine with any of the approaches described above in 0.2.4, though I think that rransom's replacement dir fetch interval idea is probably cleaner than a divide-by-five.

comment:7 Changed 5 years ago by nickm

  • Keywords tor-client added

comment:8 Changed 5 years ago by nickm

  • Component changed from Tor Client to Tor

comment:9 Changed 5 years ago by arma

I think the saferlab folks ran into this bug today.

comment:10 Changed 5 years ago by arma

  • Parent ID set to #7172

comment:11 Changed 5 years ago by karsten

  • Cc karsten added

comment:12 Changed 5 years ago by arma

Here's the diff I gave Chris, who used it successfully for the December demo:

diff --git a/src/or/directory.c b/src/or/directory.c
index 1d511b5..2ba5d54 100644
--- a/src/or/directory.c
+++ b/src/or/directory.c
@@ -3616,7 +3616,7 @@ static const int server_dl_schedule[] = {
 };
 /** Schedule for when clients should download things in general. */
 static const int client_dl_schedule[] = {
-  0, 0, 60, 60*5, 60*10, INT_MAX
+  0, 0, 5, 10, 15, 20, 30, 60
 };
 /** Schedule for when servers should download consensuses. */
 static const int server_consensus_dl_schedule[] = {
@@ -3624,7 +3624,7 @@ static const int server_consensus_dl_schedule[] = {
 };
 /** Schedule for when clients should download consensuses. */
 static const int client_consensus_dl_schedule[] = {
-  0, 0, 60, 60*5, 60*10, 60*30, 60*60, 60*60, 60*60, 60*60*3, 60*60*6, 60*60*12
+  0, 0, 5, 10, 15, 20, 30, 60
 };
 /** Schedule for when clients should download bridge descriptors. */
 static const int bridge_dl_schedule[] = {

I also added

diff --git a/src/or/main.c b/src/or/main.c
index 446836a..e3b9345 100644
--- a/src/or/main.c
+++ b/src/or/main.c
@@ -148,7 +148,7 @@ int can_complete_circuit=0;
 
 /** How often do we check for router descriptors that we should download
  * when we have too little directory info? */
-#define GREEDY_DESCRIPTOR_RETRY_INTERVAL (10)
+#define GREEDY_DESCRIPTOR_RETRY_INTERVAL (5)
 /** How often do we check for router descriptors that we should download
  * when we have enough directory info? */
 #define LAZY_DESCRIPTOR_RETRY_INTERVAL (60)
diff --git a/src/or/nodelist.c b/src/or/nodelist.c
index 95345fb..3b42994 100644
--- a/src/or/nodelist.c
+++ b/src/or/nodelist.c
@@ -1345,10 +1345,10 @@ update_router_have_minimum_dir_info(void)
 
 /* What fraction of desired server descriptors do we need before we will
  * build circuits? */
-#define FRAC_USABLE_NEEDED .75
+#define FRAC_USABLE_NEEDED .5
 /* What fraction of desired _exit_ server descriptors do we need before we
  * will build circuits? */
-#define FRAC_EXIT_USABLE_NEEDED .5
+#define FRAC_EXIT_USABLE_NEEDED .3
 
   if (num_present < num_usable * FRAC_USABLE_NEEDED) {
     tor_snprintf(dir_info_status, sizeof(dir_info_status),
diff --git a/src/or/routerlist.c b/src/or/routerlist.c
index 1735837..6688591 100644
--- a/src/or/routerlist.c
+++ b/src/or/routerlist.c
@@ -3987,7 +3987,7 @@ initiate_descriptor_downloads(const routerstatus_t *source,
 #define MAX_DL_TO_DELAY 16
 /** When directory clients have only a few servers to request, they batch
  * them until they have more, or until this amount of time has passed. */
-#define MAX_CLIENT_INTERVAL_WITHOUT_REQUEST (10*60)
+#define MAX_CLIENT_INTERVAL_WITHOUT_REQUEST 5
 
 /** Given a <b>purpose</b> (FETCH_MICRODESC or FETCH_SERVERDESC) and a list of
  * router descriptor digests or microdescriptor digest256s in

You'll notice in the download schedules, I don't have any INT_MAX at the end -- it just keeps trying, often, for every descriptor. In a closed Tor network that *should* be safe to do.

More generally, there seem to be two use cases for TestingTorNetwork here: are you attempting to faithfully reproduce timing/etc problems from the real Tor network, or is the goal just-run-the-damn-Tor-network-and-make-it-work?

comment:13 Changed 5 years ago by nickm

  • Keywords small-feature added

We could do this if a patch makes it in for the small-features deadline, but I don't know if I'll have time to write one.

More generally, there seem to be two use cases for TestingTorNetwork here: are you attempting to faithfully reproduce timing/etc problems from the real Tor network, or is the goal just-run-the-damn-Tor-network-and-make-it-work?

I use it mostly for the latter; but if people use it for the former, this needs more thought. Perhaps this option needs another name, and needs to be settable-only-when-TestingTorNetwork==1.

comment:14 Changed 5 years ago by nickm

  • Type changed from defect to enhancement

comment:15 follow-ups: Changed 4 years ago by karsten

I started looking into this today, and I think we should add new config options, e.g., TestingClientDownloadSchedule (accepting a CSV list), TestingClientConsensusDownloadSchedule (accepting a CSV list), and TestingClientMaxIntervalWithoutRequest (accepting an INTERVAL) that can only be changed if TestingTorNetwork is set. I hope to get away without changing all those other constants that arma changed in the diff he gave to Chris. The fewer new torrc options we add, the better. But I think we'll have to create separate options for these things, rather than magically changing timings when TestingTorNetwork is set.
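For concreteness, entries in config.c's option table for those might look something like this (a sketch only; the option names and types follow the proposal above, while the placeholder defaults, mirroring the current hard-coded schedules, are assumptions and not merged code):

  V(TestingClientDownloadSchedule,          CSV,      "0, 0, 60, 300, 600"),
  V(TestingClientConsensusDownloadSchedule, CSV,      "0, 0, 60, 300, 600, 1800, 3600"),
  V(TestingClientMaxIntervalWithoutRequest, INTERVAL, "10 minutes"),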

But before I write a patch, how would I reproduce the situation where clients don't bootstrap because of too high dir fetch retry schedules? I tried a tiny-m1.large network with Tor 0.2.3.25, but scallion.log looks normal to me. What log messages would I look for? Or how would I change the configuration to reproduce the problem?

With respect to the use case where people attempt to faithfully reproduce timing problems: we're already changing plenty of timings in TestingTorNetwork mode. If this use case exists, people should manually reset timing-related options to non-TestingTorNetwork defaults. Not directly related to this issue though.

comment:16 in reply to: ↑ 15 ; follow-up: Changed 4 years ago by nickm

Replying to karsten:

I started looking into this today, and I think we should add new config options, e.g., TestingClientDownloadSchedule (accepting a CSV list), TestingClientConsensusDownloadSchedule (accepting a CSV list), and TestingClientMaxIntervalWithoutRequest (accepting an INTERVAL) that can only be changed if TestingTorNetwork is set. I hope to get away without changing all those other constants that arma changed in the diff he gave to Chris. The fewer new torrc options we add, the better. But I think we'll have to create separate options for these things, rather than magically changing timings when TestingTorNetwork is set.

I think that approach sounds reasonable to me.

But before I write a patch, how would I reproduce the situation where clients don't bootstrap because of too high dir fetch retry schedules? I tried a tiny-m1.large network with Tor 0.2.3.25, but scallion.log looks normal to me. What log messages would I look for? Or how would I change the configuration to reproduce the problem?

I haven't run into this myself; maybe Rob would know? Sometimes Chutney gets into a state where the network needs to be restarted after the authorities bootstrap. You could try that; ping me if you need help.

comment:17 in reply to: ↑ 16 Changed 4 years ago by robgjansen

Replying to nickm:

Replying to karsten:

But before I write a patch, how would I reproduce the situation where clients don't bootstrap because of too high dir fetch retry schedules? I tried a tiny-m1.large network with Tor 0.2.3.25, but scallion.log looks normal to me. What log messages would I look for? Or how would I change the configuration to reproduce the problem?

I haven't run into this myself; maybe Rob would know? Sometimes Chutney gets into a state where the network needs to be restarted after the authorities bootstrap. You could try that; ping me if you need help.

I asked Chris about this problem, as I believe he has more experience with it than I. Here is his response (Note that he was not using Shadow):

We ran into that situation because of a slightly pathological case in our code. It happens frequently if descriptors get updated and pushed to the directories more often than once per consensus period, which our code was doing. This significantly exacerbates the general problem by increasing the number of failures. I'm not sure if that's a viable test case though, since it's bad behavior (and we've changed our code to no longer do that).


The problem may be more generally reproducible simply by starting the directories at exactly the same time you start the clients. The directories won't have completed the consensus negotiation (assuming a TestingTorNetwork interval of 5 minutes) by the time the clients get into the 60*5 back off period, so the clients will back off for 10 minutes.


For this to work, you probably need 5 authoritative directories (to make sure their negotiations take a while).

comment:18 follow-up: Changed 4 years ago by karsten

I tried a Shadow network with 5 authorities and with clients starting at the same time as authorities, but I can't reproduce this situation. I applied this patch with a crazy retry schedule and with log messages to notice when clients switched to a different retry interval:

diff --git a/src/or/directory.c b/src/or/directory.c
index f235bf3..b654a85 100644
--- a/src/or/directory.c
+++ b/src/or/directory.c
@@ -3625,7 +3625,8 @@ static const int server_dl_schedule[] = {
 };
 /** Schedule for when clients should download things in general. */
 static const int client_dl_schedule[] = {
-  0, 0, 60, 60*5, 60*10, INT_MAX
+  //0, 0, 60, 60*5, 60*10, INT_MAX
+  15, INT_MAX
 };
 /** Schedule for when servers should download consensuses. */
 static const int server_consensus_dl_schedule[] = {
@@ -3633,7 +3634,8 @@ static const int server_consensus_dl_schedule[] = {
 };
 /** Schedule for when clients should download consensuses. */
 static const int client_consensus_dl_schedule[] = {
-  0, 0, 60, 60*5, 60*10, 60*30, 60*60, 60*60, 60*60, 60*60*3, 60*60*6, 60*60*12
+  //0, 0, 60, 60*5, 60*10, 60*30, 60*60, 60*60, 60*60, 60*60*3, 60*60*6, 60*60*12
+  15, INT_MAX
 };
 /** Schedule for when clients should download bridge descriptors. */
 static const int bridge_dl_schedule[] = {
@@ -3708,14 +3710,14 @@ download_status_increment_failure(download_status_t *dls, int status_code,
 
   if (item) {
     if (increment == 0)
-      log_debug(LD_DIR, "%s failed %d time(s); I'll try again immediately.",
+      log_info(LD_DIR, "XXX6752 %s failed %d time(s); I'll try again immediately.",
                 item, (int)dls->n_download_failures);
     else if (dls->next_attempt_at < TIME_MAX)
-      log_debug(LD_DIR, "%s failed %d time(s); I'll try again in %d seconds.",
+      log_info(LD_DIR, "XXX6752 %s failed %d time(s); I'll try again in %d seconds.",
                 item, (int)dls->n_download_failures,
                 (int)(dls->next_attempt_at-now));
     else
-      log_debug(LD_DIR, "%s failed %d time(s); Giving up for a while.",
+      log_info(LD_DIR, "XXX6752 %s failed %d time(s); Giving up for a while.",
                 item, (int)dls->n_download_failures);
   }
   return dls->next_attempt_at;
@@ -3738,6 +3740,8 @@ download_status_reset(download_status_t *dls)
   find_dl_schedule_and_len(dls, get_options()->DirPort_set,
                            &schedule, &schedule_len);
 
+  if (dls->n_download_failures)
+    log_info(LD_DIR, "XXX6752 Resetting download status.");
   dls->n_download_failures = 0;
   dls->next_attempt_at = time(NULL) + schedule[0];
 }

Here's the result:

$ grep webclient1 data/scallion.log | grep XXX6752
0:0:10:543339 [thread-0] 0:3:2:000000010 [scallion-info] [webclient1-82.1.0.0] [intercept_logv] [info] void download_status_reset(download_status_t *)() XXX6752 Resetting download status.
0:0:13:251122 [thread-0] 0:6:2:000000011 [scallion-info] [webclient1-82.1.0.0] [intercept_logv] [info] void download_status_reset(download_status_t *)() XXX6752 Resetting download status.

Note that there are no failures in those logs. Also, clients bootstrap just fine, though it takes 10 simulated minutes to do so:

$ grep webclient1 data/scallion.log | grep Bootstrap
0:0:2:638041 [thread-0] 0:0:2:000000000 [scallion-info] [webclient1-82.1.0.0] [intercept_logv] [info] Bootstrapped 0%: Starting.
0:0:9:968690 [thread-0] 0:1:3:000000000 [scallion-message] [webclient1-82.1.0.0] [intercept_logv] [notice] Bootstrapped 5%: Connecting to directory server.
0:0:9:974529 [thread-0] 0:1:3:061000001 [scallion-message] [webclient1-82.1.0.0] [intercept_logv] [notice] Bootstrapped 10%: Finishing handshake with directory server.
0:0:9:990886 [thread-0] 0:1:3:247299267 [scallion-message] [webclient1-82.1.0.0] [intercept_logv] [notice] Bootstrapped 15%: Establishing an encrypted directory connection.
0:0:9:995853 [thread-0] 0:1:3:325961616 [scallion-message] [webclient1-82.1.0.0] [intercept_logv] [notice] Bootstrapped 20%: Asking for networkstatus consensus.
0:0:10:001487 [thread-0] 0:1:3:397083916 [scallion-message] [webclient1-82.1.0.0] [intercept_logv] [notice] Bootstrapped 25%: Loading networkstatus consensus.
0:0:15:171272 [thread-0] 0:7:9:696519922 [scallion-message] [webclient1-82.1.0.0] [intercept_logv] [notice] Bootstrapped 40%: Loading authority key certs.
0:0:15:193358 [thread-0] 0:7:9:822844471 [scallion-message] [webclient1-82.1.0.0] [intercept_logv] [notice] Bootstrapped 45%: Asking for relay descriptors.
0:0:17:437195 [thread-0] 0:10:8:851821396 [scallion-message] [webclient1-82.1.0.0] [intercept_logv] [notice] Bootstrapped 80%: Connecting to the Tor network.
0:0:17:437503 [thread-0] 0:10:8:851821396 [scallion-message] [webclient1-82.1.0.0] [intercept_logv] [notice] Bootstrapped 90%: Establishing a Tor circuit.
0:0:17:487530 [thread-0] 0:10:9:651673193 [scallion-message] [webclient1-82.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.

How would I force clients to make just two attempts, with the second attempt happening 15 seconds after the first, and then wait forever?

comment:19 Changed 4 years ago by nickm

How would I force clients to make just two attempts, with the second attempt happening 15 seconds after the first, and then wait forever?

You could #ifdef out the contents of download_status_reset, I guess? But that's probably not what's causing this bug. Any other ideas?

comment:20 Changed 4 years ago by robgjansen

  • Cc cwacek added

I just cc'ed Chris, as I really have no experience with this bug.

comment:21 in reply to: ↑ 15 Changed 4 years ago by arma

Replying to karsten:

But before I write a patch, how would I reproduce the situation where clients don't bootstrap because of too high dir fetch retry schedules?

My first suggestion would be to start your client before a consensus is available, i.e. before the directory authorities have computed a consensus. I expect your Tor client will try twice, and then wait a whole minute before trying a third time.

A more realistic, but more subtle to reproduce case, would be one where there's a bug in Tor that prevents some of the directory mirrors or authorities from having a copy of the consensus. Then clients that try to bootstrap from those dir points will fail (nothing to get) and not retry on a proper time schedule.

comment:22 Changed 4 years ago by arma

As we learned in #6341, if the *mirrors* don't get a consensus the first time they try, they also wait way too long to try again. This problem exacerbates the client problem, since it raises the chances that clients will fail to get something on the first try.

I believe my patch above doesn't crank down the mirror retry schedule. But we should.

comment:23 in reply to: ↑ 18 Changed 4 years ago by arma

Replying to karsten:

Note that there are no failures in those logs. Also, clients bootstrap just fine, though it takes 10 simulated minutes to do so:

Ten minutes is sure a long time in a demo.

How would I force clients to make just two attempts, with the second attempt happening 15 seconds after the first, and then wait forever?

Set the schedule to {0, 15, INT_MAX}, yes?

comment:24 in reply to: ↑ 15 Changed 4 years ago by arma

Replying to karsten:

I hope to get away without changing all those other constants that arma changed in the diff he gave to Chris.

I think you can leave GREEDY_DESCRIPTOR_RETRY_INTERVAL and FRAC_USABLE_NEEDED and FRAC_EXIT_USABLE_NEEDED alone, without hurting usability much. I think MAX_CLIENT_INTERVAL_WITHOUT_REQUEST is way way way too high for a testing network's other time parameters.

comment:25 in reply to: ↑ 15 ; follow-up: Changed 4 years ago by arma

Replying to karsten:

But I think we'll have to create separate options for these things, rather than magically changing timings when TestingTorNetwork is set.

Why is this? I think when TestingTorNetwork is set, we will want the defaults for these timings to be the reduced ones. Otherwise we'll have to have all our instructions tell you to add the following three lines, not one line, to your torrc when running with a testing network.

If people want to run a realistic Tor network, and have it be realistic about timings too, then imo they need to turn TestingTorNetwork off in order to get their realism. I don't see how we can crank down some of the time parameters, and not others, and still make claims that you're getting like-real behavior.

comment:26 in reply to: ↑ 25 Changed 4 years ago by karsten

Replying to arma:

Replying to karsten:

But I think we'll have to create separate options for these things, rather than magically changing timings when TestingTorNetwork is set.

Why is this? I think when TestingTorNetwork is set, we will want the defaults for these timings to be the reduced ones. Otherwise we'll have to have all our instructions tell you to add the following three lines, not one line, to your torrc when running with a testing network.

Ah, what I meant to say was that there should be one new config option per newly configurable timing value and that they should all be set to reasonable testing defaults when TestingTorNetwork is set. TestingTorNetwork shouldn't magically set anything that cannot be changed by setting other config options afterwards. So, one torrc line for most people, not four.

If people want to run a realistic Tor network, and have it be realistic about timings too, then imo they need to turn TestingTorNetwork off in order to get their realism. I don't see how we can crank down some of the time parameters, and not others, and still make claims that you're getting like-real behavior.

Agreed.

Will reply to your earlier comments once I have better results. Still trying to reproduce the issue somehow.

comment:27 Changed 4 years ago by karsten

Okay, I'm starting to see the issue. I have a Shadow network with five authorities that start at minutes 0, 5, 10, 15, and 25, and a client that starts at minute 0. With improved logging, I can now see how the client uses a schedule of 0, 60, 300, ... seconds to retry failed downloads.

Here's what confused me in the first place. Every 180 seconds, a client says "Application request when we haven't used client functionality lately. Optimistically trying directory fetches again." in its logs, resets its download schedules, and tries all directory fetches again. Non-clients don't do that, probably because there's no Shadow thing making an application request. That's bad, because then they don't cache directory information. However, this 180-second behavior made the problem look not too bad, because 3 or even 10 minutes aren't much on a 5-minute consensus schedule. But I think we need to fix this.

I'll look more into the issue tomorrow and hopefully come up with a patch.

comment:28 follow-up: Changed 4 years ago by karsten

I wrote a patch that makes retry schedules and that one constant configurable. I also see how the client says it uses the new retry values. But bootstrapping is still slower than expected. Could it be that there's something else not making directory requests even though our retry schedule would permit it?

comment:29 follow-up: Changed 4 years ago by arma

For your patch, the config options should not be called TestingFoo since they are used (albeit with different values) even when not testing.

comment:30 in reply to: ↑ 28 ; follow-up: Changed 4 years ago by arma

Replying to karsten:

But bootstrapping is still slower than expected. Could it be that there's something else not making directory requests even though our retry schedule would permit it?

Can you provide specifics on timings? That should help us look for what other times might be hard-coded too high.

comment:31 in reply to: ↑ 30 Changed 4 years ago by karsten

Replying to arma:

Replying to karsten:

But bootstrapping is still slower than expected. Could it be that there's something else not making directory requests even though our retry schedule would permit it?

Can you provide specifics on timings? That should help us look for what other times might be hard-coded too high.

Here's the client log (2.8M): https://people.torproject.org/~karsten/volatile/scallion-webclient1.log

comment:32 in reply to: ↑ 29 ; follow-up: Changed 4 years ago by karsten

Replying to arma:

For your patch, the config options should not be called TestingFoo since they are used (albeit with different values) even when not testing.

I don't recall the exact reasons behind calling config options TestingFoo, but wasn't it something about not being able to change config values unless TestingTorNetwork is set? There are other config options, like TestingAuthDirTimeToLearnReachability, which are used (with different values) in normal operation, but which can be changed in testing mode. I can change this, but I'm not yet convinced it shouldn't be TestingFoo.

comment:33 in reply to: ↑ 32 Changed 4 years ago by arma

Replying to karsten:

There are other config options, like TestingAuthDirTimeToLearnReachability, which are used (with different values) in normal operation, but which can be changed in testing mode.

Ah. That one has an explicit check in config.c:

  if (options->TestingAuthDirTimeToLearnReachability != 30*60 &&
      !options->TestingTorNetwork && !options->UsingTestNetworkDefaults_) {
    REJECT("TestingAuthDirTimeToLearnReachability may only be changed in "
           "testing Tor networks!");

comment:34 Changed 4 years ago by karsten

Yes, there should be similar checks for the new options, but I didn't find an easy way to compare smartlists for equality.
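A straightforward element-by-element comparison would do the job; here is a sketch (assuming Tor's smartlist_len()/smartlist_get() API; presumably the smartlist_ints_eq() helper mentioned in the review below is what fills this role in the branch):

/* Sketch: compare two smartlists of (int *) element by element. */
static int
smartlist_ints_eq(const smartlist_t *sl1, const smartlist_t *sl2)
{
  int i;
  if (smartlist_len(sl1) != smartlist_len(sl2))
    return 0;
  for (i = 0; i < smartlist_len(sl1); ++i) {
    const int *a = smartlist_get(sl1, i);
    const int *b = smartlist_get(sl2, i);
    if (*a != *b)
      return 0;
  }
  return 1;
}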

But more importantly, did the client log provide any insights what other times we need to lower?

comment:35 Changed 4 years ago by karsten

I think I found out what prevented us from making new directory requests every 60 seconds as defined in the lowered refetch schedule: DIR_CONN_MAX_STALL, a hard-coded 5-minute timeout for letting a directory connection stall before expiring it. We'll probably need to lower this timeout to, say, 30 seconds in a testing network.

Also, I found a problem caused by the lowered refetch schedules that we need to fix: there are (at least) four constants defining how many times we try a descriptor download before giving up: CONSENSUS_NETWORKSTATUS_MAX_DL_TRIES, MAX_ROUTERDESC_DOWNLOAD_FAILURES, MAX_MICRODESC_DOWNLOAD_FAILURES, and MAX_CERT_DL_FAILURES. They're all set to 8, but we'll probably want to set them to 80 or 100 in a testing network. If we don't, then especially with retries every 60 seconds we can easily hit 7, 8, or 9 failed attempts and stop trying. This is particularly bad on non-clients, which aren't triggered by an external application to reset the download schedule. In my experiments, some nodes didn't bootstrap at all within 1 hour when these constants were left at 8.

If you like these changed constants in a testing network, I'll include them as five new torrc options in my patch.
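To make those numbers concrete, the changes would amount to something like this (illustrative values only, not the merged patch; the stock values are noted in the comments):

#define DIR_CONN_MAX_STALL (30)                  /* stock: 5*60 seconds */
#define CONSENSUS_NETWORKSTATUS_MAX_DL_TRIES 80  /* stock: 8 */
#define MAX_ROUTERDESC_DOWNLOAD_FAILURES 80      /* stock: 8 */
#define MAX_MICRODESC_DOWNLOAD_FAILURES 80       /* stock: 8 */
#define MAX_CERT_DL_FAILURES 80                  /* stock: 8 */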

Want to see some results? When I use the lowered refetch schedules and set the constants above as described, I get the following "Bootstrapped 100%" lines:

0:0:11:753547 [thread-0] 0:15:3:173889106 [scallion-message] [1uthority-73.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:11:763847 [thread-0] 0:15:3:289621246 [scallion-message] [2uthority-74.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:11:792273 [thread-0] 0:15:3:673929774 [scallion-message] [3uthority-75.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:12:794207 [thread-0] 0:15:21:967156005 [scallion-message] [nonexit2-84.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:12:859044 [thread-0] 0:15:22:882704137 [scallion-message] [exit1-78.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:13:226706 [thread-0] 0:15:28:358539690 [scallion-message] [webclient1-88.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:14:458496 [thread-0] 0:15:53:001822682 [scallion-message] [exit5-82.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:14:471866 [thread-0] 0:15:53:334226076 [scallion-message] [exit4-81.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:14:536720 [thread-0] 0:15:54:845187276 [scallion-message] [nonexit4-86.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:14:648466 [thread-0] 0:15:55:760034257 [scallion-message] [nonexit5-87.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:16:346193 [thread-0] 0:16:39:063729918 [scallion-message] [exit3-80.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:17:050978 [thread-0] 0:16:53:675256680 [scallion-message] [nonexit3-85.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:17:401589 [thread-0] 0:16:59:462751388 [scallion-message] [exit2-79.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:17:542252 [thread-0] 0:17:0:740213796 [scallion-message] [nonexit1-83.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:19:847435 [thread-0] 0:17:38:479260129 [scallion-message] [4uthority-76.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:23:362068 [thread-0] 0:20:14:179430946 [scallion-message] [5uthority-77.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.

The second timestamp in these log lines is simulated time (0:15:3:173889106 in the first line). The very first consensus in this network is published at 0:15:0. That means that the longest time to bootstrap is 2:38 minutes. 5uthority starts at time 0:20:2 and takes 12 seconds to bootstrap. This seems very acceptable. For comparison, here's the situation before changing the five constants above:

0:0:11:532841 [thread-0] 0:15:3:107393402 [scallion-message] [1uthority-73.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:11:533525 [thread-0] 0:15:3:127368755 [scallion-message] [2uthority-74.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:11:675343 [thread-0] 0:15:4:268884973 [scallion-message] [3uthority-75.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:13:908788 [thread-0] 0:18:28:931568544 [scallion-message] [nonexit4-86.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:13:929487 [thread-0] 0:18:29:280680071 [scallion-message] [exit1-78.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:14:414258 [thread-0] 0:18:45:522758087 [scallion-message] [4uthority-76.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:16:414355 [thread-0] 0:20:13:955995654 [scallion-message] [5uthority-77.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:16:651176 [thread-0] 0:20:21:901359712 [scallion-message] [nonexit1-83.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:16:804762 [thread-0] 0:20:24:057107728 [scallion-message] [nonexit5-87.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:16:870519 [thread-0] 0:20:27:890874732 [scallion-message] [exit2-79.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:19:363695 [thread-0] 0:23:23:469750053 [scallion-message] [exit4-81.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:19:920079 [thread-0] 0:24:20:250686967 [scallion-message] [exit5-82.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:20:673098 [thread-0] 0:24:49:322543288 [scallion-message] [webclient1-88.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:21:749740 [thread-0] 0:25:38:208126411 [scallion-message] [nonexit3-85.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.

And finally, here's the situation before applying any patch:

0:0:11:448903 [thread-0] 0:15:2:582133247 [scallion-message] [3uthority-75.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:11:526322 [thread-0] 0:15:3:715646237 [scallion-message] [1uthority-73.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:11:527393 [thread-0] 0:15:3:736179239 [scallion-message] [2uthority-74.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:13:131711 [thread-0] 0:16:38:776506156 [scallion-message] [4uthority-76.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:13:502492 [thread-0] 0:17:20:866255937 [scallion-message] [nonexit2-84.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:13:580818 [thread-0] 0:17:22:338375323 [scallion-message] [exit1-78.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:13:709414 [thread-0] 0:17:23:833521695 [scallion-message] [nonexit5-87.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:15:607681 [thread-0] 0:19:12:118365753 [scallion-message] [exit3-80.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:15:917059 [thread-0] 0:19:19:662494634 [scallion-message] [exit5-82.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:17:402870 [thread-0] 0:20:14:413988393 [scallion-message] [5uthority-77.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:17:694102 [thread-0] 0:20:20:697006789 [scallion-message] [exit4-81.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:19:807030 [thread-0] 0:23:26:615529962 [scallion-message] [nonexit1-83.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:20:406078 [thread-0] 0:24:24:836237885 [scallion-message] [exit2-79.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:21:459315 [thread-0] 0:26:10:050116590 [scallion-message] [webclient1-88.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:25:253972 [thread-0] 0:29:31:480640102 [scallion-message] [nonexit4-86.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.
0:0:35:308164 [thread-0] 0:38:43:258560918 [scallion-message] [nonexit3-85.1.0.0] [intercept_logv] [notice] Bootstrapped 100%: Done.

comment:36 Changed 4 years ago by karsten

I should have noted that in the second result (before changing the five constants) only 14 of 16 nodes managed to bootstrap. That is because the max-download constants were not raised from 8 to 64 in that case.

comment:37 follow-up: Changed 4 years ago by nickm

  • Milestone changed from Tor: 0.2.4.x-final to Tor: 0.2.5.x-final

This approach seems plausible. It's new-feature-like enough that 0.2.4 doesn't seem like a plausible target at this point though IMO. I'd suggest making sure that the options can only be set when TestingTorNetwork is 1.

comment:38 in reply to: ↑ 37 ; follow-up: Changed 4 years ago by karsten

  • Status changed from new to needs_review

Replying to nickm:

This approach seems plausible.

Great! Please review branch task-6752 in my public repository.

This branch is based on maint-0.2.3, because I couldn't test an 0.2.4 or higher branch in Shadow. I can rebase to master if necessary.

This branch should also be squashed before merging, because it first adds some logging statements and later removes them. I wasn't sure if you reviewed parts of this branch before, so I didn't squash commits yet.

It's new-feature-like enough that 0.2.4 doesn't seem like a plausible target at this point though IMO.

Makes sense.

I'd suggest making sure that the options can only be set when TestingTorNetwork is 1.

Yes, I think I did that.

Thanks!

comment:39 follow-up: Changed 4 years ago by nickm

I don't think I've reviewed this before; a squashed alternative would be neat.

comment:40 in reply to: ↑ 39 Changed 4 years ago by karsten

Replying to nickm:

I don't think I've reviewed this before; a squashed alternative would be neat.

Sure, please see branch task-6752-2 in my public repo.

comment:41 in reply to: ↑ 38 Changed 4 years ago by robgjansen

Replying to karsten:

This branch is based on maint-0.2.3, because I couldn't test an 0.2.4 or higher branch in Shadow. I can rebase to master if necessary.

As of today, Shadow should work with all Tor versions through 0.2.4.9-alpha.

comment:42 Changed 4 years ago by karsten

Please review branch task-6752-3 in my public repo. Rebased to master and tested using Shadow.

comment:43 follow-up: Changed 4 years ago by nickm

  • Status changed from needs_review to needs_revision

Looks good; I want to get this in.

Quick review:

  • The documentation for smartlist_ints_eq needs to say that the smartlists are lists of pointer to int.
  • There is a huge amount of new boilerplate copy-and-paste code in options_validate(). Please use macros or functions to avoid piles of duplicated code?
  • You added a new argument to options_validate but didn't document how it works.
  • Prefer smartlist_add_asprintf() to tor_asprintf(); smartlist_add().
  • find_dl_schedule_and_len should document its arguments and return types. (Yes, I know it didn't before, but it should.)
  • Shouldn't find_dl_schedule return a const smartlist_* ?
  • It's weird that this function accesses Testing* options, but it's always used regardless of whether we're running in testing mode or not. I don't much like having Testing* mean "a variable that you can only change when Testing* is on, but which is used in all cases." Can we come up with some better way to do this?

comment:44 in reply to: ↑ 43 Changed 4 years ago by karsten

  • Status changed from needs_revision to needs_review

Replying to nickm:

Looks good; i want to get this in.

Great! Thanks for the code review! See comments below.

Quick review:

  • The documentation for smartlist_ints_eq needs to say that the smartlists are lists of pointer to int.
  • There is a huge amount of new boilerplate copy-and-paste code in options_validate(). Please use macros or functions to avoid piles of duplicated code?
  • You added a new argument to options_validate but didn't document how it works.
  • Prefer smartlist_add_asprintf() to tor_asprintf(); smartlist_add().
  • find_dl_schedule_and_len should document its arguments and return types. (Yes, I know it didn't before, but it should.)
  • Shouldn't find_dl_schedule return a const smartlist_* ?

Fixed all of these, I think. Please see my updated branch.

  • It's weird that this function accesses Testing* options, but it's always used regardless of whether we're running in testing mode or not. I don't much like having Testing* mean "a variable that you can only change when Testing* is on, but which is used in all cases." Can we come up with some better way to do this?

I agree that it feels strange to access Testing* options in code even in non-testing mode. So, what we could do is remove the Testing* part of config option names and still require that they can only be changed when TestingTorNetwork is set; if you want me to do that, can you suggest how to tweak CHECK_DEFAULT to include the arg string in the message?
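(One possible tweak, for illustration: stringify the macro argument so the rejection message names the option. This is a sketch only; it assumes CHECK_DEFAULT wraps config.c's config_is_same() helper and the REJECT pattern quoted in comment:33.)

/* Sketch only: #arg stringifies the option name into the error message. */
#define CHECK_DEFAULT(arg)                                            \
  STMT_BEGIN                                                          \
    if (!options->TestingTorNetwork &&                                \
        !options->UsingTestNetworkDefaults_ &&                        \
        !config_is_same(&options_format, options,                     \
                        default_options, #arg)) {                     \
      REJECT(#arg " may only be changed in testing Tor networks!");   \
    }                                                                 \
  STMT_END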

However, from a user point of view, it's quite useful to realize quickly that one cannot change a config option unless running a testing network. The Testing prefix does a good job there, I think.

I'm not sure what other ways there might be. Hmm.

comment:45 follow-up: Changed 4 years ago by nickm

  • Status changed from needs_review to needs_revision

Looking at updated branch:

  • The claim "does not differ from <b>default_options</b> unless in testing Tor networks" in the new options_validate operation is wrong. It is okay for most options to differ from the defaults, but this documentation implies that you can't set any options unless TestingTorNetwork is set.

Otherwise looks okay.

I'm not sure what other ways there might be. Hmm.

I'll open a new ticket for this.

comment:46 in reply to: ↑ 45 Changed 4 years ago by karsten

  • Status changed from needs_revision to needs_review

Replying to nickm:

Looking at updated branch:

  • The claim "does not differ from <b>default_options</b> unless in testing Tor networks" in the new options_validate operation is wrong. It is okay for most options to differ from the defaults, but this documentation implies that you can't set any options unless TestingTorNetwork is set.

True. I tried a better, more correct phrasing. Please see my updated branch.

Otherwise looks okay.

Great!

I'm not sure what other ways there might be. Hmm.

I'll open a new ticket for this.

Okay.

Thanks!

comment:47 Changed 4 years ago by nickm

  • Resolution set to implemented
  • Status changed from needs_review to closed

Okay, merging. Let's see how it goes.

comment:48 Changed 3 years ago by arma

For those who have been relying on this new behavior: due to #11679, consensus fetches have been using the descriptor download schedule, not the consensus download schedule, basically this whole time.

Note: See TracTickets for help on using tickets.