Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#19969 closed defect (fixed)

tor client does not immediately open new circuits after standby

Reported by: weasel Owned by:
Priority: High Milestone: Tor: 0.2.8.x-final
Component: Core Tor/Tor Version: Tor: 0.2.8.6
Severity: Normal Keywords: regression 029-backport 028-backport
Cc: viktor_jaegerskuepper@… Actual Points:
Parent ID: Points: 2
Reviewer: Sponsor:

Description

Viktor writes via https://bugs.debian.org/835119:

I use tor only as a client to connect icedove to the tor network with
the extension Torbirdy (on port 9050). With the tor version 0.2.8.6 I
can't immediately connect to any mail server or news feed after the pc
woke up from standby ("long" time in standby) and I started icedove. I
have to wait for several minutes in order to connect successfully, but
the timespan seems to be random. This does not occur after a (re)boot.
The first version I remember to have this issue is 0.2.8.6-2, I did an
upgrade from 0.2.7.6-1 to 0.2.8.6-2, so I skipped the alpha and rc
versions and the first upload to unstable. I am very sure that the issue
didn't occur in version 0.2.7.6-1 which I used for several months. I can
exclude network connectivity problems because e.g. I can immediately
start the Tor Browser after standby.

Today I purged tor, installed version 0.2.7.6-1, copied the old "state"
file to /var/lib/tor, and set the pc in standby mode for a couple of
minutes. After waking up from standby I immediately tried to connect to
a mail server which worked. Then I upgraded step by step to every
version of tor 0.2.8 which I could find on snapshot.debian.org and tried
to connect to a mail server immediately after waking up from standby.
Unfortunately I could not reproduce the bug then. Finally with version
0.2.8.6-3 the bug occured again, but only after a "long" standby time
(almost 90 minutes).

Attached are two log files from the weekend and the complete log from
today after the installation of version 0.2.7.6-1.

As you can see, the bug is not easily reproducible, and the logs don't
show any particular reason for why tor does not open new circuits
immediately. Please tell me what I can do to give you more information
about the bug.

Child Tickets

Change History (54)

comment:1 Changed 3 years ago by arma

Keywords: regression added

I am guessing the Orbot folks will find this bug really important.

comment:2 Changed 3 years ago by nickm

Milestone: Tor: 0.2.8.x-final

comment:3 Changed 3 years ago by lunar

This seems to happen much more often with only one bridge configured. Could it be related to #19989?

comment:4 Changed 3 years ago by viktorj

I am the original bug reporter. Over the last weeks I used version 0.2.7.6 and I did not ran into the timeout in thunderbird (I guess this is 2 minutes) and so had to wait for tor opening circuits. However tor still did not open new circuits immediately after standby. Until today I never saw this as a bug, but now I think tor should open new circuits immediately as it does after a (re)boot. Today I tested version 0.2.4.27 and it behaved just like 0.2.7.6.

So in conclusion there is a regression in 0.2.8, but from my pov the bug was there for a long time.

comment:5 Changed 3 years ago by arma

Ok, it just happened to me (running "Tor 0.2.9.3-alpha-dev (git-bfaded9143d127cb)"). My tor client has been on an airplane all day, firewalled from the internet. Now I made a request to it, and it just sat there -- it received the request but did not attempt to make any circuits or anything.

I connected over the control port and asked it 'getinfo circuit-status' and it has no circuits open.

It is *supposed* to say

        log_fn(severity, LD_APP|LD_DIR,
               "Application request when we haven't %s. "
               "Optimistically trying directory fetches again.",
               !router_have_minimum_dir_info() ?
               "used client functionality lately" :
               "received a consensus with exits");

and then start making circuits and stuff.

It's weird that it didn't even try to make a circuit for the new stream request though.

comment:6 Changed 3 years ago by arma

Next piece of the mystery:

Oct 25 18:51:45.580 [info] routerlist_remove_old_routers(): We have 0 live routers and 0 old router descriptors.
Oct 25 21:58:43.520 [notice] Your system clock just jumped 11174 seconds forward; assuming established circuits no longer work.
Oct 25 21:58:54.825 [info] routerlist_remove_old_routers(): We have 0 live routers and 0 old router descriptors.

How come my Tor client didn't spring into action after it noticed the clock jump?

comment:7 Changed 3 years ago by teor

In 0.2.8, we redesigned the timer table.
And we redesigned how we retrieve new consensuses when our consensus has expired.
Maybe we need to tweak one or more of these things.

I think we implemented expotential backoff in 0.2.9. Not sure if that will make it better or worse.

comment:8 Changed 3 years ago by dgoulet

Milestone: Tor: 0.2.8.x-finalTor: 0.3.0.x-final
Priority: MediumHigh

comment:9 Changed 3 years ago by viktorj

Cc: viktor_jaegerskuepper@… added

comment:10 Changed 3 years ago by arma

Points: 2

Ok, I spent a while investigating this ticket yesterday. For context, the Tor client on my travel laptop is still stuck in this bug, so I can poke it in various ways to help debug if we think of useful ways. Here is a checkpoint of my thinking process.

There are two mysteries here:

Mystery 1) Why does my Tor client not care to get any new directory information?

It's October 29 now, and my tor's datadirectory contains:

-rw-------  1 arma arma 1405537 Oct 21 12:14 cached-microdesc-consensus
-rw-------  1 arma arma 3473955 Oct 21 12:43 cached-microdescs
-rw-------  1 arma arma       0 Oct 21 12:43 cached-microdescs.new

After one of the clock jump events, my Tor tried to fetch a consensus from a relay in the fallbackdir list, then it fell back to trying to fetch a consensus from each of the directory authorities, and when one attempt failed for each of those, it seems to have permanently given up -- and that was days ago. I'm going to leave this mystery for later -- somebody please feel free to pick it up in the meantime.

Mystery 2) Why does my Tor client not recognize the new application (socks) requests and jumpstart itself into trying to fetch new directory stuff, make new circuits, and so on?

My investigations led me to be suspicious of Tor commit b1d56fc58.

Background: in the past a new socks request would come in, and connection_ap_handshake_rewrite_and_attach() would get called on it, which would call connection_ap_handshake_attach_circuit(), which would call circuit_get_open_circ_or_launch(), and that function is the one that does the "omg I don't have enough directory information and I just got a socks request, I should reset everything so we can try again" logic.

But in commit b1d56fc58, we made connection_ap_handshake_rewrite_and_attach() no longer attach it, but rather it now calls connection_ap_mark_as_pending_circuit(), which effectively queues the stream for handling the next time the main loop runs. That function adds the stream to the pending_entry_connections smartlist, and sets untried_pending_connections to 1, and returns.

Then run_main_loop_once() calls connection_ap_attach_pending(0), which goes through and calls connection_ap_handshake_attach_circuit() on each stream that needs it.

But when I gdb attach to my unhappy Tor client and set a breakpoint on connection_ap_attach_pending and then do a new socks request, it seems that it never calls the function!

The earlier parts of connection_ap_handshake_rewrite_and_attach() *are* being called, for example rep_hist_note_used_port() gets called just fine. So I think the stream is being added to pending_entry_connections, and then somehow things aren't working after that.

comment:11 in reply to:  10 ; Changed 3 years ago by teor

Replying to arma:

Ok, I spent a while investigating this ticket yesterday. For context, the Tor client on my travel laptop is still stuck in this bug, so I can poke it in various ways to help debug if we think of useful ways. Here is a checkpoint of my thinking process.

There are two mysteries here:

Mystery 1) Why does my Tor client not care to get any new directory information?

It's October 29 now, and my tor's datadirectory contains:

-rw-------  1 arma arma 1405537 Oct 21 12:14 cached-microdesc-consensus
-rw-------  1 arma arma 3473955 Oct 21 12:43 cached-microdescs
-rw-------  1 arma arma       0 Oct 21 12:43 cached-microdescs.new

After one of the clock jump events, my Tor tried to fetch a consensus from a relay in the fallbackdir list, then it fell back to trying to fetch a consensus from each of the directory authorities, and when one attempt failed for each of those, it seems to have permanently given up -- and that was days ago. I'm going to leave this mystery for later -- somebody please feel free to pick it up in the meantime.

This sounds like an issue with the consensus download implementation. Here are my questions:

  • has your tor has marked each of the directory authorities down?
  • has it marked all of the fallback directories down?
  • has it exceeded the download schedule for downloading consensuses?

There are two kinds of download_status_t in 0.2.8, one retries on a schedule regardless of success or failure, and the other retries on failure. It seems there might be a problem with the first kind (which is new in 0.2.8, and used to download the consensus).

I'd like to know what your consensus download_status_t has for all its fields, particularly the attempt and failure counts. It may well have exceeded the 255 failure count limit.

I also wonder how #8625 (commit 09a0f2d) affects this, but it is only in 0.2.9, and I'm not sure it would make it any better.

Mystery 2) Why does my Tor client not recognize the new application (socks) requests and jumpstart itself into trying to fetch new directory stuff, make new circuits, and so on?

My investigations led me to be suspicious of Tor commit b1d56fc58.

Background: in the past a new socks request would come in, and connection_ap_handshake_rewrite_and_attach() would get called on it, which would call connection_ap_handshake_attach_circuit(), which would call circuit_get_open_circ_or_launch(), and that function is the one that does the "omg I don't have enough directory information and I just got a socks request, I should reset everything so we can try again" logic.

But in commit b1d56fc58, we made connection_ap_handshake_rewrite_and_attach() no longer attach it, but rather it now calls connection_ap_mark_as_pending_circuit(), which effectively queues the stream for handling the next time the main loop runs. That function adds the stream to the pending_entry_connections smartlist, and sets untried_pending_connections to 1, and returns.

Then run_main_loop_once() calls connection_ap_attach_pending(0), which goes through and calls connection_ap_handshake_attach_circuit() on each stream that needs it.

But when I gdb attach to my unhappy Tor client and set a breakpoint on connection_ap_attach_pending and then do a new socks request, it seems that it never calls the function!

The earlier parts of connection_ap_handshake_rewrite_and_attach() *are* being called, for example rep_hist_note_used_port() gets called just fine. So I think the stream is being added to pending_entry_connections, and then somehow things aren't working after that.

This might be related to the above bug, or it might not be.
This code would need to reset the authority and fallback failures, and also the download status for the consensus.

comment:12 in reply to:  11 Changed 3 years ago by arma

Replying to teor:

Replying to arma:

Mystery 1) Why does my Tor client not care to get any new directory information?

This sounds like an issue with the consensus download implementation. Here are my questions:

  • has your tor has marked each of the directory authorities down?
  • has it marked all of the fallback directories down?
  • has it exceeded the download schedule for downloading consensuses?

There are two kinds of download_status_t in 0.2.8, one retries on a schedule regardless of success or failure, and the other retries on failure. It seems there might be a problem with the first kind (which is new in 0.2.8, and used to download the consensus).

I'd like to know what your consensus download_status_t has for all its fields, particularly the attempt and failure counts. It may well have exceeded the 255 failure count limit.

I also wonder how #8625 (commit 09a0f2d) affects this, but it is only in 0.2.9, and I'm not sure it would make it any better.

To be clear this is "Tor 0.2.9.3-alpha-dev (git-bfaded9143d127cb)" for me. So we are definitely in 0.2.9 territory.

But, I am going to move the mystery 1 investigation over to #20499, which looks like it is the same mystery over there. Then we can focus on mystery 2 in this ticket -- i.e. regardless of what our directory download situation is, Tor is supposed to wake up and reset stuff when it gets a new stream request and it doesn't have enough directory information to handle it.

comment:13 in reply to:  10 Changed 3 years ago by arma

Replying to arma:

My investigations led me to be suspicious of Tor commit b1d56fc58.
[...]
Then run_main_loop_once() calls connection_ap_attach_pending(0), which goes through and calls connection_ap_handshake_attach_circuit() on each stream that needs it.

But when I gdb attach to my unhappy Tor client and set a breakpoint on connection_ap_attach_pending and then do a new socks request, it seems that it never calls the function!

Still suspicious of commit b1d56fc58.

Is it possible that event_base_loop() in run_main_loop_once() is never finishing for me? After all, it has a bunch of callbacks, like second_elapsed_callback() and check_descriptor_callback() and so on, which all schedule themselves to get put back in the queue once they're done. Does this event_base_loop function keep running until there are no more events?

I see some ways it can choose to finish, like if flags is non-zero, or if connection_start_reading_from_linked_conn() calls tor_event_base_loopexit(), but neither of those apply to me here.

comment:14 Changed 3 years ago by nickm

maybe set a breakpoint on the outside of event_base_loop() and see if we reach there? Though we _should_ be reaching there if we get any incoming connections, since they count as events.

comment:15 Changed 3 years ago by arma

I set breakpoints all over run_main_loop_once() and they never catch. Maybe my gcc has optimized the function so it's hard to gdb on, but it looks suspiciously like event_base_loop(base, 0) just never finishes.

Where in the incoming connections events does it tell event_base_loop that it should finish?

comment:16 Changed 3 years ago by nickm

In theory, calls to event_base_loopbreak() or event_base_loopexit() will make event_base_loop() stop. But yeah, we don't seem to do that at an appropriate time.

I'll write a patch on 0.2.8 to test this.

comment:17 Changed 3 years ago by nickm

Keywords: 029-backport 028-backport added
Status: newneeds_review

Okay; I think that branch bug19969_028 in my public repository may have a fix here. If we take this one, we should add more tickets in 0.3.0 to clean up this code and simplify the control flow; I tried to do this with as little collateral damage as possible.

For a simpler fix, we could add a connection_ap_attach_pending() call to second_elapsed_callback().

comment:18 Changed 3 years ago by arma

Awesome. It builds for me and runs in a trivial test. The code looks plausible -- though I admit that there could be bugs in the way the patch calls the libevent stuff and I wouldn't know it.

I think you're right that the patch should ultimately make its way into 0.2.8 -- this is the current stable, and folks on crummy network connections are really hurting from this regression (if I were them I would be sticking to 0.2.7 at present, which is a poor idea for other reasons).

To be thorough, if this bug is what we think it is, then there are cases where the rest of the things in run_main_loop_once() never get called. Those are:

1)

  /* All active linked conns should get their read events activated,
   * so that libevent knows to run their callbacks. */
  SMARTLIST_FOREACH(active_linked_connection_lst, connection_t *, conn,
                    event_active(conn->read_event, EV_READ, 1));
  called_loop_once = smartlist_len(active_linked_connection_lst) ? 1 : 0;

Does this one matter? It might. But maybe because it's only about linked conns, they will probably only happen when there's other activity too? Worth thinking more about before we brush it off.

2)

  /* Make sure we know (about) what time it is. */
  update_approx_time(time(NULL));

No worries, this one happens in second_elapsed_callback() too.

3)

  connection_ap_attach_pending(0);

which is what your patch is about.

That's all of them, right?

comment:19 in reply to:  18 ; Changed 3 years ago by arma

Replying to arma:

1)

  /* All active linked conns should get their read events activated,
   * so that libevent knows to run their callbacks. */
  SMARTLIST_FOREACH(active_linked_connection_lst, connection_t *, conn,
                    event_active(conn->read_event, EV_READ, 1));
  called_loop_once = smartlist_len(active_linked_connection_lst) ? 1 : 0;

Does this one matter? It might. But maybe because it's only about linked conns, they will probably only happen when there's other activity too? Worth thinking more about before we brush it off.

Ah ha -- I think we're ok here, since active_linked_connection_lst only gets added to during connection_start_reading_from_linked_conn(), and that function has an explicit call to tor_event_base_loopexit() to bust us out of the loop.

That actually gives a suggestion for a third avenue for a fix -- in connection_ap_mark_as_pending_circuit_(), we add one of those calls to tor_event_base_loopexit(). Would that be simpler or am I mistaken?

comment:20 in reply to:  18 ; Changed 3 years ago by teor

Replying to arma:

...
2)

  /* Make sure we know (about) what time it is. */
  update_approx_time(time(NULL));

No worries, this one happens in second_elapsed_callback() too.
...

I wouldn't be so quick to brush this one off.

It could well the cause of #20423, which occurred when a relay operator on FreeBSD/LibreSSL upgraded to 0.2.8, and went away again when they reverted to 0.2.7.

comment:21 in reply to:  19 ; Changed 3 years ago by arma

Replying to arma:

That actually gives a suggestion for a third avenue for a fix -- in connection_ap_mark_as_pending_circuit_(), we add one of those calls to tor_event_base_loopexit(). Would that be simpler or am I mistaken?

My bug19969_028_alt branch implements this option. I found it useful to write to make sure I understood what was going on, so I will be totally fine if you pick your branch as being the less messy / more safe one. :)

comment:22 in reply to:  17 Changed 3 years ago by arma

Replying to nickm:

Okay; I think that branch bug19969_028 in my public repository may have a fix here.

@@ -768,6 +802,7 @@ connection_ap_rescan_and_attach_pending(void)
                pending_entry_connections);
       untried_pending_connections = 1;
       connection_ap_mark_as_pending_circuit(entry_conn);
+      schedule_pending_connections_event();
     }
 
   } SMARTLIST_FOREACH_END(conn);

isn't needed, right, since connection_ap_mark_as_pending_circuit() calls schedule_pending_connections_event() too?

comment:23 in reply to:  20 Changed 3 years ago by arma

Replying to teor:

Replying to arma:

...
2)

  /* Make sure we know (about) what time it is. */
  update_approx_time(time(NULL));

No worries, this one happens in second_elapsed_callback() too.
...

I wouldn't be so quick to brush this one off.

It could well the cause of #20423, which occurred when a relay operator on FreeBSD/LibreSSL upgraded to 0.2.8, and went away again when they reverted to 0.2.7.

I've looked at it some more, and I don't think it should be the cause of anything. The issue is that while we used to think that we update our approx time around once a second and also after a burst of event callbacks finishes, it turns out we only update it around once a second plus during the (relatively rare) times where we have a linked conn that wants to read. So I don't think that should be related to #20423.

comment:24 in reply to:  21 ; Changed 3 years ago by nickm

Replying to arma:

Replying to arma:

That actually gives a suggestion for a third avenue for a fix -- in connection_ap_mark_as_pending_circuit_(), we add one of those calls to tor_event_base_loopexit(). Would that be simpler or am I mistaken?

My bug19969_028_alt branch implements this option. I found it useful to write to make sure I understood what was going on, so I will be totally fine if you pick your branch as being the less messy / more safe one. :)

That's actually a little bad: calling event_base_loopexit() over and over like that is not idempotent. Instead, it creates "exit the loop" events proportional to the number of times you call it. That's not super harmful, but it does mean allocate more RAM and waste more CPU than one might want.

comment:25 Changed 3 years ago by nickm

Milestone: Tor: 0.3.0.x-finalTor: 0.2.9.x-final

comment:26 Changed 3 years ago by nickm

My bug19969_028_small is an alternative minimized version that tries to do this once a second unless the network is disabled.

comment:27 in reply to:  24 Changed 3 years ago by arma

Replying to nickm:

My bug19969_028_alt branch implements this option. I found it useful to write to make sure I understood what was going on, so I will be totally fine if you pick your branch as being the less messy / more safe one. :)

That's actually a little bad: calling event_base_loopexit() over and over like that is not idempotent. Instead, it creates "exit the loop" events proportional to the number of times you call it. That's not super harmful, but it does mean allocate more RAM and waste more CPU than one might want.

Ok, I have added a new commit to bug19969_028_alt that tries to avoid calling event_base_loopexit() when we shouldn't.

Also, I made a bug19969_029_alt branch that cherry-picks these three commits onto maint-0.2.9, since we are considering not putting this version of the patch into 0.2.8.

This is getting messier as I look at it more though. Now when I do a socks request while my Tor doesn't have enough dir info, I actually get k "optimistically trying directory fetches again" lines, where k is the number of streams that are pending each time a new pending one gets added. That's surprising, but I haven't sorted out whether I think it's harmful -- maybe not, because each of them just resets directory fetching stuff, or maybe so, because if they reset the directory fetching stuff, they will clobber any new efforts we made to mark down fallback dirs as actually down? Hm. There is a little part of me that is tempted to try to back out commits b1d56fc589, 84b3350c83, and 58edf926 and call that the 028 solution. But they don't revert cleanly, and I haven't explored further.

comment:28 Changed 3 years ago by arma

In other news, I can confirm that indeed the bottom of run_main_loop_once() is only run when we have an active linked connection. Here is the simple patch that lets you see for yourself:

diff --git a/src/or/main.c b/src/or/main.c
index ebaab5c..70fcc28 100644
--- a/src/or/main.c
+++ b/src/or/main.c
@@ -2576,6 +2576,7 @@ run_main_loop_once(void)
     }
   }
 
+  log_notice(LD_NET, "at bottom of main loop");
   /* This will be pretty fast if nothing new is pending. Note that this gets
    * called once per libevent loop, which will make it happen once per group
    * of events that fire, or once per second. */

So now that we have a bit more clarity, the bug in 0.2.8 (regression over 0.2.7) is two-fold:

A) If !have_dir_info, new stream requests get put onto a pending queue which is never looked at, and eventually they expire. The old behavior was that they would trigger a reset in directory downloads, which would hopefully cause have_dir_info to become true, and they would check once a second if it is.

B) New stream requests get put onto a pending queue which is examined sometime in the next second (each time run_scheduled_events() happens). So on average we have an extra 0.0-1.0 (average 0.5) seconds of latency added to stream requests. The old behavior was to try each new stream immediately.

comment:29 in reply to:  26 Changed 3 years ago by arma

Replying to nickm:

My bug19969_028_small is an alternative minimized version that tries to do this once a second unless the network is disabled.

That is a compellingly small fix.

I bet it would work -- specifically, it would resolve bug A in the previous comment, and would leave bug B in place, and I don't think it would break things much.

I spent a little while pondering yet another alternative approach: trying to draw up the big "if it's not going to work, optimistically try again" clause from circuit_get_open_circ_or_launch() and put it into connection_ap_mark_as_pending_circuit(), so we trigger the directory reset right then, and then we don't need to call those other functions at all when !have_dir_info. But I think that approach would end up too messy also.

One thing I'm still nervous about though is having one fix in 028 and a different one in 029. It would mean that we effectively never test the 028 fix until it gets into the hands of users, who think of it as a stable update. Whereas if put something convincing into 029, and then also had it in 028, we would have a better chance of finding problems with it.

Can we balance that by putting your 028 fix into both 028 *and* 029, and then my 029 fix into just 029? Or will we be happier if we pick just one of them (either one, really) and use it for both backports?

Hm. I feel like I'm overthinking, but also this is complicated code and getting it wrong would be no fun.

I think I favor either (1) putting your 028 into both 028 and 029, and my 029 into 029, or (2) putting my 029 into both 028 and 029. With slight preference for option 2 at present, since (i) the thing that is in 028 would get testing in 029, and (ii) it would resolve bug B as well.

comment:30 Changed 3 years ago by arma

I'm running bug19969_028_small and it seems to work under normal operation.

Oct 31 22:50:11.329 [debug] connection_ap_handshake_attach_circuit(): Attaching apconn to circ 2444682090 (stream 0 sec old).
Oct 31 22:50:12.329 [debug] connection_ap_handshake_attach_circuit(): Attaching apconn to circ 2444682090 (stream 0 sec old).
Oct 31 22:50:38.328 [debug] connection_ap_handshake_attach_circuit(): Attaching apconn to circ 2444682090 (stream 0 sec old).
Oct 31 22:50:39.329 [debug] connection_ap_handshake_attach_circuit(): Attaching apconn to circ 2444682090 (stream 0 sec old).
Oct 31 22:51:05.329 [debug] connection_ap_handshake_attach_circuit(): Attaching apconn to circ 2444682090 (stream 0 sec old).
Oct 31 22:51:06.328 [debug] connection_ap_handshake_attach_circuit(): Attaching apconn to circ 2444682090 (stream 0 sec old).
Oct 31 22:51:32.329 [debug] connection_ap_handshake_attach_circuit(): Attaching apconn to circ 2444682090 (stream 0 sec old).
Oct 31 22:51:33.329 [debug] connection_ap_handshake_attach_circuit(): Attaching apconn to circ 2444682090 (stream 0 sec old).

I wonder if there are any good linking / cookie attacks if I become known as "the guy whose run_scheduled events is offset by 328 milliseconds?" :)

And then running it on a bridge that I can't bootstrap from... I get a growing barrage of these "Optimistically trying directory fetches again" lines, one for every open socks request, every time I get a new socks request. Yucko. In retrospect when we did the 'retry' parameter, we should have put a flag on each entry_connection_t, and then just tried to attach the ones that had the flag set. Easy to say in retrospect.

comment:31 Changed 3 years ago by nickm

I think I favor either (1) putting your 028 into both 028 and 029, and my 029 into 029, or (2) putting my 029 into both 028 and 029. With slight preference for option 2 at present, since (i) the thing that is in 028 would get testing in 029, and (ii) it would resolve bug B as well.

What if we go with option (3) putting both patches in both?

comment:32 in reply to:  31 Changed 3 years ago by teor

Replying to nickm:

I think I favor either (1) putting your 028 into both 028 and 029, and my 029 into 029, or (2) putting my 029 into both 028 and 029. With slight preference for option 2 at present, since (i) the thing that is in 028 would get testing in 029, and (ii) it would resolve bug B as well.

What if we go with option (3) putting both patches in both?

Seems like a good idea.
In general, I don't like the idea of testing different code in 0.2.9 than the code we plan to backport to 0.2.8.

comment:33 Changed 3 years ago by teor

Code reviews:

nickm/bug19969_028_small and arma/bug19969_02[89]_alt seem to be quite small changes.
They make sense, I am a little concerned about the extra events in the arma patch[es].

I wonder if it would be possible to unit test or integration test these?

comment:34 Changed 3 years ago by teor

Oh, and arma/bug19969_02[89]_alt introduce a spacing issue:

void tell_event_loop_to_finish(void)

should be:

void
tell_event_loop_to_finish(void)

comment:35 Changed 3 years ago by teor

Oh, and two compiler warnings:

src/or/main.c:732:6: warning: no previous prototype for 'tell_event_loop_to_finish' [-Wmissing-prototypes]
 void tell_event_loop_to_finish(void)
      ^~~~~~~~~~~~~~~~~~~~~~~~~
src/or/main.c:732:6: warning: no previous prototype for 'tell_event_loop_to_finish' [-Wmissing-prototypes]
 void tell_event_loop_to_finish(void)
      ^~~~~~~~~~~~~~~~~~~~~~~~~

comment:36 in reply to:  35 Changed 3 years ago by teor

Replying to teor:

Oh, and two compiler warnings:

src/or/main.c:732:6: warning: no previous prototype for 'tell_event_loop_to_finish' [-Wmissing-prototypes]
 void tell_event_loop_to_finish(void)
      ^~~~~~~~~~~~~~~~~~~~~~~~~
src/or/main.c:732:6: warning: no previous prototype for 'tell_event_loop_to_finish' [-Wmissing-prototypes]
 void tell_event_loop_to_finish(void)
      ^~~~~~~~~~~~~~~~~~~~~~~~~

Sorry, this was a bad merge on my part.

comment:37 in reply to:  31 ; Changed 3 years ago by arma

Replying to nickm:

What if we go with option (3) putting both patches in both?

My bug19969_028_alt branch now implements this idea.

And my bug19969_028_squashed branch is the one that we would actually want to merge.

comment:38 in reply to:  37 Changed 3 years ago by teor

Replying to arma:

Replying to nickm:

What if we go with option (3) putting both patches in both?

My bug19969_028_alt branch now implements this idea.

And my bug19969_028_squashed branch is the one that we would actually want to merge.

They look good to me, and work in short-term chutney testing.

comment:39 Changed 3 years ago by arma

Status: needs_reviewmerge_ready

(Nick, before you merge be sure to read the new commit message I pretended that you wrote for b2f82d45b, just in case you don't want to have said that.)

comment:40 Changed 3 years ago by nickm

Milestone: Tor: 0.2.9.x-finalTor: 0.2.8.x-final
Resolution: fixed
Status: merge_readyclosed

Seems okay to me. Merging to 028 and forward. THanks!

comment:41 Changed 3 years ago by viktorj

Unfortunately the bug is not fixed for me in 0.2.9.5-alpha. Here is the log after my PC woke up from standby:

Nov 09 13:58:52.000 [notice] Your system clock just jumped 3622 seconds forward; assuming established circuits no longer work.
Nov 09 14:01:09.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
Nov 09 14:01:59.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
Nov 09 14:03:43.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
Nov 09 14:04:18.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:995. Giving up. (waiting for circuit)
Nov 09 14:04:58.000 [notice] No circuits are opened. Relaxed timeout for circuit 21 (a General-purpose client 3-hop circuit in state doing handshakes with channel state open) to 348450ms. However, it appears the circuit has timed out anyway. 14 guards are live.
Nov 09 14:04:59.000 [notice] Tor has not observed any network activity for the past 4200 seconds. Disabling circuit build timeout recording.
Nov 09 14:05:57.000 [notice] Tor now sees network activity. Restoring circuit build timeout recording. Network was down for 4258 seconds during 2 circuit attempts.
Nov 09 14:06:01.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
Nov 09 14:08:10.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
Nov 09 14:09:27.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
Nov 09 14:10:33.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
Nov 09 14:10:47.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Nov 09 14:10:47.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.

During these connection attempts to my mail providers, I could use the Tor Browser for surfing, which in my case directly connects to the Tor network (no bridges or something else), so there should not be a network connectivity problem here.

Since the bug is of course annoying for me, tell me if I can give you further information or if I can help you in any other way.

Last edited 3 years ago by viktorj (previous) (diff)

comment:42 Changed 3 years ago by nickm

Resolution: fixed
Status: closedreopened

comment:43 Changed 3 years ago by arma

viktorj: can you tell us about the state of your network connection during those 12 minutes after your PC left standby? Like, do you think you had a good internet connection immediately, or did it take 5 minutes?

comment:44 Changed 3 years ago by arma

My first thought is that your Tor wasn't asleep long enough for router_have_minimum_dir_info() to become false, so that's why Tor didn't think it was necessary to jump into action when it woke up and saw a new stream request.

comment:45 Changed 3 years ago by arma

In particular, does Tor 0.2.7.6 behave better in the case where your Tor client is asleep for only an hour? It's possible that this edge case was never handled well (and so isn't a regression).

comment:46 in reply to:  43 Changed 3 years ago by viktorj

Replying to arma:

viktorj: can you tell us about the state of your network connection during those 12 minutes after your PC left standby? Like, do you think you had a good internet connection immediately, or did it take 5 minutes?

I think my internet connection was good because I started the Tor Browser during these 12 minutes (maybe 1 minute after starting Thunderbird), it had no problems to connect to the network immediately and I could browse as usual.

I use wifi and have cable internet (~ max. 36 MBit) here in Germany. Today I looked at the connection speed while watching a video in Firefox after I left standby, and I reached about 36 MBit constantly during the background download of the video. I never had problems like slow internet, either it works fast or it doesn't work at all (sometimes for whatever reason, but this doesn't matter here).

comment:47 in reply to:  45 Changed 3 years ago by viktorj

Replying to arma:

In particular, does Tor 0.2.7.6 behave better in the case where your Tor client is asleep for only an hour? It's possible that this edge case was never handled well (and so isn't a regression).

Tor 0.2.7.6 behaves much better in the sense that I never ran into the timeout in Thunderbird, although it usually takes a bit to connect to the mail servers. Here is a log from August when I compared different versions of Tor:

Aug 23 12:19:28.000 [notice] Your system clock just jumped 3622 seconds forward; assuming established circuits no longer work.
Aug 23 12:19:42.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Aug 23 12:19:42.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.

During the last days I always opened Thunderbird immediately after leaving standby and tried to connect to different mail servers, and strangely I didn't run into the Thunderbird timeout sometimes. Here are two examples of that with the shortest and the longest time of standby mode:

Nov 12 17:35:29.000 [notice] Your system clock just jumped 2533 seconds forward; assuming established circuits no longer work.
Nov 12 17:37:35.000 [notice] No circuits are opened. Relaxed timeout for circuit 85 (a General-purpose client 3-hop circuit in state doing handshakes with channel state open) to 125338ms. However, it appears the circuit has timed out anyway. 13 guards are live. [2 similar message(s) suppressed in last 3600 seconds]
Nov 12 17:37:37.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Nov 12 17:37:37.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.

Nov 11 18:11:32.000 [notice] Your system clock just jumped 26150 seconds forward; assuming established circuits no longer work.
Nov 11 18:12:20.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Nov 11 18:12:20.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.

But I got the timeout with shorter times of standby and with times in between the two above. If you are interested in more logs, I can give them to you.

comment:48 Changed 3 years ago by viktorj

It also seems that your patches didn't change anything for me because Tor 0.2.9.1-alpha behaved similarly to Tor 0.2.9.5-alpha, as you can see in these logs:

Aug 24 18:31:01.000 [notice] Your system clock just jumped 4243 seconds forward; assuming established circuits no longer work.
Aug 24 18:33:16.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
Aug 24 18:34:33.000 [notice] No circuits are opened. Relaxed timeout for circuit 29 (a General-purpose client 1-hop circuit in state doing handshakes with channel state open) to 198078ms. However, it appears the circuit has timed out anyway. 12 guards are live.
Aug 24 18:34:36.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Aug 24 18:34:36.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.

Aug 24 20:57:51.000 [notice] Your system clock just jumped 4102 seconds forward; assuming established circuits no longer work.
Aug 24 20:58:11.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Aug 24 20:58:11.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.

So the question probably is what makes the difference here.

comment:49 Changed 3 years ago by viktorj

Today my wifi access point had a problem so that I had no internet access until it was restarted. I rebooted my PC before rebooting the access point because I wanted to be sure that it was not a problem with the PC. With the wifi access point working again I could use Firefox and the Tor Browser, but the separate Tor client (0.2.9.5-alpha) which I use for Thunderbird + Torbirdy did not open a circuit or was even unable to connect to the Tor network, so I restarted the client. Here is the log:

Nov 19 19:20:11.000 [warn] Problem bootstrapping. Stuck at 80%: Connecting to the Tor network. (Network is unreachable; NOROUTE; count 1; recommendation warn; host EBE718E1A49EE229071702964F8DB1F318075FF8 at 131.188.40.188:80)
(85 similar entries)
Nov 19 19:24:56.000 [warn] Problem bootstrapping. Stuck at 80%: Connecting to the Tor network. (Network is unreachable; NOROUTE; count 87; recommendation warn; host 4F219D54CD8C5A5ED5B4D57A101D89BA81527841 at 62.210.106.22:9666)
Nov 19 18:34:53.000 [notice] Your system clock just jumped 3012 seconds backward; assuming established circuits no longer work.
Nov 19 18:34:54.000 [notice] Heartbeat: Tor's uptime is 0:04 hours, with 0 circuits open. I've sent 0 kB and received 0 kB.
Nov 19 18:37:24.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
Nov 19 18:39:27.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:993. Giving up. (waiting for circuit)
Nov 19 18:40:32.000 [notice] Interrupt: exiting cleanly.

Would you be interested in a more detailed log, e.g. for everything that happens after waking up from standy?

comment:50 Changed 3 years ago by teor

I have a tor 0.2.8.9 that is refusing to make any outgoing connections, likely due to this bug.

It happened after I left my laptop asleep for a day (112503 seconds),
then woke it up, then waited a few minutes (151 seconds) before
connecting to a new wifi network.

I have found one workaround:

  • turn on bridges in Tor Browser's settings, wait for the consensus to be downloaded, then turn them off again.

comment:51 Changed 3 years ago by nickm

Does Tor 0.2.8.10 fix this by backporting #19969 ?

comment:52 Changed 3 years ago by nickm

Resolution: fixed
Status: reopenedclosed

comment:53 in reply to:  50 Changed 3 years ago by viktorj

teor: Can you still reproduce this bug with Tor 0.2.8.10 or 0.2.8.11? I would be interested in this because there seem to be very few people using standy mode.

comment:54 in reply to:  51 Changed 3 years ago by viktorj

Replying to nickm:

Does Tor 0.2.8.10 fix this by backporting #19969 ?

I have downgraded Tor from 0.2.9.6-rc to 0.2.8.11 to test this. As I use standby mode almost every day, I should be able to answer this soon.

Note: See TracTickets for help on using tickets.