Opened 5 months ago

Closed 2 months ago

#24392 closed defect (fixed)

Ignore cached bridge descriptors until we check if they are running

Reported by: teor Owned by: teor
Priority: Medium Milestone: Tor: 0.2.9.x-final
Component: Core Tor/Tor Version: Tor: 0.3.0.1-alpha
Severity: Normal Keywords: 031-backport, regression, tor-bridge-client, s8-errors, review-group-27
Cc: teor, brade, mcs, nickm, asn, catalyst Actual Points: 0.7
Parent ID: #24367 Points: 0.3
Reviewer: Sponsor:

Description

Split off #24367:

[An] underlying issue here is that every use of any_bridge_descriptors_known() in tor is wrong. Instead, we want to know if num_bridges_usable() > 0.

This is a bug introduced in commit 93a8ed3 in 0.3.2.1-alpha by #23347. But it was also a bug in commit eca2a30 in 0.2.0.3-alpha, but we've never encountered it before, because we always retried our bridges immediately (and far too often).

My branch bug24367 at ​https://github.com/teor2345/tor.git passes all of make test-network, but fails some of the guard and microdesc unit tests. This probably means the unit tests were relying on the buggy behaviour, and need to be changed. I might need nickm or asn to help me fix the unit tests, because they wrote them.

Child Tickets

TicketStatusOwnerSummaryComponent
#23524closedteorAvoid crashing when we ask for running bridges, but UseBridges is 0Core Tor/Tor

Change History (29)

comment:1 Changed 5 months ago by teor

Status: assignedneeds_revision

Assigned to nickm for the unit test fixes, like we did in #24367 before it became about Tor Launcher.

comment:2 Changed 5 months ago by teor

In #24367, this patch fixes this notice:

[notice] Ignoring directory request, since no bridge nodes are available yet.

comment:3 Changed 5 months ago by nickm

Status: needs_revisionneeds_review

Okay, I have a fix, but I haven't found time to investigate _why_ it works: see branch bug24367_032 in my public repo.

comment:4 Changed 5 months ago by teor

Status: needs_reviewmerge_ready

It likely always messed with the global state, it's just that we never checked that global state before.

This looks fine, let's get it merged. It won't fix all of #24367, but at least it will make the remaining issues easier to find.

I don't think we should backport: #23347 and #17750 make this bug have no effect in earlier versions.

comment:5 Changed 5 months ago by arma

I'm stuck in a maze of twisty little tickets, all being parents of each other, but: in proposed commit 690f646bf:

diff --git a/src/or/circuituse.c b/src/or/circuituse.c
index aa0df95..8c9859b 100644
--- a/src/or/circuituse.c
+++ b/src/or/circuituse.c
@@ -2084,7 +2084,7 @@ circuit_get_open_circ_or_launch(entry_connection_t *conn,
                "used client functionality lately" :
                "received a consensus with exits",
                options->UseBridges ? "bridges" : "entrynodes");
-      } else if (!options->UseBridges || any_bridge_descriptors_known()) {
+      } else if (!options->UseBridges || num_bridges_usable() > 0) {
         log_fn(severity, LD_APP|LD_DIR,
                "Application request when we haven't %s. "
                "Optimistically trying directory fetches again.",

teor, did you check that you didn't break this functionality? We have broken it several times in the past and taken years to notice.

comment:6 Changed 5 months ago by arma

(#14216 is the ticket where I fixed it last)

comment:7 in reply to:  5 Changed 5 months ago by teor

Replying to arma:

I'm stuck in a maze of twisty little tickets, all being parents of each other

Maybe trac needs a tree view? :-)

, but: in proposed commit 690f646bf:

diff --git a/src/or/circuituse.c b/src/or/circuituse.c
index aa0df95..8c9859b 100644
--- a/src/or/circuituse.c
+++ b/src/or/circuituse.c
@@ -2084,7 +2084,7 @@ circuit_get_open_circ_or_launch(entry_connection_t *conn,
                "used client functionality lately" :
                "received a consensus with exits",
                options->UseBridges ? "bridges" : "entrynodes");
-      } else if (!options->UseBridges || any_bridge_descriptors_known()) {
+      } else if (!options->UseBridges || num_bridges_usable() > 0) {
         log_fn(severity, LD_APP|LD_DIR,
                "Application request when we haven't %s. "
                "Optimistically trying directory fetches again.",

teor, did you check that you didn't break this functionality? We have broken it several times in the past and taken years to notice.

(#14216 is the ticket where I fixed it last)

#14216 was a partial fix. This ticket may be part of a more complete fix.

This functionality has been easy to break *because* of this bug #24392, which goes back to commit eca2a30 in 0.2.0.3-alpha. Checking for cached bridge descriptors is definitely the wrong thing to do, because it can find the wrong type of bridges, and it can find non-running bridges. We need to check for live bridges instead. This fix needs to be applied regardless of any other bugs.

Once it's fixed, we will deal with any remaining issues in #24367, which happened (or was exposed as a bug) due to #23347 and #17750. I'll add a child ticket to #24367 so we test optimistic retries (that ticket is long enough and mentions too many issues as it is).

comment:8 Changed 5 months ago by teor

I opened #24486 to fix this issue. It's in a different part of the code to this change, so let's deal with it separately?

comment:9 Changed 5 months ago by asn

Hello, please let me know if there is something you would like me to do on this ticket and I'll do it. Otherwise, I'll ignore it for now.

comment:10 Changed 5 months ago by nickm

Roger, please let me know if you think I should not merge this fix until #24486 is resolved. Otherwise I plan to merge it for inclusion in the next alpha.

comment:11 Changed 5 months ago by nickm

Keywords: review-group-27 added

comment:12 Changed 4 months ago by teor

Status: merge_readyneeds_revision

I want to check this again.

comment:13 in reply to:  5 Changed 4 months ago by teor

Actual Points: 0.30.5
Status: needs_revisionneeds_review

Replying to arma:

I'm stuck in a maze of twisty little tickets, all being parents of each other, but: in proposed commit 690f646bf:

diff --git a/src/or/circuituse.c b/src/or/circuituse.c
index aa0df95..8c9859b 100644
--- a/src/or/circuituse.c
+++ b/src/or/circuituse.c
@@ -2084,7 +2084,7 @@ circuit_get_open_circ_or_launch(entry_connection_t *conn,
                "used client functionality lately" :
                "received a consensus with exits",
                options->UseBridges ? "bridges" : "entrynodes");
-      } else if (!options->UseBridges || any_bridge_descriptors_known()) {
+      } else if (!options->UseBridges || num_bridges_usable() > 0) {
         log_fn(severity, LD_APP|LD_DIR,
                "Application request when we haven't %s. "
                "Optimistically trying directory fetches again.",

teor, did you check that you didn't break this functionality? We have broken it several times in the past and taken years to notice.

You'll love this one, arma.

It is impossible to break this functionality by changing that condition, because !options->UseBridges is always true at this point in the code.

For an explanation, see the commit message on:

[bug24367_032 e9cb0cd7c8] Simplify some conditionals in circuit_get_open_circ_or_launch()

I also added a flag to num_bridges_usable() that lets us count "definitely reachable" and "maybe reachable or definitely reachable" bridges. I think that deals with some of the issues in #24367:

[bug24367_032 bca43d548a] Make sure bridges are definitely running before delaying directory fetches

We might still be missing a call to download_status_reset() somewhere, but I had a quick look, and it seemed ok.

comment:14 Changed 4 months ago by teor

Oh, the branch here is my bug24367_032 on github.

comment:15 Changed 4 months ago by teor

Keywords: 030-backport 031-backport added
Version: Tor: 0.3.2.1-alphaTor: 0.3.0.1-alpha

I am no longer convinced that this bug is restricted to 0.3.2, we should consider backportong these changes to 0.3.0 as a precautionary measure.

comment:16 Changed 4 months ago by teor

Owner: changed from nickm to teor
Status: needs_reviewassigned

nickm fixed the unit test issue here, I wrote the rest of the code.

comment:17 Changed 4 months ago by teor

Status: assignedneeds_review

comment:18 Changed 4 months ago by nickm

Hi! I have some questions, that are not necessarily blockers.

In the new commit bca43d548ad7301fc3698e8ded0a94d43a41858d:

-    int first = num_bridges_usable() <= 1;
+    /* Retry directory downloads whenever we get a bridge descriptor:
+     * - when bootstrapping, and
+     * - when we aren't sure if any of our bridges are reachable.
+     * Keep on retrying until we have at least one reachable bridge. */
+    int first = num_bridges_usable(0) < 1;

Do we know why this was <= before? It looks like I introduced the code in bca43d548ad7301fc3698e8ded0a94d43a41858d, but I don't understand what my original rationale was, so I'm a little uncertain here. I'll try to see if I can puzzle this out.

Second question: In e9cb0cd7c8fe03dbd7e0fef91ab868cb56f280b0, when we do

-      } else if (!options->UseBridges || num_bridges_usable() > 0) {
+      } else {

can we add a tor_nonfatal_assert() there to make sure that the second case really and truly only happens when we think it does?

Last:

I am no longer convinced that this bug is restricted to 0.3.2, we should consider backporting these changes to 0.3.0 as a precautionary measure.

This is a pretty complex set of changes, and the rebase isn't clean. Is there a minimal set of changes that, we could put into a branch based on 0.3.0?

And final question: how is the testing on this?

comment:19 in reply to:  18 ; Changed 4 months ago by teor

Status: needs_reviewneeds_revision

Replying to nickm:

Hi! I have some questions, that are not necessarily blockers.

In the new commit bca43d548ad7301fc3698e8ded0a94d43a41858d:

-    int first = num_bridges_usable() <= 1;
+    /* Retry directory downloads whenever we get a bridge descriptor:
+     * - when bootstrapping, and
+     * - when we aren't sure if any of our bridges are reachable.
+     * Keep on retrying until we have at least one reachable bridge. */
+    int first = num_bridges_usable(0) < 1;

Do we know why this was <= before? It looks like I introduced the code in bca43d548ad7301fc3698e8ded0a94d43a41858d, but I don't understand what my original rationale was, so I'm a little uncertain here. I'll try to see if I can puzzle this out.

I suspect that it was num_bridges_usable() <= 1 because in the old code that meant:
"keep on trying until we have more than one bridge that is maybe or definitely reachable".
This isn't quite right, because two maybe-reachable bridges can still be useless to a client.

But now num_bridges_usable(0) < 1 means:
"keep on trying until we have one bridge that is definitely reachable".
I think that's what we actually want, and I added the comment for future reference.

Second question: In e9cb0cd7c8fe03dbd7e0fef91ab868cb56f280b0, when we do

-      } else if (!options->UseBridges || num_bridges_usable() > 0) {
+      } else {

can we add a tor_nonfatal_assert() there to make sure that the second case really and truly only happens when we think it does?

Yes, I'll do this in the morning.

Last:

I am no longer convinced that this bug is restricted to 0.3.2, we should consider backporting these changes to 0.3.0 as a precautionary measure.

This is a pretty complex set of changes, and the rebase isn't clean. Is there a minimal set of changes that, we could put into a branch based on 0.3.0?

I'll look into this in the morning.
I think we only need the int first = num_bridges_usable(0) < 1; change.

Because } else if (!options->UseBridges || num_bridges_usable() > 0) { is redundant, and the final change is on code that didn't exist before 0.3.2.

And final question: how is the testing on this?

It passes all of make test-network-all, and gk likes it better than any of the previous versions when testing bridge switches in Tor Browser. But there's still a bug in #23524 where an old cached bridge descriptor / state file entry that is no longer in the torrc still forms part of the set of guards.

I was hoping someone else would have a look at that one, I'm not that familiar with the new guard code.

comment:20 in reply to:  19 Changed 4 months ago by teor

Status: needs_revisionneeds_review

Replying to teor:

Replying to nickm:

Hi! I have some questions, that are not necessarily blockers.

...

Second question: In e9cb0cd7c8fe03dbd7e0fef91ab868cb56f280b0, when we do

-      } else if (!options->UseBridges || num_bridges_usable() > 0) {
+      } else {

can we add a tor_nonfatal_assert() there to make sure that the second case really and truly only happens when we think it does?

Yes, I'll do this in the morning.

Done and pushed in:

[bug24367_032 42a77862ab] fixup! Simplify some conditionals in circuit_get_open_circ_or_launch()

Waiting for it to compile.

Last:

I am no longer convinced that this bug is restricted to 0.3.2, we should consider backporting these changes to 0.3.0 as a precautionary measure.

This is a pretty complex set of changes, and the rebase isn't clean. Is there a minimal set of changes that, we could put into a branch based on 0.3.0?

I'll look into this in the morning.
I think we only need the int first = num_bridges_usable(0) < 1; change.

Because } else if (!options->UseBridges || num_bridges_usable() > 0) { is redundant, and the final change is on code that didn't exist before 0.3.2.

I will work on this now.

And final question: how is the testing on this?

It passes all of make test-network-all, and gk likes it better than any of the previous versions when testing bridge switches in Tor Browser.

Now the testing also has Travis CI™.base

comment:21 Changed 4 months ago by teor

Milestone: Tor: 0.3.2.x-finalTor: 0.3.0.x-final
Status: needs_reviewneeds_revision

nickm squashed and merged to 0.3.2 and master.

Leaving this open for a potential backport to 0.3.0 and 0.3.1.

comment:22 Changed 4 months ago by teor

Actual Points: 0.50.7
Status: needs_revisionneeds_review

Please see my branch bug24367_030.

It's a one-line change that makes us stop fetching directory documents when we have no usable bridges, rather than no cached bridges. It also backports a downgrade from assert to BUG() in num_usable_bridges().

I didn't backport the other changes:

  • the download schedule change was based on code added in 0.3.2
  • the added argument to num_bridges_usable() adds complexity, and I don't think we need it in 0.3.0, because 0.3.0 checks bridge usability much more often than 0.3.2. (Unnecessarily often, in some cases.)

The fix has passed travis, and is cooking locally.
I am pretty happy that it will pass, I'll let you know if anything goes wrong.

comment:23 Changed 3 months ago by teor

Status: needs_reviewmerge_ready

comment:24 Changed 2 months ago by nickm

Keywords: 030-backport removed

Remove 030-backport from all open tickets that have it: 0.3.0 is now deprecated.

comment:25 Changed 2 months ago by nickm

Milestone: Tor: 0.3.0.x-final

Move all still-open 0.3.0 tickets to 0.2.9

comment:26 Changed 2 months ago by nickm

Milestone: Tor: 0.2.9.x-final

comment:27 Changed 2 months ago by nickm

Hm. Do you still think this is worth backporting? If so, how far? It does seem unlikely to cause trouble...

comment:28 Changed 2 months ago by teor

The impact of this issue is that some bridge clients on 0.3.1 will try to fetch directory documents even when they have no usable bridges.

I don't think we need to backport this fix, at worst, it could cause bridge clients to reach their download limits, and need a restart to reach the network again.

comment:29 Changed 2 months ago by teor

Resolution: fixed
Status: merge_readyclosed
Note: See TracTickets for help on using tickets.