Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#18616 closed defect (fixed)

Make begindir advertise checks consistent with DirPort checks

Reported by: toralf Owned by: teor
Priority: Medium Milestone: Tor: 0.2.8.x-final
Component: Core Tor/Tor Version: Tor: 0.2.8.1-alpha
Severity: Normal Keywords: regression, must-fix-before-028-rc, TorCoreTeam201605, TorCoreTeam-postponed-201604, review-group-1
Cc: Actual Points: 25 hours
Parent ID: Points: 3
Reviewer: arma,teor Sponsor:

Description (last modified by arma)

This ticket makes sure the checks that Tor does when advertising begindir support are similar to the checks it does when advertising the DirPort.

In particular:

  • bridges should advertise begindir support
  • authorities should always advertise begindir
  • we should never advertise begindir if the network is disabled
  • we should never advertise begindir if we don't have an ORPort (redundant, as we don't post descriptors without an ORPort)
  • relays should handle AccountingMax like they do for DirPort

Child Tickets

TicketStatusOwnerSummaryComponent
#18851closedClarify GETINFO status/reachability-succeeded/dir with no DirPortCore Tor/Tor

Change History (47)

comment:1 Changed 3 years ago by toralf

Ah - and this comes now too

Mar 23 18:39:16.000 [warn] Your server (5.9.158.75:80) has not managed to confirm that its DirPort is reachable. Relays do not publish descriptors until their ORPort and DirPort are reachable. Please check your firewalls, ports, address, /etc/hosts file, etc.
Last edited 3 years ago by toralf (previous) (diff)

comment:2 Changed 3 years ago by nickm

This looks like a duplicate of #18351 ?

comment:3 Changed 3 years ago by teor

This is a genuine bug where a relay can't reach its own dirport due to firewall rules. It's related to #18351, but not the same.

comment:4 Changed 3 years ago by toralf

FWIW :

Mar 23 19:19:36.000 [notice] Not advertising DirPort (Reason: AccountingMax enabled)

I do have :

# 20 TB/month: echo "20 * 1024^4 / 31 / 24 / 60 / 60 / 1024^2" | bc
# == 8017
#
#BandwidthRate   8 MB
AccountingMax 20 TB
AccountingRule out

since Mid of February or so

comment:5 Changed 3 years ago by teor

I wonder if this is related to #12538.

comment:6 Changed 3 years ago by teor

Keywords: regression must-fix-before-028-rc added
Status: newneeds_review

Please see my branch bug18616 in https://github.com/teor2345/tor.git

It fixes an issue where accounting limits weren't taken into account when determining whether the relay is a dir server, and whether it needs to check DirPort reachability or not.

comment:7 Changed 3 years ago by teor

Oops, this is also related to #18050, see the updated changes file and commit message in bug18616-v2.

comment:8 Changed 3 years ago by teor

(The code that logged this message didn't include the changes in #18348. So I think the underlying bugs are fixed, let's release 0.2.8.2-alpha to find out. I'll deal with the noisy logging issue in #18351.)

comment:9 Changed 3 years ago by toralf

I could ofc bisect the issue(s) but b/c this has to be made in the real world therefore I'm unsure if that's a good idea.

comment:10 in reply to:  9 Changed 3 years ago by teor

Replying to toralf:

I could ofc bisect the issue(s) but b/c this has to be made in the real world therefore I'm unsure if that's a good idea.

Probably best to wait for 0.2.8.2-alpha rather than using bisect on the real world tor network.

It should be out really soon, I think we have all the bugs we needed for the next alpha done now.

comment:11 Changed 3 years ago by nickm

I'm not so sure about whether this patch. router_should_be_directory_server has been used to decide whether we should _advertise_ directory support, not whether we should _provide_ it. It also has side-effects of changing the 'advertising' static variable.

Should the check become, looking at the last-returned value from router_should_be_directory_server, rather than recalculating it?

comment:12 Changed 3 years ago by nickm

Keywords: TorCoreTeam201603 added

comment:13 Changed 3 years ago by dgoulet

Status: needs_reviewneeds_revision

The patch sounds about right imo if dir_server_mode now calls router_should_be_directory_server, we end up doing twice here:

  ri->supports_tunnelled_dir_requests = dir_server_mode(options) &&
    router_should_be_directory_server(options, ri->dir_port);

Maybe this above should be fixed also by simply removing the extra call. It sounds right that we stop advertising our directory support when our AccountinMax has been reached or our bandwidth is not enough all of the sudden.

Putting this one in needs_revision because if we are comfortable with this patch, the above should be fixed.

comment:14 Changed 3 years ago by nickm

Teor, Dgoulet: Do you think this is an alpha-blocker?

comment:15 in reply to:  14 ; Changed 3 years ago by dgoulet

Replying to nickm:

Teor, Dgoulet: Do you think this is an alpha-blocker?

TL;DR; imo, this need more work so we could wait after this alpha release.

I think we need to "massage" a bit this patch. For instance, check_whether_dirport_reachable will now call dir_server_mode thus router_should_be_directory_server but then in decide_to_advertise_dirport we call both functions one after the other.

Furthermore, I'm not entirely sure about the addition of dir_server_mode to the dirport reachable check function. Actually, this whole function is confusing:

/** Return 1 if we don't have a dirport configured, or if it's reachable. */
int
check_whether_dirport_reachable(void)
{
  const or_options_t *options = get_options();
  return !options->DirPort_set ||
         !dir_server_mode(options) ||
         options->AssumeReachable ||
         net_is_disabled() ||
         can_reach_dir_port;
}

So OK, we return 1 if DirPort is _NOT_ configured but why are we using can_reach_dir_port without negating it? Also, with this patch, dir_server_mode also checks the ORPort so is it OK to use it in there at all?

comment:16 in reply to:  15 Changed 3 years ago by teor

Replying to dgoulet:

Replying to nickm:

Teor, Dgoulet: Do you think this is an alpha-blocker?

TL;DR; imo, this need more work so we could wait after this alpha release.

I think we need to "massage" a bit this patch. For instance, check_whether_dirport_reachable will now call dir_server_mode thus router_should_be_directory_server but then in decide_to_advertise_dirport we call both functions one after the other.

Yes, I find this group of functions quite tangled. I am not sure how to simplify them though.

Furthermore, I'm not entirely sure about the addition of dir_server_mode to the dirport reachable check function. Actually, this whole function is confusing:

/** Return 1 if we don't have a dirport configured, or if it's reachable. */
int
check_whether_dirport_reachable(void)
{
  const or_options_t *options = get_options();
  return !options->DirPort_set ||
         !dir_server_mode(options) ||
         options->AssumeReachable ||
         net_is_disabled() ||
         can_reach_dir_port;
}

So OK, we return 1 if DirPort is _NOT_ configured but why are we using can_reach_dir_port without negating it?

We treat the DirPort is "reachable" if we can't check it for any reason:

         !options->DirPort_set ||
         !dir_server_mode(options) ||
         options->AssumeReachable ||
         net_is_disabled() ||

Or if we have checked and it's reachable:

         can_reach_dir_port;

I think I added a comment to this effect in the branch. If not, we should.

Also, with this patch, dir_server_mode also checks the ORPort so is it OK to use it in there at all?

Yes, it's not really possible to run a directory mirror without an ORPort. So checking the ORPort is OK. But maybe this could be part of the code simplification?

Do you have any ideas here, dgoulet? I'm stuck, and focused on other things right now.

comment:17 Changed 3 years ago by nickm

Keywords: TorCoreTeam201604 added; TorCoreTeam201603 removed

comment:18 Changed 3 years ago by teor

I think I can update the patch so it cleanly answers the following questions:

  • do we have a dirport open?
    • if we have a dirport open, is the DirPort reachable?
      • if the DirPort is reachable, do we want to advertise it?
  • would we answer begindir requests?
    • if we would answer begindir requests, is the ORPort reachable?
      • if the ORPort is reachable, do we want to advertise begindir support?

As nickm notes in comment 11, I need to make sure we do the reachability and advertisability checks, store the results, and then use the results to answer these questions.

As dgoulet notes in 13 & 15, I should also remove redundant function calls.

comment:19 Changed 3 years ago by teor

Summary: Bug: Node search initiated by. Stack trace:Relay fails to self-test its DirPort with AccountingMax enabled

comment:20 in reply to:  18 Changed 3 years ago by teor

Actual Points: 16 hours
Points: medium
Reviewer: dgoulet
Status: needs_revisionneeds_review

I think part of this bug was due to #18623, and the logging is being tracked in #18849.

This leaves the fixes for #12538 and #18050:

Replying to teor:

I think I can update the patch so it cleanly answers the following questions:

  • do we have a dirport open?

get_options()->DirPort_set needs no changes.

  • if we have a dirport open, is the DirPort reachable?

check_whether_dirport_reachable() returns 0 when we need to do a reachability check,
and 1 if we've successfully done a reachability check, or if there's no reason to do a reachability check. It needs changes for clarity.

004e1c27 refactors common parts of the ORPort and DirPort check_*_reachable functions into router_reachability_checks_disabled(), and updates the function comments. It disables OR reahcability checks when net_is_disabled(), for consistency with Dir reachability checks.

#18851 updates the control-spec to clarify this behaviour.

  • if the DirPort is reachable, do we want to advertise it?

decide_to_advertise_dirport() needs no changes.

  • do we have an orport open?

get_options()->ORPort_set needs no changes.

  • would we answer begindir requests?

directory_permits_begindir_requests() needs no changes, but should be used to set supports_tunnelled_dir_requests. (See 4dda75fc below.)

This resolves a nasty edge case with bridges, which should always support begindir.

  • if we would answer begindir requests, is the ORPort reachable?

check_whether_orport_reachable() is refactored in 004e1c27, see notes above

  • if the ORPort is reachable, do we want to advertise begindir support?

4dda75fc creates decide_to_advertise_begindir(), like decide_to_advertise_dirport(), which should clean up a lot of nasty edge cases:

  • directory authorities should always support and advertise begindir,
  • ORs with disabled networks shouldn't advertise support for begindir (possibly not an issue, but done for consistency with DirPort),
  • ORPort reachability is required for advertisement (possibly not an issue, but done for consistency with DirPort),
  • we now call router_should_be_directory_server() once per descriptor upload, which I hope satisfies nickm's request to call it infrequently (NM1).

As nickm notes in comment 11, I need to make sure we do the reachability and advertisability checks, store the results, and then use the results to answer these questions.

NM1: This now happens once per descriptor upload, which means tor only logs when actually making the decision for each descriptor.

As dgoulet notes in 13 & 15, I should also remove redundant function calls.

DG1: I don't believe there are any redundant function calls in this patch.

Also, a28d98f4 adds the options used by these functions to the list of options we check to see if we need to rebuild our descriptor.

Finally, we need to update our descriptor when we change our mind about advertising the dirport or begindir support due to accounting max. We didn't do this in the past for DirPort, and we do it every 18 hours anyway, so I'm splitting it off into #18852 in case we want to defer it to 0.2.9 or later.

comment:21 Changed 3 years ago by teor

Oops, see my branch bug18616-v3 on https://github.com/teor2345/tor.git

comment:22 Changed 3 years ago by arma

1)

\ No newline at end of file

for the changes file will produce surprising results when somebody cats all the changes files together.

2)

+static int router_reachability_checks_disabled(void)
 {
   const or_options_t *options = get_options();

As a static helper function, it might be a bit cleaner to pass in options to it, especially since one of the two callers already has an 'options' hanging out ready to be passed in. We're certainly not consistent about this, but I think it's the righter thing to do.

3) I like 004e1c2704 but it looks like it does two things -- first is the refactor, and second is the change where check_whether_orport_reachable() considers net_is_disabled(). "Refactor" commits are easiest to review when they don't also change behavior. This could become two commits, the first adding the net_is_disabled() to check_whether_orport_reachable(), and the second adding the modular function along with a note "no actual changes in behavior" to make it clear that it's just a refactor.

3b) Does the behavior change from 004e1c2704 warrant a changelog entry? I think it probably does.

Also, I looked through the calls to check_whether_orport_reachable() and I don't think that behavior change will surprise us. Good.

3c) in control.c:

    } else if (!strcmp(question, "status/reachability-succeeded/or")) {
      *answer = tor_strdup(check_whether_orport_reachable() ? "1" : "0");
    } else if (!strcmp(question, "status/reachability-succeeded/dir")) {
      *answer = tor_strdup(check_whether_dirport_reachable() ? "1" : "0");
    } else if (!strcmp(question, "status/reachability-succeeded")) {
      tor_asprintf(answer, "OR=%d DIR=%d",
                   check_whether_orport_reachable() ? 1 : 0,
                   check_whether_dirport_reachable() ? 1 : 0);

are documented as

    "status/reachability-succeeded/or"
      0 or 1, depending on whether we've found our ORPort reachable.
    "status/reachability-succeeded/dir"
      0 or 1, depending on whether we've found our DirPort reachable.
    "status/reachability-succeeded"

I think the "we aren't trying to test reachability on it, so return 1" behavior might be surprising here. On the other hand, the reachability did *succeed*, in that we're not unhappy with the current state of things. Yuck. No need to fix it here but maybe we want to clean this up in the future?

4) In commit 4dda75fc0 we end up with a new function decide_to_advertise_begindir(), that looks like it shares a lot of code with the place you cut-and-pasted it from (decide_to_advertise_dirport())? Leaving these two functions in place together is begging for bugs in the future where one gets out of sync from the other.

5)

+  /* redundant - if DirPort is unreachable, we don't publish a descriptor */
   if (!check_whether_dirport_reachable())

I agree. Let's add a commit to take that check out?

6) It seems we still have a complicated mess of similar sounding functions, directory-permits-this, dirport-set-that, etc. Not something that needs to be fixed for this ticket, but certainly something worth working on for the future!

comment:23 Changed 3 years ago by nickm

Keywords: TorCoreTeam201605 added

These 0.2.8 items really should get handled in May. Or earlier.

comment:24 in reply to:  22 Changed 3 years ago by teor

Actual Points: 16 hours20 hours
Reviewer: dgouletdgoulet, arma

Replying to arma:

1)

\ No newline at end of file

for the changes file will produce surprising results when somebody cats all the changes files together.

A1: 2c9b85d fixup! Changes file for #18616
Also fixes a typo in the changes file name, and rewords one of the changes to make it clearer.

2)

+static int router_reachability_checks_disabled(void)
 {
   const or_options_t *options = get_options();

As a static helper function, it might be a bit cleaner to pass in options to it, especially since one of the two callers already has an 'options' hanging out ready to be passed in. We're certainly not consistent about this, but I think it's the righter thing to do.

It also makes it slightly easier to unit test the function.

A2: a12df41 Refactor common code out of reachability checks

3) I like 004e1c2704 but it looks like it does two things -- first is the refactor, and second is the change where check_whether_orport_reachable() considers net_is_disabled(). "Refactor" commits are easiest to review when they don't also change behavior. This could become two commits, the first adding the net_is_disabled() to check_whether_orport_reachable(), and the second adding the modular function along with a note "no actual changes in behavior" to make it clear that it's just a refactor.

A3:
0a4ac3e Reverts original commit 004e1c2, so I can split it up without a force push
00bb4d7 Avoid checking ORPort reachability when the network is disabled
a12df41 Refactor common code out of reachability checks

3b) Does the behavior change from 004e1c2704 warrant a changelog entry? I think it probably does.

A3b: 00bb4d7 Avoid checking ORPort reachability when the network is disabled

3c) in control.c:

    } else if (!strcmp(question, "status/reachability-succeeded/or")) {
      *answer = tor_strdup(check_whether_orport_reachable() ? "1" : "0");
    } else if (!strcmp(question, "status/reachability-succeeded/dir")) {
      *answer = tor_strdup(check_whether_dirport_reachable() ? "1" : "0");
    } else if (!strcmp(question, "status/reachability-succeeded")) {
      tor_asprintf(answer, "OR=%d DIR=%d",
                   check_whether_orport_reachable() ? 1 : 0,
                   check_whether_dirport_reachable() ? 1 : 0);

are documented as

    "status/reachability-succeeded/or"
      0 or 1, depending on whether we've found our ORPort reachable.
    "status/reachability-succeeded/dir"
      0 or 1, depending on whether we've found our DirPort reachable.
    "status/reachability-succeeded"

I think the "we aren't trying to test reachability on it, so return 1" behavior might be surprising here. On the other hand, the reachability did *succeed*, in that we're not unhappy with the current state of things. Yuck. No need to fix it here but maybe we want to clean this up in the future?

#18851 contains a branch with a control spec patch to clarify this behaviour

4) In commit 4dda75fc0 we end up with a new function decide_to_advertise_begindir(), that looks like it shares a lot of code with the place you cut-and-pasted it from (decide_to_advertise_dirport())? Leaving these two functions in place together is begging for bugs in the future where one gets out of sync from the other.

A4: 4e0e555 Refactor DirPort & begindir descriptor checks

The logic is slightly more complex now, but I think it's better than the alternatives:

  • duplicate code,
  • a macro that expands to the function body
  • a third variable in the refactored function indicating either a DirPort or begindir check

5)

+  /* redundant - if DirPort is unreachable, we don't publish a descriptor */
   if (!check_whether_dirport_reachable())

I agree. Let's add a commit to take that check out?

A5: 1416a28 Remove redundant descriptor checks for OR/Dir reachability

6) It seems we still have a complicated mess of similar sounding functions, directory-permits-this, dirport-set-that, etc. Not something that needs to be fixed for this ticket, but certainly something worth working on for the future!

A6: #18918 Clarify directory and ORPort checking functions

This code passes the unit tests and make test-network-all.

If you don't want to re-review commits that have stayed the same, the unsquashed version is in bug18616-v3

When nickm wants to merge:
bug18616-v3-squashed removes commit 004e1c2 and its revert 0a4ac3e, and rewords the commit message on 4dda75f to remove a redundant sentence

bug18616-v3-rebased is on the latest maint-0.2.8, but there were no conflicts in the rebase

comment:25 Changed 3 years ago by nickm

Calling all non-needs_information tickets for May.

comment:26 Changed 3 years ago by nickm

Keywords: TorCoreTeam-postponed-201604 added; TorCoreTeam201604 removed

April is over; calling these april tickets postponed into may.

comment:27 Changed 3 years ago by andrea

Owner: set to andrea
Status: needs_reviewassigned

comment:28 Changed 3 years ago by teor

Status: assignedneeds_review

comment:29 Changed 3 years ago by nickm

Keywords: review-group-1 added

comment:30 Changed 3 years ago by dgoulet

Reviewer: dgoulet, armaarma

comment:31 Changed 3 years ago by arma

Ok, I looked at bug18616-v3-squashed. Looks good to me.

I have a further cleanup patch on my side that I'm still working on. But I realized I'm confused about what's actually being fixed here.

Here's my in-progress new changes file:

  o Major bugfixes (directory mirrors):
    - Fix broken DirPort self-checks. Decide to advertise begindir
      support the same way we decide to advertise DirPorts.
      Resolves bug 18616; bugfix on 0.2.8.1-alpha. Patch by teor.

  o Minor bugfixes:
    - Add additional config options that might change the content of
      a relay's descriptor. Fixes more of bug 12538; bugfix on 0.2.8.1-alpha.
    - Avoid checking ORPort reachability when the network is disabled.

For the first entry, what exactly was broken about them? I think that's this ticket. What was the actual bug?

And then for the last entry, was that something that could happen to users in practice?

comment:32 in reply to:  31 Changed 3 years ago by teor

Replying to arma:

Ok, I looked at bug18616-v3-squashed. Looks good to me.

I have a further cleanup patch on my side that I'm still working on. But I realized I'm confused about what's actually being fixed here.

Here's my in-progress new changes file:

  o Major bugfixes (directory mirrors):
    - Fix broken DirPort self-checks. Decide to advertise begindir
      support the same way we decide to advertise DirPorts.
      Resolves bug 18616; bugfix on 0.2.8.1-alpha. Patch by teor.

  o Minor bugfixes:
    - Add additional config options that might change the content of
      a relay's descriptor. Fixes more of bug 12538; bugfix on 0.2.8.1-alpha.
    - Avoid checking ORPort reachability when the network is disabled.

For the first entry, what exactly was broken about them? I think that's this ticket. What was the actual bug?

DirPort self-checks were failing with error messages. They also weren't being scheduled and checked correctly. We fixed some of it in #18623, this is the cleanup / remainder.

And then for the last entry, was that something that could happen to users in practice?

Yes, if they did a HUP after changing any of those options, it would be 18 hours before they were reflected in the descriptor. This would be a problem if they added a quota and wanted the DirPort to stop working soon.

Edit: fix ticket number

Last edited 3 years ago by teor (previous) (diff)

comment:33 Changed 3 years ago by teor

Yes, if a relay operator sets DisableNetwork while the user updates the config, we shouldn't be testing ORPort reachability. I think it was still possible to do that.

comment:34 Changed 3 years ago by teor

Description: modified (diff)
Summary: Relay fails to self-test its DirPort with AccountingMax enabledMake begindir advertise checks consistent with DirPort checks

arma asked me to re-title the ticket and fix the description

comment:35 Changed 3 years ago by arma

Description: modified (diff)

I've deleted the distracting lines while I try to wrap my head around this.

comment:36 Changed 3 years ago by arma

Here are further hints from teor about what's going on in this patch:

<teor> no, I mean we have the wrong checks
<teor> the check used to be ri->supports_tunnelled_dir_requests =
dir_server_mode(options) && router_should_be_directory_server(options,
ri->dir_port);
<teor> now it's directory_permits_begindir_requests(), which is
options->BridgeRelay != 0 || dir_server_mode(options)
<teor> that fixes a bridge issue with advertising begindir support
<teor> then we also replace router->supports_tunnelled_dir_requests with
decide_to_advertise_begindir(options, router->supports_tunnelled_dir_requests)
<teor> which fixes a whole heap of checks that were being done for DirPorts,
but not for begindir

comment:37 Changed 3 years ago by teor

"0.2.8.1-alpha bridges were not advertising begindir support", but:

  • entry_guard_set_status() has a specific bridge check
  • bridge check in add_an_entry_guard
  • no bridge check in router_pick_directory_server_impl, but it's made up for by a second check in add_an_entry_guard

To summarise #18616, the missing bridge begindir support was a bug waiting to happen, compensated for by conditionals scattered through the code

And the other missing checks could have caused subtle bugs too

comment:38 Changed 3 years ago by arma

My bug18616-v4 branch has some fixes on the changes file, as well as some comment fixes, as well as a bit of refactoring to pass options around a bit more.

If you like these changes, I think we're all set here.

comment:39 in reply to:  38 Changed 3 years ago by teor

Actual Points: 20 hours25 hours
Status: needs_reviewmerge_ready

Replying to arma:

My bug18616-v4 branch has some fixes on the changes file, as well as some comment fixes, as well as a bit of refactoring to pass options around a bit more.

If you like these changes, I think we're all set here.

Looks good, let's get arma's bug18616-v4 merged to 0.2.8 to fix #12538, which is already in 0.2.8.

comment:40 Changed 3 years ago by arma

Owner: changed from andrea to teor
Status: merge_readyassigned

making the owner (writer of patch) accurate

comment:41 Changed 3 years ago by arma

Status: assignedneeds_review

comment:42 Changed 3 years ago by arma

Status: needs_reviewmerge_ready

comment:43 Changed 3 years ago by isabela

Points: medium3

comment:44 Changed 3 years ago by nickm

Reviewer: armaarma,teor
Status: merge_readyneeds_review

There is a merge conflict in router.c between this code and code that was introduced in b8b5bccfd9f350c for #19003.

Before I push this, I'd like one of you to review my merge commit in branch "bug18616-v4-merged_028" to make sure that I resolved the conflict correctly, and that we won't be creating more problems here.

I have confirmed that it compiles and the unit tests still pass. Beyond that, who can say.

comment:45 Changed 3 years ago by arma

Looks like we just added (options) to both of these lines. Looks fine to me.

comment:46 Changed 3 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Okay, merged. Thanks!

(Closing, and removing #18852 as a child ticket.)

comment:47 in reply to:  33 Changed 3 years ago by arma

Replying to teor:

Yes, if a relay operator sets DisableNetwork while the user updates the config, we shouldn't be testing ORPort reachability. I think it was still possible to do that.

For posterity, I found another edge case like this one while working on this patch, and filed it under #19069.

Note: See TracTickets for help on using tickets.