Opened 2 years ago

Last modified 13 months ago

#19522 needs_revision defect

HS intro circuit retry logic fails when network interface is down

Reported by: asn Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-hs tor-retry bootstrap network-down
Cc: dgoulet, special, timonh, stephan Actual Points:
Parent ID: #16387 Points: 1.5
Reviewer: Sponsor: Sponsor8-can

Description

During investigations in #16387, we found out that mobile HSes won't retry their intro point circuits when their network interface is down.

This is a problem since mobile devices change their IP address all the time, and their network interface toggles on and off. Since the retry logic fails in this case, the mobile HS ends up rotating intro points everytime the interface goes up.

The problem is this chunk of code in rend_consider_services_intro_points():

    /* Let's try to rebuild circuit on the nodes we want to retry on. */
    SMARTLIST_FOREACH_BEGIN(retry_nodes, rend_intro_point_t *, intro) {
      r = rend_service_launch_establish_intro(service, intro);
      if (r < 0) {
        log_warn(LD_REND, "Error launching circuit to node %s for service %s.",
                 safe_str_client(extend_info_describe(intro->extend_info)),
                 safe_str_client(service->service_id));
        /* Unable to launch a circuit to that intro point, remove it from
         * the valid list so we can create a new one. */
        smartlist_remove(service->intro_nodes, intro);
        rend_intro_point_free(intro);
        continue;
      }
      intro->circuit_retries++;
    } SMARTLIST_FOREACH_END(intro);

When our interface is down, rend_service_launch_establish_intro() will fail immediately since it eventually calls connection_or_connect() which leads to a connect() call with no network.

As you can see from the code above, when that function fails immediately, we remove the intro point no questions asked, with no subsequent retries.

This is problematic, since the failure there is because of our local network, and does not indicate any issues with the intro point, so we should not ditch it so easily. Ideally, as long as we know that the error is local, we should probably keep on trying that same intro point.

Child Tickets

Attachments (1)

torifaceswitch.log (5.5 KB) - added by timonh 2 years ago.

Download all attachments as: .zip

Change History (13)

comment:1 Changed 2 years ago by asn

Suggested fix #1:

In the snippet above, if rend_service_launch_establish_intro() fails, we don't remove the intro point from service->intro_nodes or free it. Instead, we just continue to the next intro point. This way we will keep on retrying the old intro points every second, till the network comes back up.

The main assumption in the above fix is that rend_service_launch_establish_intro() can only fail for local reasons (bugs, OOM, interface down, etc.). If this is the case, then there is no reason to blame the intro point for this failure and we can just keep on retrying it till the local issues get resolved.

We should verify the assumption above before implementing the suggested fix.

comment:2 Changed 2 years ago by stephan

Cc: stephan added

comment:3 in reply to:  1 Changed 2 years ago by timonh

Replying to asn:

Suggested fix #1:

In the snippet above, if rend_service_launch_establish_intro() fails, we don't remove the intro point from service->intro_nodes or free it. Instead, we just continue to the next intro point. This way we will keep on retrying the old intro points every second, till the network comes back up.

Your fix seems to work for me. I switched the network interface and ended up with the same intro points. See the attached log.

The main assumption in the above fix is that rend_service_launch_establish_intro() can only fail for local reasons (bugs, OOM, interface down, etc.). If this is the case, then there is no reason to blame the intro point for this failure and we can just keep on retrying it till the local issues get resolved.

We should verify the assumption above before implementing the suggested fix.

If the assumption can be verified it would be nice to have the fix soon.

Changed 2 years ago by timonh

Attachment: torifaceswitch.log added

comment:4 Changed 2 years ago by asn

Status: newneeds_review

Thanks for the testing timonh.

I pushed a branch bug19522 in my repo, so that people can test further:
https://gitweb.torproject.org/user/asn/tor.git

FWIW, I read some code to verify the assumption of comment:1 and it seems to be accurate. But more digging and testing is required to get better confidence, as those functions were quite hairy.

comment:5 in reply to:  4 ; Changed 2 years ago by dgoulet

Status: needs_reviewneeds_revision

Replying to asn:

Thanks for the testing timonh.

I pushed a branch bug19522 in my repo, so that people can test further:
https://gitweb.torproject.org/user/asn/tor.git

FWIW, I read some code to verify the assumption of comment:1 and it seems to be accurate. But more digging and testing is required to get better confidence, as those functions were quite hairy.

The fix seems logical but adds some _powerful_ assumptions that rend_service_launch_establish_intro() failure is always due to some "local issues". As far as I can tell, it seems to be the case that basically if we can't launch a circuit it's because we just can't get packet out of the wire or we simply don't have enough information to be able to do so (consensus for instance).

If you could, adding a comment to rend_service_launch_establish_intro() documenting the returned value and in this case the -1 being that it doesn't mean the intro point is _bad_ per-se but rather the failing of launching a circuit is due to "local reachability" or not enough information to continue issues. This is _very_ important that we don't mess it up I believe because retrying over and over a bad intro point is also a bad thing.

comment:6 in reply to:  5 Changed 2 years ago by asn

Replying to dgoulet:

Replying to asn:

Thanks for the testing timonh.

I pushed a branch bug19522 in my repo, so that people can test further:
https://gitweb.torproject.org/user/asn/tor.git

FWIW, I read some code to verify the assumption of comment:1 and it seems to be accurate. But more digging and testing is required to get better confidence, as those functions were quite hairy.

The fix seems logical but adds some _powerful_ assumptions that rend_service_launch_establish_intro() failure is always due to some "local issues". As far as I can tell, it seems to be the case that basically if we can't launch a circuit it's because we just can't get packet out of the wire or we simply don't have enough information to be able to do so (consensus for instance).

If you could, adding a comment to rend_service_launch_establish_intro() documenting the returned value and in this case the -1 being that it doesn't mean the intro point is _bad_ per-se but rather the failing of launching a circuit is due to "local reachability" or not enough information to continue issues. This is _very_ important that we don't mess it up I believe because retrying over and over a bad intro point is also a bad thing.

Hmm, as discussed on IRC, I believe that rend_service_launch_establish_intro() will return -1 for local issues in almost all cases, but there are cases where it could in theory return -1 for remote issues.

Specifically, you can reach connection_or_connect() from that function which will eventually call connect(2), which can in theory fail with remote errors such as ECONNREFUSED or ETIMEDOUT. I'm pretty sure this can't happen in actual remote networks (since the socket API is asynch), but there is no way to be sure.

The good thing here, is that IIUC the only time we would want to remove an intro point in rend_consider_services_intro_points() is only when connect() fails with the above remote errors. In all the other cases, we would want to keep the intro point since it's just local errors. Unfortunately, the retval of connect() is not exposed in rend_consider_services_intro_points().

A more correct patch here would involve exposing the connect() retval in that function and only removing the intro point if it's a remote error code. This does not seem like a trivial patch here, but also not too hard to do.

comment:7 Changed 21 months ago by teor

Milestone: Tor: 0.2.???Tor: 0.3.???

Milestone renamed

comment:8 Changed 20 months ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.???Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:9 Changed 15 months ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:10 Changed 15 months ago by dgoulet

Sponsor: SponsorR-can

comment:11 Changed 14 months ago by nickm

Keywords: tor-retry bootstrap network-down added

comment:12 Changed 13 months ago by nickm

Sponsor: Sponsor8-can
Note: See TracTickets for help on using tickets.