HS intro circuit retry logic fails when network interface is down

changed milestone to %Tor: unspecified

added bootstrap component::core tor/tor milestone::Tor: unspecified network-down parent::16387 points::1.5 priority::medium severity::normal status::needs-revision tor-hs tor-retry type::defect labels

Suggested fix #1:

In the snippet above, if rend_service_launch_establish_intro() fails, we don't remove the intro point from service->intro_nodes or free it. Instead, we just continue to the next intro point. This way we will keep on retrying the old intro points every second, till the network comes back up.

The main assumption in the above fix is that rend_service_launch_establish_intro() can only fail for local reasons (bugs, OOM, interface down, etc.). If this is the case, then there is no reason to blame the intro point for this failure and we can just keep on retrying it till the local issues get resolved.

We should verify the assumption above before implementing the suggested fix.
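
For illustration only, the behaviour proposed above could be sketched as follows. This is a minimal, self-contained model and not the actual patch: the `*_sketch` types and functions are hypothetical stand-ins (Tor's real structures and the real rend_service_launch_establish_intro() are more involved), and the `network_up` flag merely simulates the interface state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT Tor's real structures. */
typedef struct {
  int id;
  int launch_attempts;
} intro_point_sketch_t;

typedef struct {
  intro_point_sketch_t nodes[8];
  size_t n_nodes;
} service_sketch_t;

/* Stand-in for rend_service_launch_establish_intro(): returns -1 while
 * the (simulated) network is down, 0 otherwise. */
static int
launch_establish_intro_sketch(intro_point_sketch_t *ip, int network_up)
{
  ip->launch_attempts++;
  return network_up ? 0 : -1;
}

/* One pass of the once-per-second check. Returns the number of circuits
 * launched. On failure the intro point is neither removed nor freed, so
 * the next pass retries the same intro point. */
static size_t
consider_intro_points_sketch(service_sketch_t *svc, int network_up)
{
  size_t launched = 0;
  assert(svc != NULL);
  for (size_t i = 0; i < svc->n_nodes; i++) {
    if (launch_establish_intro_sketch(&svc->nodes[i], network_up) < 0) {
      /* Assumed to be a local failure: keep the intro point, move on. */
      continue;
    }
    launched++;
  }
  return launched;
}
```

In this model, a pass with the network down launches nothing but leaves the list untouched, and a later pass with the network up reuses the same intro points.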

Trac:
Username: stephan
Cc: dgoulet, special, timonh to dgoulet, special, timonh, stephan

Replying to asn:

Suggested fix #1:

In the snippet above, if rend_service_launch_establish_intro() fails, we don't remove the intro point from service->intro_nodes or free it. Instead, we just continue to the next intro point. This way we will keep on retrying the old intro points every second, till the network comes back up.

Your fix seems to work for me. I switched the network interface and ended up with the same intro points. See the attached log.

The main assumption in the above fix is that rend_service_launch_establish_intro() can only fail for local reasons (bugs, OOM, interface down, etc.). If this is the case, then there is no reason to blame the intro point for this failure and we can just keep on retrying it till the local issues get resolved.

We should verify the assumption above before implementing the suggested fix.

If the assumption can be verified it would be nice to have the fix soon.

Trac:
Username: timonh

torifaceswitch.log

Trac:
Username: timonh

Thanks for testing, timonh.

I pushed a branch bug19522 in my repo, so that people can test further: https://gitweb.torproject.org/user/asn/tor.git

FWIW, I read some code to verify the assumption of comment:1 and it seems to be accurate. But more digging and testing is required to get better confidence, as those functions were quite hairy.

Trac:
Status: new to needs_review

Replying to asn:

The fix seems logical, but it relies on the strong assumption that a rend_service_launch_establish_intro() failure is always due to some "local issue". As far as I can tell, that does seem to be the case: if we can't launch a circuit, it's because we either can't get packets out on the wire or simply don't have enough information to do so (no consensus, for instance).

If you could, please add a comment to rend_service_launch_establish_intro() documenting the return value, in particular that -1 does not mean the intro point is bad per se, but rather that launching the circuit failed because of "local reachability" issues or because we lack the information needed to continue. I believe it is very important that we don't get this wrong, because retrying a bad intro point over and over is also a bad thing.

Trac:
Status: needs_review to needs_revision

Replying to dgoulet:

Hmm, as discussed on IRC, I believe that rend_service_launch_establish_intro() will return -1 for local issues in almost all cases, but there are cases where it could in theory return -1 for remote issues.

Specifically, that function can reach connection_or_connect(), which eventually calls connect(2), and connect(2) can in theory fail with remote errors such as ECONNREFUSED or ETIMEDOUT. I'm pretty sure this can't happen for hosts on actual remote networks (since the socket API is asynchronous), but there is no way to be sure.

The good thing here is that, IIUC, the only time we would want to remove an intro point in rend_consider_services_intro_points() is when connect() fails with one of the above remote errors. In all other cases we would want to keep the intro point, since the errors are local. Unfortunately, the return value of connect() is not exposed in rend_consider_services_intro_points().

A more correct patch would involve exposing the connect() return value in that function and only removing the intro point if it's a remote error code. This does not seem like a trivial patch, but it should not be too hard to do either.
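
As a rough sketch of that idea (not code from the bug19522 branch), the decision could look something like the following. The helper names and the `connect_errno` parameter are hypothetical: as noted above, the connect() error is not currently exposed at that level, so this assumes it has somehow been plumbed through. Only the two error codes named above are treated as remote; everything else is assumed to be local.

```c
#include <assert.h>
#include <errno.h>

typedef enum {
  FAILURE_LOCAL,   /* interface down, OOM, missing consensus, ... */
  FAILURE_REMOTE   /* the far end refused or never answered */
} failure_kind_sketch_t;

/* Classify an errno value from connect(2). Only ECONNREFUSED and
 * ETIMEDOUT count as remote here; this mapping is an assumption. */
static failure_kind_sketch_t
classify_connect_errno_sketch(int err)
{
  switch (err) {
    case ECONNREFUSED:
    case ETIMEDOUT:
      return FAILURE_REMOTE;
    default:
      return FAILURE_LOCAL;
  }
}

/* Decide whether an intro point should be dropped after a launch attempt.
 * launch_result is 0 on success or -1 on failure; connect_errno is the
 * (hypothetically exposed) errno from connect(2). Returns 1 to drop the
 * intro point, 0 to keep it and retry later. */
static int
should_drop_intro_point_sketch(int launch_result, int connect_errno)
{
  assert(launch_result == 0 || launch_result == -1);
  if (launch_result == 0)
    return 0;
  return classify_connect_errno_sketch(connect_errno) == FAILURE_REMOTE;
}
```

With this split, a failure caused by ENETDOWN or ENETUNREACH would keep the intro point for the next retry, while ECONNREFUSED or ETIMEDOUT would let it be replaced.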

Milestone renamed

Trac:
Milestone: Tor: 0.2.??? to Tor: 0.3.???

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

Trac:
Milestone: Tor: 0.3.??? to Tor: unspecified
Keywords: N/A deleted, tor-03-unspecified-201612 added

Remove an old triaging keyword.

Trac:
Keywords: tor-03-unspecified-201612 deleted, N/A added

Trac:
Sponsor: SponsorR-can to N/A

Trac:
Keywords: tor-hs deleted, tor-hs tor-retry bootstrap network-down added

Trac:
Sponsor: N/A to Sponsor8-can

Trac:
Sponsor: Sponsor8-can to N/A

changed time estimate to 12h

moved to tpo/core/tor#19522 (closed)
