Opened 9 years ago

Closed 9 years ago

Last modified 7 years ago

#2716 closed defect (fixed)

When we conclude a relay is unreachable, we give it free uptime

Reported by: arma Owned by:
Priority: Medium Milestone: Tor: 0.2.2.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-auth
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

In researching #2714 (for #2709), I noticed:

Mar 10 15:44:21.000 [info] dirserv_orconn_tls_done(): Found router MesBotEU1 to be reachable. Yay.
Mar 10 16:24:04.000 [info] run_connection_housekeeping(): Expiring non-open OR connection to fd 1245 (78.47.251.152:667).
Mar 10 16:44:40.000 [info] run_connection_housekeeping(): Expiring non-open OR connection to fd 990 (78.47.251.152:667).
Mar 10 16:50:01.000 [info] rep_hist_note_router_unreachable(): Router EABCA5F5D71D926C4A425E09C8C7F3AA46850EF6 is now non-Running: it had previously been Running since 2011-03-09 19:41:41.  Its total weighted uptime is 1412377/1426655.
Mar 10 17:02:48.000 [info] dirserv_orconn_tls_done(): Found router MesBotEU1 to be reachable. Yay.
Mar 10 17:02:48.000 [info] rep_hist_note_router_reachable(): Router EABCA5F5D71D926C4A425E09C8C7F3AA46850EF6 is now Running; it had been down since 2011-03-10 16:50:01.

We do a reachability test every 21.3 minutes, so with REACHABLE_TIMEOUT at 45 minutes, that means you can fail one or two reachability tests and still get counted up. Fine.

But when you fail *more* than that, and we conclude you're down, should we assume you were up for the entire grace period?

Seems like we should conclude you went (retroactively) down at the first failed test -- 16:24:04 in this case.

While I'm at it, what happened to the reachability test around 16:04? I couldn't find any evidence of it in my logs. (But there are millions of lines here, so maybe I just didn't look carefully enough.)

(For the record, MesBotEU1 thinks it was up the whole time.)

Child Tickets

Change History (9)

comment:1 in reply to:  description ; Changed 9 years ago by arma

Replying to arma:

Seems like we should conclude you went (retroactively) down at the first failed test -- 16:24:04 in this case.

A patch for that approach might look like

diff --git a/src/or/dirserv.c b/src/or/dirserv.c
index aeeab45..6f81b31 100644
--- a/src/or/dirserv.c
+++ b/src/or/dirserv.c
@@ -970,7 +970,8 @@ dirserv_set_router_is_running(routerinfo_t *router, time_t n
 
   if (!answer && running_long_enough_to_decide_unreachable()) {
     /* not considered reachable. tell rephist. */
-    rep_hist_note_router_unreachable(router->cache_info.identity_digest, now);
+    rep_hist_note_router_unreachable(router->cache_info.identity_digest,
+      router->last_reachable ? router->last_reachable + 1280 : now);
   }
 
   router->is_running = answer;

except cleaned up to have 10 and 128 (both from main.c) be #defined constants.

There are other calls to rep_hist_note_router_unreachable(), including the one we just put in for bug #1035, but they don't bother me much.

comment:2 in reply to:  1 Changed 9 years ago by arma

Replying to arma:

+ rep_hist_note_router_unreachable(router->cache_info.identity_digest,
+ router->last_reachable ? router->last_reachable + 1280 : now);

Double extra bonus points if we don't let router->last_reachable + 1280 be later than now. :)

comment:3 Changed 9 years ago by arma

Milestone: Tor: 0.2.2.x-final
Status: newneeds_review

See my bug2716 branch for a possible patch.

comment:4 Changed 9 years ago by nickm

The logic in dirserv_set_router_is_running really needs a comment; it is fairly subtle. (In fact, I wasn't sure it was right till I verified that yes, we do keep launching reachability tests to routers even if we're already connected to them.)

Multiplying by ten to define REACHABILITY_TEST_PERIOD then dividing by REACHABILITY_TEST_PERIOD to recover "10" seems iffy. The 10 should probably be a constant too.

While we're touching this, some other cleanups I'd suggest:

REACHABILITY_MODULO_PER_TEST should probably turn into a maximum number of relays to test at a time, not a maximum fraction of the keyspace.

The code in main.c that invokes dirserv_test_reachability() uses an "if now % interval == 0" pattern that can skip intervals if we miss a second or double-up if we trigger twice in a second (less likely). Elsewhere we use "if (last_did_foo + foo_interval < now) { last_did_foo = now; ... }" as our periodic timer pattern in that function.

comment:5 Changed 9 years ago by nickm

see tweaks in my bug2716 branch. I didn't change the "1/128" thing, but maybe in 0.2.3 we should think about it.

comment:6 Changed 9 years ago by arma

See my tweaks to your tweaks in my bug2716 branch. Looks good to me now.

comment:7 Changed 9 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

merged & closing; thanks!

comment:8 Changed 7 years ago by nickm

Keywords: tor-auth added

comment:9 Changed 7 years ago by nickm

Component: Tor Directory AuthorityTor
Note: See TracTickets for help on using tickets.