Opened 11 years ago

Last modified 7 years ago

#969 closed defect (Fixed)

Directory authorities have different opinions on MTBF and WFU

Reported by: karsten Owned by: karsten
Priority: Low Milestone:
Component: Core Tor/Tor Version: 0.2.1.14-rc
Severity: Keywords:
Cc: karsten, nickm, Sebastian, arma Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Directory authorities turn out to have very different opinions on relays'
MTBF (mean time between failures) and WFU (weighted fractional uptime). As
a result, they vote differently on the Guard and Stable flags:

http://freehaven.net/~karsten/metrics/relayflags-2009-04-01.pdf

One reason might be false assumptions about which relays are running, as
reflected in the router-stability files. If a relay is running, the
corresponding MTBF line contains the starting time. That starting time is
used to include the running session in the MTBF and WFU calculations. An
analysis of three router-stability files shows that the authorities think
between 6K and 24K relays are currently running, which is wrong:

$ grep "MTBF" ides-2009-04-14 | grep "S=" | wc -l
24082
$ grep "MTBF" gabelmoo-2009-04-15 | grep "S=" | wc -l
9206
$ grep "MTBF" moria1-2009-04-15 | grep "S=" | wc -l
6395

These lines are never removed from the router-stability files, so whenever
these relays come back, they appear to be uber-stable, which of course they
are not.
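
To see why a stale starting time matters, here is a minimal sketch in C. It
is not Tor's actual computation (which also weights older history
differently); the struct and function names are hypothetical. It only shows
that treating a run as ongoing from a stale start until now credits all of
the intervening time as uptime.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical, simplified history record; not Tor's or_history_t. */
typedef struct {
  time_t start_of_run;   /* 0 if the relay is believed to be down */
  long completed_uptime; /* seconds of uptime in finished runs */
  long total_observed;   /* seconds observed, excluding the open run */
} relay_hist_sketch_t;

/* Unweighted fractional uptime at time `now`, counting an open run
 * (start_of_run != 0) as uptime lasting until `now`. */
static double
fractional_uptime_sketch(const relay_hist_sketch_t *h, time_t now)
{
  long up = h->completed_uptime;
  long total = h->total_observed;
  assert(h->start_of_run <= now);
  if (h->start_of_run) {
    long open_run = (long)(now - h->start_of_run);
    up += open_run;
    total += open_run;
  }
  return total > 0 ? (double)up / (double)total : 0.0;
}
```

For example, a relay with 1 hour of uptime in 10 observed hours scores 0.1.
If its run is wrongly left open for a further 90 hours, the same function
returns 0.91.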

The problem is that this starting time is only reset to 0 in a few edge
cases, via rep_hist_note_router_unreachable() in rephist.c. This function
should be called whenever a relay has gone offline, which is of course
difficult to know.

As a possible solution, Tor could check during maintenance when each relay
was last contacted. If that was more than twice the reachability timeout
ago, the relay should be marked as unreachable in rephist.c, too. A simple
patch (with some code duplicated from dirserv_set_router_is_running() in
dirserv.c) would look like this:

Index: src/or/rephist.c
===================================================================
--- src/or/rephist.c (revision 19341)
+++ src/or/rephist.c (working copy)
@@ -658,6 +658,22 @@
     digestmap_iter_get(orhist_it, &d1, &or_history_p);
     or_history = or_history_p;
 
+#define DOUBLE_REACHABLE_TIMEOUT (2*45*60)
+    /* If we are an authority, check if this router is still running. */
+    if (authority && !or_history->start_of_run) {
+      char time_buf[ISO_TIME_LEN+1];
+      routerinfo_t *router = router_get_by_digest(d1);
+      if (!router || (router_is_me(router) && we_are_hibernating()) ||
+          (!get_options()->AssumeReachable &&
+           before >= router->last_reachable + DOUBLE_REACHABLE_TIMEOUT)) {
+        format_iso_time(time_buf, before);
+        log_info(LD_DIR, "When cleaning the reputation history at %s, "
+                 "we found that router %s is not running anymore.",
+                 time_buf, hex_str(d1, DIGEST_LEN));
+        rep_hist_note_router_unreachable(d1, before);
+      }
+    }
+    /* Now decide if we want to keep it. */
     remove = authority ? (or_history->total_run_weights < STABILITY_EPSILON &&
                           !or_history->start_of_run)
                        : (or_history->changed < before);
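
The timing condition in the patch can be isolated as a small predicate. The
following is a self-contained sketch, not part of the patch: the function
name and its parameters are hypothetical, and the value of the constant is
copied from the DOUBLE_REACHABLE_TIMEOUT definition above.

```c
#include <assert.h>
#include <time.h>

/* Twice a 45-minute reachability timeout, as in the patch above. */
#define DOUBLE_REACHABLE_TIMEOUT_SKETCH (2*45*60)

/* Return 1 if a relay last found reachable at `last_reachable` should be
 * treated as no longer running at time `now`, else 0.
 * `assume_reachable` stands in for the AssumeReachable option. */
static int
relay_seems_down_sketch(time_t now, time_t last_reachable,
                        int assume_reachable)
{
  assert(now >= 0 && last_reachable >= 0);
  if (assume_reachable)
    return 0;
  return now >= last_reachable + DOUBLE_REACHABLE_TIMEOUT_SKETCH;
}
```

With these values, a relay last reached 5399 seconds ago still counts as
running, while one last reached 5400 seconds (90 minutes) ago does not.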

[Automatically added by flyspray2trac: Operating System: All]

Change History (7)

comment:1 Changed 11 years ago by arma

See also bug 696, which has early info on this same bug (I think).

comment:2 Changed 11 years ago by karsten

These are some results from a discussion on #tor-dev on May 18, 2009:

There might be better approaches than iterating over the whole routerlist
periodically to mark routers as unreachable:

  • When routers are removed from the routerlist, they should be marked as
    unreachable in the rephist, too. See routerlist_remove() in
    routerlist.c.

  • When generating a consensus, all relays that an authority thinks are
    unreachable are marked as such in the rephist. The requirement
    "if (router->is_running && !answer) {" in dirserv.c would be relaxed to
    "if (!answer) {".

  • Before writing the router-stability file, the list of routers in the
    rephist would be compared with the routerlist. Those routers which the
    authority thinks are unreachable would be marked as unreachable in the
    rephist. The idea of performing this check before writing the
    router-stability file is that the authority must have been running for
    30 minutes at that point, collecting up-to-date information about relay
    availability. In any case, this check needs to be done in a non-O(n^2)
    way.
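
One way to keep the comparison in the third approach out of O(n^2) is to
sort one list of digests and binary-search it for each entry of the other,
which costs O((n + m) log n). The sketch below assumes 20-byte identity
digests and uses only the C standard library; all names in it are
hypothetical. Tor already keeps the rephist in a digestmap, so a real
implementation might use map lookups instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DIGEST_LEN_SKETCH 20  /* identity digests assumed to be 20 bytes */

/* Order two digests bytewise, for qsort() and bsearch(). */
static int
compare_digests_sketch(const void *a, const void *b)
{
  return memcmp(a, b, DIGEST_LEN_SKETCH);
}

/* Count the entries of `hist` (n_hist digests) that do not appear in
 * `running` (n_running digests), setting down_out[i] to 1 for each such
 * entry and 0 otherwise. Sorts `running` in place. */
static size_t
find_missing_sketch(char (*running)[DIGEST_LEN_SKETCH], size_t n_running,
                    char (*hist)[DIGEST_LEN_SKETCH], size_t n_hist,
                    int *down_out)
{
  size_t i, n_down = 0;
  assert(running && hist && down_out);
  qsort(running, n_running, DIGEST_LEN_SKETCH, compare_digests_sketch);
  for (i = 0; i < n_hist; ++i) {
    void *hit = bsearch(hist[i], running, n_running, DIGEST_LEN_SKETCH,
                        compare_digests_sketch);
    down_out[i] = (hit == NULL);
    n_down += (size_t)down_out[i];
  }
  return n_down;
}
```

Each history entry flagged in down_out would then be the one to mark as
unreachable before the file is written.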

comment:3 Changed 10 years ago by arma

Check out f0e3834d4aa669881026edddfb2b334dc7543b35 in my 969 branch on
git.torproject.org/~arma/git/tor.git

I implemented all three of the above plans.

Alas, my directory authority is busy [gathering stats about] saving the
world, so I haven't tested any of it yet. If you do test, you'll want to
back up your router-stability file in case it gets infested with gremlins
or whatever.

comment:4 Changed 10 years ago by arma

I just tested my patch by setting up moriatest as a v3 authority, and
running the patch. After a half hour, it wrote a new router-stability
file that looks a lot better.

Before:
$ grep MTBF router-stability |grep S=|wc -l
7242

After:
$ grep MTBF router-stability |grep S=|wc -l
1833

I think we should put this patch in 0.2.1.17-rc. It's a big bugfix.

comment:5 Changed 10 years ago by nickm

Branch merged into 0.2.1.

comment:6 Changed 10 years ago by nickm

flyspray2trac: bug closed.

comment:7 Changed 7 years ago by nickm

Component: Tor Relay → Tor