Opened 8 months ago

Closed 5 months ago

#25415 closed defect (fixed)

moria1 seg faults on testing relay reachability

Reported by: arma Owned by: nickm
Priority: High Milestone: Tor: 0.3.3.x-final
Component: Core Tor/Tor Version: Tor: 0.3.3.2-alpha
Severity: Blocker Keywords: 033-backport, tor-dirauth, crash, 033-must, review-group-34
Cc: Actual Points:
Parent ID: Points: 0.5
Reviewer: Sponsor:

Description

[...]
Mar 03 15:09:16.138 [notice] This version of Tor (0.3.3.3-alpha-dev) is newer than any recommended version, according to the directory authorities. Recommended versions are: 0.2.5.16,0.2.5.17,0.2.9.14,0.2.9.15,0.3.1.9,0.3.1.10,0.3.2.8-rc,0.3.2.9,0.3.2.10,0.3.3.1-alpha,0.3.3.2-alpha,0.3.3.3-alpha

============================================================ T= 1520107762
Tor 0.3.3.3-alpha-dev (git-6d44cf66c7cca6e0) died: Caught signal 11
../git/src/or/tor(+0x18d2aa)[0x7f7b748152aa]
../git/src/or/tor(tor_memeq+0x27)[0x7f7b748355e7]
../git/src/or/tor(tor_memeq+0x27)[0x7f7b748355e7]
../git/src/or/tor(router_ed25519_id_is_me+0x2a)[0x7f7b7472472a]
../git/src/or/tor(connection_or_connect+0x1a8)[0x7f7b747a1ad8]
../git/src/or/tor(channel_tls_connect+0xaa)[0x7f7b74758aaa]
../git/src/or/tor(dirserv_single_reachability_test+0xa8)[0x7f7b747c4108]
../git/src/or/tor(routerlist_descriptors_added+0x90)[0x7f7b7472b0a0]
../git/src/or/tor(router_load_routers_from_string+0x438)[0x7f7b747336a8]
../git/src/or/tor(+0xab81c)[0x7f7b7473381c]
../git/src/or/tor(router_reload_router_list+0x26)[0x7f7b74733926]
../git/src/or/tor(do_main_loop+0x2fb)[0x7f7b746db17b]
../git/src/or/tor(tor_run_main+0x29b)[0x7f7b746db88b]
../git/src/or/tor(tor_main+0x43)[0x7f7b746d61a3]
../git/src/or/tor(main+0x19)[0x7f7b746d6039]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f7b731f8d1d]
../git/src/or/tor(+0x4df49)[0x7f7b746d5f49]
Aborted   
#0  0x00007ffff7ca55e7 in tor_memeq (a=<value optimized out>,
    b=<value optimized out>, sz=<value optimized out>)
    at src/common/di_ops.c:110
#1  0x00007ffff7b9472a in router_ed25519_id_is_me (id=<value optimized out>)
    at src/or/routerkeys.c:1248
#2  0x00007ffff7c11ad8 in connection_or_connect (_addr=0x7fffffffe180,
    port=443, id_digest=0x7fffeb96c8ac "\fÚ0ÍÛRku>K§\t8p;ïl´¸áv\237\232Z",
    ed_id=0x20, chan=0x7ffffecb8230) at src/or/connection_or.c:1210
#3  0x00007ffff7bc8aaa in channel_tls_connect (addr=0x7fffffffe180, port=443,
    id_digest=0x7fffeb96c8ac "\fÚ0ÍÛRku>K§\t8p;ïl´¸áv\237\232Z", ed_id=0x20)
    at src/or/channeltls.c:206
#4  0x00007ffff7c34108 in dirserv_single_reachability_test (
    now=<value optimized out>, router=0x7fffeb96c880) at src/or/dirserv.c:3405
#5  0x00007ffff7b9b0a0 in routerlist_descriptors_added (sl=0x7ffffb9b2900,
    from_cache=1) at src/or/routerlist.c:4280
#6  0x00007ffff7ba36a8 in router_load_routers_from_string (
    s=0x7fffee9720f8 "", eos=<value optimized out>,
    saved_location=<value optimized out>, requested_fingerprints=0x0,
    descriptor_digests=0, prepend_annotations=<value optimized out>)
    at src/or/routerlist.c:4415
#7  0x00007ffff7ba381c in router_reload_router_list_impl (store=0x7ffff81306c0)
    at src/or/routerlist.c:1565
#8  0x00007ffff7ba3926 in router_reload_router_list ()
    at src/or/routerlist.c:1607
#9  0x00007ffff7b4b17b in do_main_loop () at src/or/main.c:2671
#10 0x00007ffff7b4b88b in tor_run_main (tor_cfg=<value optimized out>)
    at src/or/main.c:4064
#11 0x00007ffff7b461a3 in tor_main (argc=3, argv=0x7fffffffe638)
    at src/or/tor_api.c:84
#12 0x00007ffff7b46039 in main (argc=<value optimized out>,
    argv=<value optimized out>) at src/or/tor_main.c:32
#0  0x00007ffff7ca55e7 in tor_memeq (a=<value optimized out>, 
    b=<value optimized out>, sz=<value optimized out>)
    at src/common/di_ops.c:110
110         const uint8_t byte_diff = *ba++ ^ *bb++;
(gdb) up
#1  0x00007ffff7b9472a in router_ed25519_id_is_me (id=<value optimized out>)
    at src/or/routerkeys.c:1248
1248        ed25519_pubkey_eq(id, &master_identity_key->pubkey);
(gdb) up
#2  0x00007ffff7c11ad8 in connection_or_connect (_addr=0x7fffffffe180, 
    port=443, id_digest=0x7fffeb96c8ac "\fÚ0ÍÛRku>K§\t8p;ïl´¸áv\237\232Z", 
    ed_id=0x20, chan=0x7ffffecb8230) at src/or/connection_or.c:1210
1210      if (server_mode(options) && router_ed25519_id_is_me(ed_id)) {
(gdb) print ed_id
$4 = (const ed25519_public_key_t *) 0x20

Child Tickets

Change History (22)

comment:1 Changed 8 months ago by arma

(gdb) up
#4  0x00007ffff7c34108 in dirserv_single_reachability_test (
    now=<value optimized out>, router=0x7fffeb96c880) at src/or/dirserv.c:3405
3405      chan = channel_tls_connect(&router_addr, router->or_port,
(gdb) print ed_id_key
$5 = (const ed25519_public_key_t *) 0x20
(gdb) print router->cache_info.signing_key_cert
$6 = (struct tor_cert_st *) 0x0

comment:2 Changed 8 months ago by arma

Mar 03 15:35:52.252 [info] dirserv_router_get_status(): Descriptor from router $0CDA30CDDB526B753E4BA70938703BEF6CB4B8E1~matress at 94.16.123.176 has no Ed25519 key, when we previously knew an Ed25519 for it. Ignoring for now, since Ed25519 keys are fairly new.
(gdb) print *router
[...]
  omit_from_vote = 0, pv = {protocols_known = 1, supports_extend2_cells = 1,
    supports_ed25519_link_handshake_compat = 0,
    supports_ed25519_link_handshake_any = 0, supports_ed25519_hs_intro = 0,

So node_supports_ed25519_link_authentication() ought to be returning 0.

comment:3 Changed 8 months ago by arma

I added a log message, and node_supports_ed25519_link_authentication() is indeed returning 1.

Theory: *this* descriptor doesn't have an ed25519 key, but a previous one did. So in dirserv_single_reachability_test(), "router" doesn't have an ed25519 key, but node_supports_ed25519_link_authentication(node) doesn't look at router, it looks at node, and node *does* have an ed25519 key, from a previous descriptor that we learned it from.

comment:4 Changed 8 months ago by arma

Confirmed. I added this patch:

diff --git a/src/or/dirserv.c b/src/or/dirserv.c
index 0f47a83..0a9420f 100644
--- a/src/or/dirserv.c
+++ b/src/or/dirserv.c
@@ -3393,6 +3393,7 @@ dirserv_single_reachability_test(time_t now, routerinfo_t 
 
   if (options->AuthDirTestEd25519LinkKeys &&
       node_supports_ed25519_link_authentication(node, 1)) {
+    tor_assert(router->cache_info.signing_key_cert);
     ed_id_key = &router->cache_info.signing_key_cert->signing_key;
   } else {
     ed_id_key = NULL;

and it triggers the assert.

(gdb) up
#2  0x00007ffff7c34316 in dirserv_single_reachability_test (
    now=<value optimized out>, router=0x7fffeb96c880) at src/or/dirserv.c:3396
3396        tor_assert(router->cache_info.signing_key_cert);
(gdb) print router->cache_info
$1 = {signed_descriptor_body = 0x0, annotations_len = 57,
  signed_descriptor_len = 1648,
  signed_descriptor_digest = "õÃ\200\227:Ë\226\225\217Õv\035Pk¯­\215\202³ó",
  identity_digest = "\fÚ0ÍÛRku>K§\t8p;ïl´¸á", published_on = 1520082806,
  extra_info_digest = "\216\214·¬ Æ1Su}m\203ÆÅVÔÑõ\217¤",
  extra_info_digest256 = '\000' <repeats 31 times>, signing_key_cert = 0x0,
  ei_dl_status = {next_attempt_at = 0, n_download_failures = 0 '\000',
    n_download_attempts = 0 '\000', schedule = DL_SCHED_GENERIC,
    want_authority = DL_WANT_ANY_DIRSERVER,
    increment_on = DL_SCHED_INCREMENT_FAILURE,
    last_backoff_position = 0 '\000', last_delay_used = 0},
  saved_location = SAVED_IN_CACHE, saved_offset = 39789237,
  routerlist_index = 324, last_listed_as_valid_until = 0, do_not_cache = 0,
  is_extrainfo = 0, extrainfo_is_bogus = 0, send_unencrypted = 1}

comment:5 Changed 8 months ago by arma

Yep, I edited node_supports_ed25519_link_authentication() to print out what it's returning, and it is indeed returning 1 because supports_ed25519_link_handshake_compat=1.

ahf points to commit 7e504515b3 which I think is the right commit to look at.

comment:6 Changed 8 months ago by arma

Milestone: Tor: 0.3.2.x-final
Version: Tor: 0.3.3.2-alpha

Looks like bug #25008 went into 0.3.3.2-alpha.

comment:7 Changed 8 months ago by arma

Milestone: Tor: 0.3.2.x-finalTor: 0.3.3.x-final

comment:8 Changed 8 months ago by arma

Ok, 0.3.2.10 runs ok. Sounds like it's an 0.3.3 bug.

comment:9 Changed 8 months ago by teor

Keywords: tor-dirauth crash 033-must added
Points: 0.5
Priority: MediumHigh
Severity: NormalBlocker

We can't release 0.3.3 with an authority crash bug.

comment:10 Changed 8 months ago by nickm

Owner: set to nickm
Status: newaccepted

comment:11 Changed 8 months ago by nickm

Status: acceptedneeds_review

Likely fix in branch bug25415.

comment:12 Changed 8 months ago by teor

Looks good to me, I don't think chutney will catch this issue, so let's ask arma to test it?

comment:13 Changed 8 months ago by arma

moria1 now boots happily using git master -- without this change.

When the bug was happening, I made a backup copy of my datadir... but moria1 now happily runs with that old datadir too.

So, it looks like a fun heisenbug, and I won't be able to confirm that the patch fixes it. :(

The patch looks plausible though.

comment:14 Changed 8 months ago by nickm

Keywords: review-group-34 added

comment:15 Changed 8 months ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Okay. I'm merging this to 0.3.3. But let's reopen if this doesn't go away, or if it shows up in 0.3.2, or anything like that.

comment:16 Changed 7 months ago by arma

Ok, I just hit this on moria1 again (after it suffered a sad power loss mid thought). I started moria1 three times and it died quickly each time.

Then I did a git pull to a later version that has this patch, and restarted, and it did not die.

I'm calling that a good sign that this bug is fixed.

comment:17 Changed 5 months ago by arma

Resolution: fixed
Status: closedreopened

The bug is back, running maint-0.3.3 (the one we're hoping to call stable soon).

I confirmed by adding the

+    tor_assert(router->cache_info.signing_key_cert);

line back, and the assert triggers.

May 21 20:51:04.207 [err] Bug: Assertion router->cache_info.signing_key_cert fai
led in dirserv_single_reachability_test at src/or/dirserv.c:3438. Stack trace: (on Tor 0.3.3.5-rc-dev 3c4353179f230476)

comment:18 Changed 5 months ago by arma

Ok, turns out the reason might be simple: we merged nickm's bug25415 into 0.3.4 but not into 0.3.3?

comment:19 Changed 5 months ago by arma

Ok, I'm running maint-0.3.3 with commit 699bb80 manually applied and it looks like it's starting.

comment:20 Changed 5 months ago by arma

Status: reopenedmerge_ready

comment:21 Changed 5 months ago by teor

Keywords: 033-backport added

comment:22 Changed 5 months ago by nickm

Resolution: fixed
Status: merge_readyclosed

Oops; good diagnosis, everybody.

Cherry-picked into 0.3.3 as 3d126632430fe6.

Note: See TracTickets for help on using tickets.