Opened 7 years ago

Closed 7 years ago

#7086 closed defect (fixed)

moria1 asserts on startup

Reported by: arma Owned by:
Priority: High Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-auth
Cc: andrea Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Oct 10 23:23:01.238 [notice] Bootstrapped 90%: Establishing a Tor circuit.
Oct 10 23:23:10.342 [notice] Self-testing indicates your ORPort is reachable from the outside. Excellent. Publishing server descriptor.
Oct 10 23:23:10.632 [warn] tor_addr_is_internal(): Bug: tor_addr_is_internal() called with a non-IP address of type 0
Oct 10 23:23:10.821 [warn] tor_addr_is_internal(): Bug: tor_addr_is_internal() called with a non-IP address of type 0
Oct 10 23:23:11.009 [err] channel_change_state(): Bug: src/or/channel.c:1913: channel_change_state: Assertion !(chan->incoming_queue) || smartlist_len(chan->incoming_queue) == 0 failed; aborting.
src/or/channel.c:1913 channel_change_state: Assertion !(chan->incoming_queue) || smartlist_len(chan->incoming_queue) == 0 failed; aborting.

The tor_addr_is_internal bug lines appear to be begindir requests:

Oct 10 23:23:10.819 [info] connection_exit_connect_dir(): Opening local connection for anonymized directory exit
Oct 10 23:23:10.821 [warn] tor_addr_is_internal(): Bug: tor_addr_is_internal() called with a non-IP address of type 0
Oct 10 23:23:10.821 [info] _connection_free(): Freeing linked Directory connection [writing] with 0 bytes on inbuf, 0 on outbuf.

I'm working on getting a traceback, but the PIE stuff makes it hard.

Child Tickets

Change History (20)

comment:1 Changed 7 years ago by arma

Keywords: tor-auth added; tor-authority removed

comment:2 Changed 7 years ago by nickm

Cc: andrea added

comment:3 Changed 7 years ago by arma

#0  0x00007fb6c6bde1b5 in *__GI_raise (sig=<value optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007fb6c6be0fc0 in *__GI_abort () at abort.c:92
#2  0x0000000000465f5b in channel_change_state (chan=0x616cd60, 
    to_state=CHANNEL_STATE_CLOSED) at src/or/channel.c:1823
#3  0x00000000004a90bb in connection_or_about_to_close (or_conn=0x616ceb0)
    at src/or/connection_or.c:599
#4  0x000000000040b82d in connection_unlink (conn=0x2b59) at src/or/main.c:384
#5  0x000000000040ba98 in conn_close_if_marked () at src/or/main.c:931
#6  close_closeable_connections () at src/or/main.c:701
#7  0x000000000040cb18 in run_scheduled_events (timer=<value optimized out>, 
    arg=<value optimized out>) at src/or/main.c:1534
#8  second_elapsed_callback (timer=<value optimized out>, 
    arg=<value optimized out>) at src/or/main.c:1690
#9  0x00007fb6c7917344 in event_base_loop () from /usr/lib/libevent-1.4.so.2
#10 0x000000000040a021 in do_main_loop () at src/or/main.c:1986
#11 0x000000000040a35d in tor_main (argc=<value optimized out>, 
    argv=0x7fff2003e038) at src/or/main.c:2694
#12 0x00007fb6c6bcac8d in __libc_start_main (main=<value optimized out>, 
    argc=<value optimized out>, ubp_av=<value optimized out>, 
    init=<value optimized out>, fini=<value optimized out>, 
    rtld_fini=<value optimized out>, stack_end=0x7fff2003e028)
    at libc-start.c:228
#13 0x0000000000408809 in _start ()
(gdb) up
#2  0x0000000000465f5b in channel_change_state (chan=0x616cd60, 
    to_state=CHANNEL_STATE_CLOSED) at src/or/channel.c:1823
1823      tor_assert(channel_state_is_valid(from_state));
(gdb) print from_state
$1 = CHANNEL_STATE_CLOSING
(gdb) print *chan
$2 = {magic = 2316903463, state = CHANNEL_STATE_CLOSED,
  global_identifier = 4015, registered = 1 '\001',
  reason_for_closing = CHANNEL_CLOSE_FROM_BELOW,
  timestamp_created = 1349927428, timestamp_active = 1349927458, free = 0, 
  close = 0x46bbf0 <channel_tls_close_method>, 
  describe_transport = 0x46bb20 <channel_tls_describe_transport_method>, 
  dumpstats = 0, cell_handler = 0, var_cell_handler = 0, 
  get_remote_addr = 0x46c110 <channel_tls_get_remote_addr_method>, 
  get_remote_descr = 0x46b9b0 <channel_tls_get_remote_descr_method>, 
  has_queued_writes = 0x46b8c0 <channel_tls_has_queued_writes_method>, 
  is_canonical = 0x467fd0 <channel_tls_is_canonical_method>, 
  matches_extend_info = 0x46a600 <channel_tls_matches_extend_info_method>, 
  matches_target = 0x46a4c0 <channel_tls_matches_target_method>, 
  write_cell = 0x46b780 <channel_tls_write_cell_method>, 
  write_packed_cell = 0x46b630 <channel_tls_write_packed_cell_method>, 
  write_var_cell = 0x46b4f0 <channel_tls_write_var_cell_method>,
  identity_digest = "�i��x �\205嫴t\\\025`\016z�uP", nickname = 0x0,
  next_with_same_id = 0x0, prev_with_same_id = 0x0,
  incoming_queue = 0x155a5f40, outgoing_queue = 0x0, cmux = 0x3700ee0,
  circ_id_type = CIRC_ID_TYPE_LOWER, next_circ_id = 25699,
  num_n_circuits = 0, num_p_circuits = 0, is_bad_for_new_circs = 1,
  is_client = 0, is_incoming = 0, is_local = 0, timestamp_client = 0,
  timestamp_drained = 0, timestamp_recv = 1349927458, timestamp_xmit = 0,
  timestamp_last_added_nonpadding = 1349927428, dirreq_id = 0,
  n_cells_recved = 1, n_cells_xmitted = 0}
(gdb) print chan->incoming_queue
$5 = (smartlist_t *) 0x155a5f40
(gdb) print *chan->incoming_queue
$6 = {list = 0x54d6090, num_used = 1, capacity = 16}

comment:4 Changed 7 years ago by nickm

That's not the same assertion you reported the first time, I think. The first assertion was at line 1913; this is at 1823 apparently? Or are the line numbers messed up?

comment:5 Changed 7 years ago by nickm

Priority: normalmajor

comment:6 Changed 7 years ago by andrea

I can repro this; still working on causation.

comment:7 Changed 7 years ago by andrea

Okay, I think I understand this one now. There are three places new channels get created: from circuitbuild.c in channel_connect_for_circuit(), which then immediately calls command_setup_channel() on them; incoming connections from a channel listener get passed to the handler, command_handle_incoming_channel() in command.c, which does the same. The other case is dirserv reachability testing dirserv_single_reachability_test() of dirserv.c, which doesn't install incoming cell handlers. Whoops! The channel gets a cell from the remote end that stays stuck forever and then closes with a non-empty queue.

Obvious fix: make it call command_setup_channel() too so those channels route cells to the right place. Query, though: if no handlers are available to drain the queue when closing the channel, is asserting in the case that it has leftover cells the right thing?

comment:8 Changed 7 years ago by andrea

Status: newneeds_review

See my bug7086 branch for a fix.

comment:9 Changed 7 years ago by arma

I haven't looked at the patch, but it seems to resolve the assert for me.

Now I still have several-per-second of

Oct 12 04:27:33.093 [warn] tor_addr_is_internal(): Bug: tor_addr_is_internal() called with a non-IP address of type 0

presumably one for each begindir request I get.

comment:10 Changed 7 years ago by andrea

Think I could get you to breakpoint that message and get me a stack trace? I haven't been able to repro that one.

comment:11 Changed 7 years ago by nickm

Status: needs_reviewnew

I merged athena's bug7086, as it seems cool.

A user on IRC reported the tor_addr_is_internal() thing yesterday happening in a release before this; it might not be athena's code. A stack trace would help!

comment:12 Changed 7 years ago by nickm

I added 8586611718ee9c4cc81eb8fe50c937619393ade6 to try to help track down the tor_addr_is_internal() issue. It will add the filename and line number of the calling function to the log message.

comment:13 Changed 7 years ago by arma

Oct 12 13:29:39.295 [warn] tor_addr_is_internal_(): Bug: tor_addr_is_internal() called from src/or/config.c:1991 with a non-IP address of type 0

comment:14 Changed 7 years ago by arma

I turned it into an assert. Not a begindir after all!

#0  0x00007ffff69fa1b5 in *__GI_raise (sig=<value optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007ffff69fcfc0 in *__GI_abort () at abort.c:92
#2  0x00000000004dab9c in tor_addr_is_internal_ (addr=0x15d42144, 
    for_listening=<value optimized out>, filename=<value optimized out>, 
    lineno=<value optimized out>) at src/common/address.c:361
#3  0x000000000048beb5 in is_local_addr (addr=0x7745) at src/or/config.c:1991
#4  0x00000000004b632d in write_http_response_header_impl (conn=0x15d420f0, 
    length=<value optimized out>, type=<value optimized out>, encoding=0x0, 
    extra_headers=0x53e39c "X-Descriptor-Not-New: Yes\r\n", 
    cache_lifetime=<value optimized out>) at src/or/directory.c:2373
#5  0x00000000004bb758 in directory_handle_command_post (conn=0x15d420f0, 
    headers=0x5296680 "POST /tor/ HTTP/1.0\r\nContent-Length: 3254\r\nHost: 128.31.0.39:9101\r\n\r\n", 
    body=0x169e5b30 "router EntrP 202.125.215.10 4440 0 0\nplatform Tor 0.2.2.35 on Linux i686\nopt protocols Link 1 2 Circuit 1\npublished 2012-10-12 17:32:13\nopt fingerprint 31AA 6A93 FF40 ABD1 6165 DEBF 2248 4CAC 692E CAC"..., 
    body_len=3254) at src/or/directory.c:3408
#6  0x00000000004bbd5e in directory_handle_command (conn=0x15d420f0)
    at src/or/directory.c:3507
#7  0x00000000004bddbd in connection_dir_process_inbuf (conn=0x7745)
    at src/or/directory.c:2286
#8  0x000000000049b1c9 in connection_handle_read_impl (conn=0x15d420f0)
#9  connection_handle_read (conn=0x15d420f0) at src/or/connection.c:2787
#10 0x000000000040c066 in conn_read_callback (fd=<value optimized out>, 
    event=30533, _conn=0x6) at src/or/main.c:720
#11 0x00007ffff7733344 in event_base_loop () from /usr/lib/libevent-1.4.so.2
#12 0x000000000040a021 in do_main_loop () at src/or/main.c:1986
#13 0x000000000040a35d in tor_main (argc=<value optimized out>, 
    argv=0x7fffffffe708) at src/or/main.c:2694
#14 0x00007ffff69e6c8d in __libc_start_main (main=<value optimized out>, 
    argc=<value optimized out>, ubp_av=<value optimized out>, 
    init=<value optimized out>, fini=<value optimized out>, 
    rtld_fini=<value optimized out>, stack_end=0x7fffffffe6f8)
    at libc-start.c:228
#15 0x0000000000408809 in _start ()

comment:15 Changed 7 years ago by nickm

Okay, it's in a directory connection, and finding that conn->addr is unassigned. It's a connection that we accepted, since we are calling write_http_response_header.

But looking at connection_handler_listener_read(), I don't see how it can ever generate a directory connection that doesn't have conn->addr set -- the tor_addr_copy() on connection.c line 1254 seems pretty convincing.

Are we sure this didn't come from begindir? I guess it doesn't look that way from the call chain, or from the way that it's including an X-Descriptor-Not-New header.

It could be useful to see the output of 'print *conn' in write_http_response_header_impl.

comment:16 Changed 7 years ago by arma

(gdb) print *conn
$2 = {_base = {magic = 2575892462, state = 6 '\006', type = 9, purpose = 16,
    read_blocked_on_bw = 0, write_blocked_on_bw = 0,
    hold_open_until_flushed = 0, inbuf_reached_eof = 0, in_flushed_some = 0,
    linked = 1, reading_from_linked_conn = 1, writing_to_linked_conn = 0,
    active_on_link = 1, linked_conn_is_closed = 0, proxy_state = 0, s = -1,
    conn_array_index = 2448, read_event = 0x56def50, write_event = 0x5388a70,
    inbuf = 0x4224570, outbuf = 0x47957a0, outbuf_flushlen = 0,
    timestamp_lastread = 1350070200, timestamp_lastwritten = 1350070199,
    timestamp_created = 1350070199, socket_family = 0, addr = {family = 0,
      addr = {dummy_ = 0, in_addr = {s_addr = 0}, in6_addr = {__in6_u = {
            __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0,
              0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, port = 0,
    marked_for_close = 0, marked_for_close_file = 0x0,
    address = 0xc824360 "200.x.y.z:33779", linked_conn = 0x190d120,
    global_identifier = 5129}, requested_resource = 0x0, dirconn_direct = 0,
  dir_spool_src = DIR_SPOOL_NONE, router_purpose = 0 '\000',
  fingerprint_stack = 0x0, cached_dir = 0x0, cached_dir_offset = 0,
  zlib_state = 0x0, rend_data = 0x0,
  identity_digest = '\000' <repeats 19 times>, dirreq_id = 4}

comment:17 Changed 7 years ago by nickm

linked=1, it would appear. So, begindir.

comment:18 Changed 7 years ago by nickm

So, two options seem to be to always set addr on linked connections ("What could go wrong"?), or to look at the linked connection's address when deciding whether to report conn->_base.address in an X-Your-Address-Is header.

comment:19 in reply to:  18 Changed 7 years ago by arma

Replying to nickm:

So, two options seem to be to always set addr on linked connections ("What could go wrong"?), or to look at the linked connection's address when deciding whether to report conn->_base.address in an X-Your-Address-Is header.

I haven't been paying enough attention. Where did the bug get introduced?

comment:20 Changed 7 years ago by arma

Resolution: fixed
Status: newclosed

Shifting this other bug over to #7112, since 'moria1 asserts on startup' is solved.

Note: See TracTickets for help on using tickets.