Opened 6 years ago

Closed 2 years ago

#9754 closed defect (worksforme)

Tor 0.2.4.17-rc crashed with failed assertion

Reported by: communicator Owned by:
Priority: High Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: unspecified
Severity: Normal Keywords: tor-relay
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

My Tor node is only a few weeks old. The stable version kept crashing every 1-2 days, so I updated to 0.2.4.17-rc. Now it crashes every 3-5 days. The newest crash looked like this:

Sep 16 21:14:07.000 [notice] Circuit handshake stats since last time: 1392050/1392134 TAP, 2074/2074 NTor.
Sep 16 21:55:12.000 [err] assert_buf_ok(): Bug: ../src/or/buffers.c:2488: assert_buf_ok: Assertion buf->datalen == 0 failed; aborting.
Sep 16 21:57:10.000 [notice] Tor 0.2.4.17-rc (git-36eb3e0da4c3a821) opening log file.

I'm running Debian Wheezy/64bit and got Tor from the experimental branch of ftp://ftp.debian.org/debian/

Child Tickets

Change History (22)

comment:1 Changed 6 years ago by nickm

Keywords: tor-relay added
Milestone: Tor: 0.2.4.x-final
Priority: normalmajor

A stack trace here, along with the results of "print *buf" at the point of failure, would help very much, if you can get one.

comment:2 Changed 6 years ago by arma

Since you're on Debian, look in /etc/default/tor for the 'ulimit -c unlimited' line, and uncomment it.

I believe the core file will show up in /var/lib/tor/

and then you'll want the tor-dbg deb to get the symbols so gdb can read it.

Feel free to ask if you need more help.

comment:3 in reply to:  2 Changed 6 years ago by communicator

Replying to arma:

Since you're on Debian, look in /etc/default/tor for the 'ulimit -c unlimited' line, and uncomment it.

I did this already a few crashes ago. But I can't find anything even remotly similar to a coredump anywhere on my server.

I believe the core file will show up in /var/lib/tor/

That directory is completely empty. The DataDirectory (I moved my Tor installation) contains only the expected files, but no cores. I searched the whole disk for files with names like *core* or *dump*.

Is there a global option to disable core dumps regardless of Tor's options?
Does Tor maybe need special privileges for AppArmor to write cores?

comment:4 Changed 6 years ago by nickm

Status: newneeds_information

Hm. I don't know this stuff, but we should really have a FAQ about it. I'll ask around.

comment:5 Changed 6 years ago by nickm

(Also, I wonder whether it's viable to try running your tor under valgrind.)

comment:6 Changed 6 years ago by nickm

I just asked the package maintainer, and he said:

12:05 < weasel> 1) uncomment 'ulimit -c unlimited' in /etc/default/tor
12:06 < weasel> 2) set DisableDebuggerAttachment 0
12:06 < weasel> 3) apt-get install tor-dbg
12:06 < weasel> wait until it dies
12:06 < weasel> 4) gdb /usr/bin/tor /var/lib/tor/core

comment:7 in reply to:  6 Changed 6 years ago by communicator

Maybe DisableDebuggerAttachment 0 does the trick. At least I didn't have that option set before. Unfortunately changing it killed the running tor instance:

Sep 19 18:20:18.000 [notice] Received reload signal (hup). Reloading config and resetting internal state.
Sep 19 18:20:18.000 [notice] Read configuration file "/usr/share/tor/tor-service-defaults-torrc".
Sep 19 18:20:18.000 [notice] Read configuration file "/etc/tor/torrc".
Sep 19 18:20:18.000 [warn] Failed to parse/validate config: While Tor is running, disabling DisableDebuggerAttachment is not allow
ed.
Sep 19 18:20:18.000 [err] Reading config failed--see warnings above. For usage, try -h.
Sep 19 18:20:18.000 [warn] Restart failed (config error?). Exiting.

So we'll have to wait another 3-5 days I guess.

comment:8 Changed 6 years ago by communicator

Well, it took almost five days this time. The good news is: now I have a core file. The bad news: Tor crashed without the failed assertion from last time.

Backtrace from gdb:

Core was generated by `/usr/bin/tor --defaults-torrc /usr/share/tor/tor-service-defaults-torrc --hush'.
Program terminated with signal 6, Aborted.
#0  0x00007f822b8ab1e5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f822b8ab1e5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f822b8ae398 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f822b8e67cb in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f822b8f0a26 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f822b8f17a3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f822d0b003f in circuitmux_detach_circuit (cmux=0x7f8235060760, circ=0x7f823354d160) at ../src/or/circuitmux.c:1061
#6  0x00007f822d0a84db in circuit_set_circid_chan_helper (circ=circ@entry=0x7f823354d160, direction=direction@entry=1,
    id=id@entry=0, chan=chan@entry=0x0) at ../src/or/circuitlist.c:142
#7  0x00007f822d0a8ab3 in circuit_set_p_circid_chan (circ=circ@entry=0x7f823354d160, id=id@entry=0, chan=chan@entry=0x0)
    at ../src/or/circuitlist.c:217
#8  0x00007f822d0bc893 in command_process_destroy_cell (chan=<optimized out>, cell=<optimized out>) at ../src/or/command.c:506
#9  command_process_cell (chan=<optimized out>, cell=0x7fff497c5870) at ../src/or/command.c:153
#10 0x00007f822d09da1b in channel_tls_handle_cell (cell=cell@entry=0x7fff497c5870, conn=conn@entry=0x7f82341fccb0)
    at ../src/or/channeltls.c:923
#11 0x00007f822d0ddc57 in connection_or_process_cells_from_inbuf (conn=0x7f82341fccb0) at ../src/or/connection_or.c:1972
#12 0x00007f822d0e0ef2 in connection_or_process_inbuf (conn=conn@entry=0x7f82341fccb0) at ../src/or/connection_or.c:483
#13 0x00007f822d0cc4c5 in connection_process_inbuf (conn=conn@entry=0x7f82341fccb0, package_partial=package_partial@entry=1)
    at ../src/or/connection.c:4001
#14 0x00007f822d0d265d in connection_handle_read_impl (conn=0x7f82341fccb0) at ../src/or/connection.c:2839
#15 connection_handle_read (conn=conn@entry=0x7f82341fccb0) at ../src/or/connection.c:2880
#16 0x00007f822d02e061 in conn_read_callback (fd=<optimized out>, event=<optimized out>, _conn=0x7f82341fccb0)
    at ../src/or/main.c:718
#17 0x00007f822c6a8ccc in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#18 0x00007f822d02e9f5 in do_main_loop () at ../src/or/main.c:1992
#19 0x00007f822d0301de in tor_main (argc=4, argv=0x7fff497c5fb8) at ../src/or/main.c:2708
#20 0x00007f822b897995 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#21 0x00007f822d02aabb in _start ()

Is this a different story?
Do you need further information (which gdb commands)?

Next crash coming in 3-5 days. ;-)

comment:9 Changed 6 years ago by nickm

Interesting!

The failing line seems to be:

      cmux->policy->free_circ_data(cmux,
                                   cmux->policy_data,
                                   circ,
                                   hashent->muxinfo.policy_data);

The most useful stuff would be the contents of cmux and circ at level 5, from that crash. So, in gdb:

 up 5
 p *cmux
 p *circ
 p *cmux->policy
 p *cmux->policy_data
 p hashent
 p *hashent

comment:10 in reply to:  9 Changed 6 years ago by communicator

Here you go:

Core was generated by `/usr/bin/tor --defaults-torrc /usr/share/tor/tor-service-defaults-torrc --hush'.
Program terminated with signal 6, Aborted.
#0  0x00007f822b8ab1e5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) up 5
#5  0x00007f822d0b003f in circuitmux_detach_circuit (cmux=0x7f8235060760, circ=0x7f823354d160) at ../src/or/circuitmux.c:1061
1061          cmux->policy->free_circ_data(cmux,
(gdb) p *cmux
$1 = {n_circuits = 46, n_active_circuits = 0, n_cells = 0, chanid_circid_map = 0x7f82336a42c0, active_circuits_head = 0x0,
  active_circuits_tail = 0x0, policy = 0x7f822d3d49c0, policy_data = 0x7f8232472860}
(gdb) p *circ
$2 = {magic = 2561392719, n_chan = 0x7f8239335120, n_circ_id = 2147556202, n_mux = 0x7f8234c82a70, n_chan_cells = {head = 0x0,
    tail = 0x0, n = 0, insertion_times = 0x0}, n_hop = 0x0, streams_blocked_on_n_chan = 0, streams_blocked_on_p_chan = 0,
  state = 3 '\003', purpose = 1 '\001', package_window = 1000, deliver_window = 1000, n_chan_create_cell = 0x0,
  timestamp_began = {tv_sec = 1380050459, tv_usec = 116624}, timestamp_created = {tv_sec = 1380050459, tv_usec = 116624},
  timestamp_dirty = 0, marked_for_close = 0, marked_for_close_file = 0x0, dirreq_id = 0, next = 0x7f8234690830,
  next_active_on_n_chan = 0x0, prev_active_on_n_chan = 0x0}
(gdb) p *cmux->policy
$3 = {alloc_cmux_data = 0x7f822d0b35a0 <ewma_alloc_cmux_data>, free_cmux_data = 0x7f822d0b24d0 <ewma_free_cmux_data>,
  alloc_circ_data = 0x7f822d0b3340 <ewma_alloc_circ_data>, free_circ_data = 0x7f822d0b20a0 <ewma_free_circ_data>,
  notify_circ_active = 0x7f822d0b2b10 <ewma_notify_circ_active>,
  notify_circ_inactive = 0x7f822d0b2800 <ewma_notify_circ_inactive>, notify_set_n_cells = 0,
  notify_xmit_cells = 0x7f822d0b2d40 <ewma_notify_xmit_cells>, pick_active_circuit = 0x7f822d0b25e0 <ewma_pick_active_circuit>}
(gdb) p *cmux->policy_data
$4 = {magic = 802730346}
(gdb) p hashent
$5 = (chanid_circid_muxinfo_t *) 0x7f8232ab8bd0
(gdb) p *hashent
$6 = {node = {hte_next = 0x0, hte_hash = 25845233}, chan_id = 7764721, circ_id = 2147613735, muxinfo = {cell_count = 0,
    direction = CELL_DIRECTION_IN, policy_data = 0x7f82365e2f90, mark = 0}}

comment:11 Changed 6 years ago by andrea

Hmmm. It's not entirely clear to me if that crash is happening at the call to cmux->policy->free_circ_data or in that function but screwing up the stack. That pointer looks okay, and the parameters look possibly okay [1]. The only thing in ewma_free_circ_data() that depends on anything else to not crash is tor_free(), so perhaps this could be a heap corruption bug.

Running it under valgrind is probably very helpful; it would also be nice to have a copy of the core dump, tor binary and torrc.

[1] But the circuit id in the hash entry (2147613735) doesn't match the one in circ (2147556202), and the magic number in circ indicates it's an or_circuit_t. This is probably the cmux for that circuit's reverse direction, but it'd be nice to see the circuit as an or_circuit_t to verify that p_circ_id is 2147613735.

comment:12 Changed 6 years ago by andrea

Also, this 36eb3e0da4c3a821 revision isn't our 0.2.4.17-rc, which is 00fb525b23cf070f. Where can I find whatever extra patches Debian is adding?

comment:13 Changed 6 years ago by nickm

Agreed; running it under valgrind would maybe get some useful results.

Is there any more debugging/logging code that would help figure out what's going on?

Also:

Where can I find whatever extra patches Debian is adding?

https://gitweb.torproject.org/debian/tor.git

comment:14 in reply to:  13 ; Changed 6 years ago by andrea

Replying to nickm:

Agreed; running it under valgrind would maybe get some useful results.

Is there any more debugging/logging code that would help figure out what's going on?

I think valgrind would be the most useful thing. Also, it'd be nice to see how this binary was built and if anything changes if built with -O0; since that line doesn't look like there's any very clear possible cause, and the circuitmux.c in the Debian repo matches ours, I'd be curious to see if the actual crash was elsewhere and the optimizer has confused things. Since so far the exact site of the crash seems nondeterministic, though, we may not see this again if the reporter re-runs with a different build.

Also:

Where can I find whatever extra patches Debian is adding?

https://gitweb.torproject.org/debian/tor.git

Thanks

comment:15 in reply to:  11 ; Changed 6 years ago by communicator

Replying to andrea:

Running it under valgrind is probably very helpful;

I'll be happy to try - do you have a "tor with valgrind 101" anywhere? ;-)

it would also be nice to have a copy of the core dump, tor binary and torrc.

What is the easiest way to share ~500MB? Would I kill this server if I attached a core that big?

comment:16 in reply to:  14 Changed 6 years ago by communicator

Replying to andrea:

Also, it'd be nice to see how this binary was built and if anything changes if built with -O0;

I built the binary using

export DEB_BUILD_OPTIONS='debug nostrip noopt' ;dpkg-buildpackage -rfakeroot -uc -b

to get the debug symbols. The reason I compiled Tor by hand in the first place was to make use of the optimized openssl-library. I didn't know of the botnet attacking Tor at the time and wanted to run 50Mbps on one Tor instance if possible.

The stock Debian versions of Tor and openssl crashed the same though.

comment:17 in reply to:  15 ; Changed 6 years ago by arma

Replying to communicator:

Would I kill this server if I attached a core that big?

Your core file likely includes sensitive keys (including your long-term relay identity key), as well as maybe sensitive client traffic. You should keep it secret.

comment:18 in reply to:  17 Changed 6 years ago by communicator

Replying to arma:

Your core file likely includes sensitive keys [...] You should keep it secret.

Good point.

Maybe the next core will help?
This time it took nearly 6 days to crash. My /var/log/syslog shows these lines:

Sep 30 17:11:37 communicator kernel: [906980.459109] do_general_protection: 66 callbacks suppressed
Sep 30 17:11:37 communicator kernel: [906980.459140] tor[22059] general protection ip:7fa9d532315c sp:7fff48e604f0 error:0 in tor[
7fa9d5293000+1b8000]

The stacktrace from gdb:

Core was generated by `/usr/bin/tor --defaults-torrc /usr/share/tor/tor-service-defaults-torrc --hush'.
Program terminated with signal 11, Segmentation fault.
#0  circuit_get_by_rend_token_and_purpose (purpose=purpose@entry=3 '\003',
    token=token@entry=0x7fff48e60aa0 "hS(i\252\272\026\267M\366\303<퉥\215iE\247\207", len=20) at ../src/or/circuitlist.c:1141
1141        if (! circ->marked_for_close &&
(gdb) bt
#0  circuit_get_by_rend_token_and_purpose (purpose=purpose@entry=3 '\003',
    token=token@entry=0x7fff48e60aa0 "hS(i\252\272\026\267M\366\303<퉥\215iE\247\207", len=20) at ../src/or/circuitlist.c:1141
#1  0x00007fa9d5325c75 in circuit_get_rendezvous (
    cookie=cookie@entry=0x7fff48e60aa0 "hS(i\252\272\026\267M\366\303<퉥\215iE\247\207") at ../src/or/circuitlist.c:1155
#2  0x00007fa9d52cf1f3 in rend_mid_establish_rendezvous (circ=0x7fa9dbf37890,
    request=request@entry=0x7fff48e60aa0 "hS(i\252\272\026\267M\366\303<퉥\215iE\247\207", request_len=request_len@entry=20)
    at ../src/or/rendmid.c:238
#3  0x00007fa9d52ce87c in rend_process_relay_cell (circ=circ@entry=0x7fa9dbf37890, layer_hint=layer_hint@entry=0x0, command=33,
    length=20, payload=payload@entry=0x7fff48e60aa0 "hS(i\252\272\026\267M\366\303<퉥\215iE\247\207")
    at ../src/or/rendcommon.c:1440
#4  0x00007fa9d52c5e7a in connection_edge_process_relay_cell (cell=cell@entry=0x7fff48e60a90, circ=circ@entry=0x7fa9dbf37890,
    conn=conn@entry=0x0, layer_hint=layer_hint@entry=0x0) at ../src/or/relay.c:1578
#5  0x00007fa9d52c7a71 in circuit_receive_relay_cell (cell=cell@entry=0x7fff48e60a90, circ=circ@entry=0x7fa9dbf37890,
    cell_direction=cell_direction@entry=CELL_DIRECTION_OUT) at ../src/or/relay.c:212
#6  0x00007fa9d533770c in command_process_relay_cell (chan=0x7fa9de927520, cell=0x7fff48e60a90) at ../src/or/command.c:465
#7  command_process_cell (chan=0x7fa9de927520, cell=0x7fff48e60a90) at ../src/or/command.c:149
#8  0x00007fa9d5318a1b in channel_tls_handle_cell (cell=cell@entry=0x7fff48e60a90, conn=conn@entry=0x7fa9dcf0a4d0)
    at ../src/or/channeltls.c:923
#9  0x00007fa9d5358c57 in connection_or_process_cells_from_inbuf (conn=0x7fa9dcf0a4d0) at ../src/or/connection_or.c:1972
#10 0x00007fa9d535bef2 in connection_or_process_inbuf (conn=conn@entry=0x7fa9dcf0a4d0) at ../src/or/connection_or.c:483
#11 0x00007fa9d53474c5 in connection_process_inbuf (conn=conn@entry=0x7fa9dcf0a4d0, package_partial=package_partial@entry=1)
    at ../src/or/connection.c:4001
#12 0x00007fa9d534d65d in connection_handle_read_impl (conn=0x7fa9dcf0a4d0) at ../src/or/connection.c:2839
#13 connection_handle_read (conn=conn@entry=0x7fa9dcf0a4d0) at ../src/or/connection.c:2880
#14 0x00007fa9d52a9061 in conn_read_callback (fd=<optimized out>, event=<optimized out>, _conn=0x7fa9dcf0a4d0)
    at ../src/or/main.c:718
#15 0x00007fa9d4923ccc in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#16 0x00007fa9d52a99f5 in do_main_loop () at ../src/or/main.c:1992
#17 0x00007fa9d52ab1de in tor_main (argc=4, argv=0x7fff48e611d8) at ../src/or/main.c:2708
#18 0x00007fa9d3b12995 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#19 0x00007fa9d52a5abb in _start ()

If anyone can give me pointers on how to use valgrind, I will try that.

comment:19 Changed 6 years ago by communicator

After an uptime of 40 days I think I can assume that the crashes have stopped.
The problem is: I didn't change a thing.

But the suspicious RSA-errors have also stopped. So maybe it was something those bots did? Maybe those bots sent unexpected input in some cases?

Anyways, I can't help with further information, because my Tor doesn't seem to crash anymore.

comment:20 Changed 6 years ago by arma

Milestone: Tor: 0.2.4.x-finalTor: unspecified

We still don't know what this is -- could be memory corruption, could be a heap corruption bug.

I opened #10817 so next time this happens to somebody, we're better prepared to tell them how to use Valgrind with their deb.

comment:21 Changed 6 years ago by arma

(I first thought that it was some alternate Tor client implementation that was tickling a bug, but now that I see three different places that it seems to have died at, I no longer think that.)

comment:22 Changed 2 years ago by nickm

Resolution: worksforme
Severity: Normal
Status: needs_informationclosed

This didn't wind up recurring, did it? Please reopen if so.

Note: See TracTickets for help on using tickets.