Opened 10 years ago

Last modified 7 years ago

#1150 closed defect (Fixed)

Tor crash on FreeBSD

Reported by: foundbug Owned by:
Priority: High Milestone:
Component: Core Tor/Tor Version: 0.2.1.20
Severity: Keywords:
Cc: foundbug, nickm, Sebastian, arma Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Tor crashes at random times, sometimes it will last days or even weeks, sometimes only a few hours until a crash.

This has been going on for a while over the past few months with various tor versions.
I just now got around to compiling it with debug symbols.

This is the gdb from the crash today, its my first crash since compiling with debug so im not sure if it always crashes in the same place, I will update the bug on the next crash.
I'm not sure if I am doing this right, let me know if you need me to provide some different info.

OS is FreeBSD 7.2 32bit.

# gdb /usr/local/bin/tor tor.core
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd"...
Core was generated by `tor'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libz.so.4...done.
Loaded symbols for /lib/libz.so.4
Reading symbols from /usr/local/lib/libevent-1.4.so.3...done.
Loaded symbols for /usr/local/lib/libevent-1.4.so.3
Reading symbols from /usr/local/lib/libssl.so.5...done.
Loaded symbols for /usr/local/lib/libssl.so.5
Reading symbols from /usr/local/lib/libcrypto.so.5...done.
Loaded symbols for /usr/local/lib/libcrypto.so.5
Reading symbols from /lib/libthr.so.3...done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /usr/lib/librt.so.1...done.
Loaded symbols for /usr/lib/librt.so.1
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0 set_streams_blocked_on_circ (circ=Variable "circ" is not available.
) at relay.c:1764
1764 relay.c: No such file or directory.

in relay.c

[New Thread 0x28501040 (LWP 100379)]
(gdb) bt
#0 set_streams_blocked_on_circ (circ=Variable "circ" is not available.
) at relay.c:1764
#1 0x080b5774 in connection_or_flush_from_first_active_circuit (conn=0x28dedbb0, max=1, now=1258328897) at relay.c:1832
#2 0x0807c73b in connection_or_flushed_some (conn=0x28dedbb0) at connection_or.c:293
#3 0x0806f3b4 in connection_flushed_some (conn=0x28dedbb0) at connection.c:2832
#4 0x080731ad in connection_handle_write (conn=0x28dedbb0, force=0) at connection.c:2385
#5 0x080ab249 in conn_write_callback (fd=164, events=4, _conn=0x28dedbb0) at main.c:488
#6 0x28187692 in event_base_loop () from /usr/local/lib/libevent-1.4.so.3
#7 0x281879c9 in event_loop () from /usr/local/lib/libevent-1.4.so.3
#8 0x080aaf49 in do_main_loop () at main.c:1444
#9 0x080ab167 in tor_main (argc=11, argv=0xbfbfec08) at main.c:2070
#10 0x080e8a02 in main (argc=Cannot access memory at address 0x0
) at tor_main.c:30

[Automatically added by flyspray2trac: Operating System: BSD]

Child Tickets

Change History (25)

comment:1 Changed 10 years ago by nickm

Were you seeing this in 0.2.1.19 too, or can you confirm it's new in 0.2.1.20?

comment:2 Changed 10 years ago by foundbug

I had regular crashes in 0.2.1.19 and previous versions, it started a few months ago, i'm not sure which version it started.
But I did not do core dumps on them.

comment:3 Changed 10 years ago by foundbug

Its crashed two more times since my last post.
Always in the same place.
I left the source code in the same folder this time so it doesn't complain about relay.c missing in gdb.

# gdb /usr/local/bin/tor tor.core
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd"...
Core was generated by `tor'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libz.so.4...done.
Loaded symbols for /lib/libz.so.4
Reading symbols from /usr/local/lib/libevent-1.4.so.3...done.
Loaded symbols for /usr/local/lib/libevent-1.4.so.3
Reading symbols from /usr/local/lib/libssl.so.5...done.
Loaded symbols for /usr/local/lib/libssl.so.5
Reading symbols from /usr/local/lib/libcrypto.so.5...done.
Loaded symbols for /usr/local/lib/libcrypto.so.5
Reading symbols from /lib/libthr.so.3...done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /usr/lib/librt.so.1...done.
Loaded symbols for /usr/lib/librt.so.1
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0 set_streams_blocked_on_circ (circ=Variable "circ" is not available.
) at relay.c:1764
1764 edge->edge_blocked_on_circ = block;
[New Thread 0x28501040 (LWP 100379)]
(gdb) bt
#0 set_streams_blocked_on_circ (circ=Variable "circ" is not available.
) at relay.c:1764
#1 0x080b5774 in connection_or_flush_from_first_active_circuit (conn=0x2887dae0, max=1, now=1258404710) at relay.c:1832
#2 0x0807c73b in connection_or_flushed_some (conn=0x2887dae0) at connection_or.c:293
#3 0x0806f3b4 in connection_flushed_some (conn=0x2887dae0) at connection.c:2832
#4 0x080731ad in connection_handle_write (conn=0x2887dae0, force=0) at connection.c:2385
#5 0x080ab249 in conn_write_callback (fd=125, events=4, _conn=0x2887dae0) at main.c:488
#6 0x28187692 in event_base_loop () from /usr/local/lib/libevent-1.4.so.3
#7 0x281879c9 in event_loop () from /usr/local/lib/libevent-1.4.so.3
#8 0x080aaf49 in do_main_loop () at main.c:1444
#9 0x080ab167 in tor_main (argc=11, argv=0xbfbfec08) at main.c:2070
#10 0x080e8a02 in main (argc=Cannot access memory at address 0x0
) at tor_main.c:30
(gdb)

comment:4 Changed 10 years ago by nickm

Promising! Can you see what the value of 'edge' is at level #0 ?

comment:5 Changed 10 years ago by foundbug

how do I do that?

comment:6 Changed 10 years ago by foundbug

is this what you mean?

(gdb) print edge
$1 = (edge_connection_t *) 0xaaaaaaaa

comment:7 Changed 10 years ago by nickm

Yes, thanks!

comment:8 Changed 10 years ago by nickm

So it looks like the bug is that connection is getting freed, but left attached to the list of connections on the
circuit. That's not supposed to be able to happen; the next step will be figuring out why it is. I'll stare
at the code a bit this afternoon and see if I can spot it

comment:9 Changed 10 years ago by arma

Does this happen on a Tor client, or an exit relay?

I'm guessing client since you didn't say exit relay, but I figure we should
be certain. :)

comment:10 Changed 10 years ago by arma

Scenario: this is a Tor client. It has an open OR conn, with an open
circuit on it, and one or more edge streams attached to that circuit,
and circ->streams_blocked_on_n_conn is true (meaning one of these edge
streams has too much stuff on it to write it all out onto the OR conn --
meaning you're uploading a lot of stuff).

The circuit then gets a truncated cell on that OR conn (one of the relays
in the circuit went away suddenly). We respond to the inbound truncated
cell by doing two things in parallel. A) We queue up an outbound destroy
cell (but don't flush it yet). B) We mark the circuit for close. That
causes it to call connection_edge_destroy() on each attached stream, which
marks each stream for close.

Then when we finish conn_read_callback(), we call
close_closeable_connections(). That decides to free some of the edge
connections.

Then later we flush the destroy cell onto the OR conn. The circuit is still
marked but not freed (we only free circuits once per second as part of
circuit_close_all_marked()). So when the destroy cell gets flushed, we call
connection_or_flushed_some() which calls
connection_or_flush_from_first_active_circuit() which learns that in fact
circ->streams_blocked_on_n_conn == 1. So it decides to unblock the p_streams
on that circuit, and that's where we die.

comment:11 Changed 10 years ago by arma

So the problem is that the edge stream is getting freed but it's still linked
from the circuit queue. We tried to unlink it when we freed the stream:
connection_unlink() calls connection_about_to_close_connection() which calls

circ = circuit_get_by_edge_conn(edge_conn);
if (circ)

circuit_detach_stream(circ, edge_conn);

but the problem is that circuit_get_by_edge_conn() just looks at
conn->on_circuit, and the last thing connection_edge_destroy() does is
set conn->on_circuit to NULL. So the stream can't remove itself from the
circuit list when it frees itself.

comment:12 Changed 10 years ago by arma

Option one: in _circuit_mark_for_close() when we're calling connection_edge_destroy()
on each stream, when we're done iterating, set or_circ->{p,n}_streams to NULL. Ta-da,
no more edge streams attached.

Option two: when we send a destroy cell down the circuit, mark the circuit inactive,
so we don't try to read from its streams, since that wouldn't work anyway.

I think option one is a clear win (is there any reason not to do it?), and option two
we could do if we wanted (unless there are asserts somewhere that trigger when
circ->streams_blocked_on_n_conn is true yet there aren't any streams) but it's not
critical.

comment:13 Changed 10 years ago by arma

And the next thing to wonder is where else do we set conn->on_circuit to NULL and
then expect it to be able to remove itself upon free?

$ grep on_circuit *.c|grep NULL
circuitlist.c: conn->on_circuit = NULL;
circuituse.c: conn->on_circuit = NULL;
connection_edge.c: conn->on_circuit = NULL;
dns.c: exitconn->on_circuit = NULL;
dns.c: exitconn->on_circuit = NULL;

I think we're ok in each of the other cases. (The one in circuitlist.c is
fun though :)

comment:14 Changed 10 years ago by arma

To make it concrete, here's my diff:

diff --git a/src/or/circuitlist.c b/src/or/circuitlist.c
index 5918bdd..dab9142 100644
--- a/src/or/circuitlist.c
+++ b/src/or/circuitlist.c
@@ -1097,6 +1097,7 @@ _circuit_mark_for_close(circuit_t *circ, int reason, int l

edge_connection_t *conn;
for (conn=or_circ->n_streams; conn; conn=conn->next_stream)

connection_edge_destroy(or_circ->p_circ_id, conn);

+ or_circ->n_streams = NULL;

while (or_circ->resolving_streams) {

conn = or_circ->resolving_streams;

@@ -1120,6 +1121,7 @@ _circuit_mark_for_close(circuit_t *circ, int reason, int l

edge_connection_t *conn;
for (conn=ocirc->p_streams; conn; conn=conn->next_stream)

connection_edge_destroy(circ->n_circ_id, conn);

+ ocirc->p_streams = NULL;

}


circ->marked_for_close = line;

Tom, if you're comfortable patching your Tor, please give this a go and let us know
if it survives better. :)

(Credit to "ABCD" on IRC for finding the root of the bug.)

comment:15 Changed 10 years ago by nickm

The diff looks like a clear win; it should go into 0.2.1.x and get merged into 0.2.2.x.

Let's leave this bug open, though, until

a) we hear back from Tom,

and b) I can go over the big list of on_circuit = NULL that you posted above and double-check.

comment:16 Changed 10 years ago by keb

(forwarding a comment for the record)
#tor [2009-11-21 12:47] ABCDE> I am vote for both fixes from #1150 (both Options),
it save a little bw if circ was closed, cells will be droped by other side anyway.
Funny but this bug exploitable against exit relay too, with modified client and a little magic.

comment:17 Changed 10 years ago by foundbug

I have applied the patch, I will report back if it crashes or not.

comment:18 Changed 10 years ago by arma

ABCDE: as for saving a bit of bandwidth if circ was closed, check out r9907
and r9913, from when Nick was implementing this cell queue stuff.

Nick, can you revisit your r9913 and see if you still believe it?

As for the bug being exploitable against exit relays too, good point. It
wouldn't be with a truncated cell of course, but there are other ways to
induce the circuit to close itself suddenly. Yet another release to get the
fix into 0.2.1.x.

(And to remind weasel about it if he wants the fix to go into his 0.2.0.x
packages.)

comment:19 Changed 10 years ago by foundbug

So far there have not been any crashes since I applied the patch.
Thanks for fixing it.

comment:20 Changed 9 years ago by Sebastian

Can we close this? arma?

comment:21 Changed 9 years ago by arma

We shouldn't close yet, because I'm waiting for nickm to revisit r9913
and decide that he wants to change his mind about the logic there.

Nick?

comment:22 Changed 9 years ago by nickm

I'm not seeing the rationale behind dropping r9913 because "cells will be dropped by other side anyway". This could
be true for _some_ cases where we call connection_or_send_destroy, but is it really true of all of them?

comment:23 Changed 9 years ago by arma

Check out bug 1184. I was smart enough to make a separate bug report for the
separate bug, but then I forgot about it. It even answers Nick's question above.

I'm going to close this bug report as fixed, and leave the other one.

comment:24 Changed 9 years ago by arma

flyspray2trac: bug closed.

comment:25 Changed 7 years ago by nickm

Component: Tor ClientTor
Note: See TracTickets for help on using tickets.