Opened 3 years ago

Closed 3 years ago

#20203 closed defect (fixed)

tor_assertion_failed on low memory

Reported by: tmpname0901 Owned by: nickm
Priority: High Milestone: Tor: 0.2.8.x-final
Component: Core Tor/Tor Version: Tor: 0.2.8.7
Severity: Major Keywords: assert 028-backpoort TorCoreTeam201609
Cc: Actual Points: .3
Parent ID: Points:
Reviewer: Sponsor:

Description

Tor crashed with the output below. This was seen on a VMware VPS, running on a fully-updated 64-bit CentOS v6.8 system with 1GB of RAM. No indication of low memory was seen in the system logs, and no swap space was apparently used by the system.


Sep 03 16:27:12.000 [notice] We're low on memory. Killing circuits with over-long queues. (This behavior is controlled by MaxMemInQueues.)
Sep 03 16:27:12.000 [notice] Removed 782124800 bytes by killing 3 circuits; 1120 circuits remain alive. Also killed 0 non-linked directory connections.
Sep 03 16:27:12.000 [err] tor_assertion_failed_(): Bug: src/or/relay.c:2618: channel_flush_from_first_active_circuit: Assertion queue->n > 0 failed; aborting. (on Tor 0.2.8.7 )
Sep 03 16:27:12.000 [err] Bug: Assertion queue->n > 0 failed in channel_flush_from_first_active_circuit at src/or/relay.c:2618. Stack trace: (on Tor 0.2.8.7 )
Sep 03 16:27:12.000 [err] Bug: /usr/bin/tor(log_backtrace+0x42) [0x7fbe9d5e89f2] (on Tor 0.2.8.7 )
Sep 03 16:27:12.000 [err] Bug: /usr/bin/tor(tor_assertion_failed_+0xa3) [0x7fbe9d5f9663] (on Tor 0.2.8.7 )
Sep 03 16:27:12.000 [err] Bug: /usr/bin/tor(channel_flush_from_first_active_circuit+0x40b) [0x7fbe9d50c25b] (on Tor 0.2.8.7 )
Sep 03 16:27:12.000 [err] Bug: /usr/bin/tor(channel_flush_some_cells+0x103) [0x7fbe9d55a773] (on Tor 0.2.8.7 )
Sep 03 16:27:12.000 [err] Bug: /usr/bin/tor(scheduler_run+0x115) [0x7fbe9d54beb5] (on Tor 0.2.8.7 )
Sep 03 16:27:12.000 [err] Bug: /usr/bin/tor(+0x10d1c7) [0x7fbe9d54c1c7] (on Tor 0.2.8.7 )
Sep 03 16:27:12.000 [err] Bug: /usr/bin/tor(event_base_loop+0x53c) [0x7fbe9d655aec] (on Tor 0.2.8.7 )
Sep 03 16:27:12.000 [err] Bug: /usr/bin/tor(do_main_loop+0x3d9) [0x7fbe9d4ecf59] (on Tor 0.2.8.7 )
Sep 03 16:27:12.000 [err] Bug: /usr/bin/tor(tor_main+0xde5) [0x7fbe9d4edfc5] (on Tor 0.2.8.7 )
Sep 03 16:27:12.000 [err] Bug: /usr/bin/tor(main+0x19) [0x7fbe9d4ea019] (on Tor 0.2.8.7 )
Sep 03 16:27:12.000 [err] Bug: /lib64/libc.so.6(libc_start_main+0xfd) [0x7fbe9c613d1d] (on Tor 0.2.8.7 )
Sep 03 16:27:12.000 [err] Bug: /usr/bin/tor(+0xaaf29) [0x7fbe9d4e9f29] (on Tor 0.2.8.7 )

Child Tickets

Change History (20)

comment:1 Changed 3 years ago by cypherpunks

Component: Core TorCore Tor/Tor

comment:2 Changed 3 years ago by nickm

Looks like we're getting something wrong with the circuit's active status when we kill it and remove its cells.

comment:3 Changed 3 years ago by nickm

Keywords: 028-backpoort added
Milestone: Tor: 0.2.8.x-finalTor: 0.2.9.x-final

comment:4 Changed 3 years ago by cypherpunks

Circuit was marked for closed, but still active. While channel_flush_from_first_active_circuit() assume active circuit == non closed. It's about circuits_pending_close and race condition.

comment:5 Changed 3 years ago by cypherpunks

Removed 782124800 bytes by killing 3 circuits; 1120 circuits remain alive.

CompactDisc (700MB) filled over only 3 circuits? Doesn't sound sane. Remotely triggered assert crash.

comment:6 Changed 3 years ago by cypherpunks

it's probably need more attention, priority and severity.

comment:7 Changed 3 years ago by nickm

Priority: MediumHigh

comment:8 Changed 3 years ago by nickm

I think the problem is that marked_circuit_free_cells doesn't remove the circuit from the active list. One solution here is to just not do that assert, and allow channel_flush_from_first_active_circuit to skip empty circuits. Another is to make the circuit inactive when we remove its cells. Both might be a good idea.

comment:9 Changed 3 years ago by nickm

My branch bug20203_027 has a possible fix that takes both approaches.

comment:10 Changed 3 years ago by nickm

Actual Points: .2
Status: newneeds_review

comment:11 Changed 3 years ago by nickm

Keywords: TorCoreTeam201609 added

comment:12 Changed 3 years ago by cypherpunks

bugfix on 0.2.4.14-alpha

Might be yes. But 0.2.7.x and older actually calls circuitmux_detach_circuit by circuit_mark_for_close thus can't fail assertion (this one reported) after OOM-killer.

EDIT: Not sure that before 0.2.8.x Tor was immune for some another fail assertions after running OOM-killer

Last edited 3 years ago by cypherpunks (previous) (diff)

comment:13 Changed 3 years ago by cypherpunks

+    if (/*BUG(*/ queue->n <= 0 /*)*/) {
+      log_warn(LD_BUG, "Found a supposedly active circuit with no cells "
+               "to send. Trying to recover.");
+      circuitmux_set_num_cells(cmux, circ, queue->n);
+      circuit_mark_for_close(circ, END_CIRC_REASON_INTERNAL);
+      continue;
+    }

This code will trigger one extra bug warning.
If circuit wasn't marked for close yes (if some another bug), it will report:
"Called on non-marked circuit"
and skip cmux signaling and recover.

If circuit was marked for close already, it will report:

"Duplicate call to circuit_mark_for_close at" ...

is it ok?

Last edited 3 years ago by cypherpunks (previous) (diff)

comment:14 Changed 3 years ago by nickm

I've tried to fix both of those issues; better now?

comment:15 Changed 3 years ago by cypherpunks

better now?

seems so

comment:16 Changed 3 years ago by dgoulet

DG1: This seems "dangerous" to me as we pass a possible negative int as an unsigned int leading to setting a very large number of cells into the cmux:

circuitmux_set_num_cells(cmux, circ, queue->n);

Apart from the above, logic and code looks good to me.

Although, I'm not really familiar with this subsystem, a review from andrea would be desirable.

comment:17 Changed 3 years ago by nickm

Added a fixup commit there, and dialed back the protection a little.

comment:18 Changed 3 years ago by nickm

oh! Andrea informs me she acked this on IRC the other night. So, time to merge!

comment:19 Changed 3 years ago by nickm

Actual Points: .2.3
Milestone: Tor: 0.2.9.x-finalTor: 0.2.8.x-final
Owner: set to nickm
Status: needs_reviewaccepted

Merged.

comment:20 Changed 3 years ago by nickm

Resolution: fixed
Status: acceptedclosed
Note: See TracTickets for help on using tickets.