Opened 7 weeks ago

Closed 13 days ago

#23751 closed defect (fixed)

[warn] tor_bug_occurred_: Bug: src/common/buffers.c, etc.

Reported by: Felixix Owned by: dgoulet
Priority: Very High Milestone: Tor: 0.3.2.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-channel, tor-sched, tor-buffer
Cc: Actual Points:
Parent ID: #23993 Points:
Reviewer: Sponsor: SponsorV

Description

Duplicate to #23676 ?
But the buffers.c:690 seems new.

[warn] tor_bug_occurred_: Bug: src/common/buffers.c:690: buf_add: Non-fatal assertion !(buf->datalen >= INT_MAX - string_len) failed.
[warn] tor_bug_occurred_: Bug: src/or/scheduler_kist.c:406: update_socket_written: Non-fatal assertion !((!ent)) failed.
[warn] tor_bug_occurred_: Bug: src/or/scheduler_kist.c:374: socket_can_write: Non-fatal assertion !((!ent)) failed.

At the end of the log file's time the tor process stays at 100% cpu.
Server hang up and a reboot was necessary.
I jumped on 3.2.2a already.

Sep 21 23:14:06.010 [notice] Tor 0.3.2.1-alpha (git-290274dbb5428bc5) running on FreeBSD with Libevent 2.1.8-stable, OpenSSL LibreSSL 2.5.4, Zlib 1.2.8, Liblzma 5.2.2, and Libzstd 1.3.1.

Oct 02 02:14:18.000 [notice] Heartbeat: Tor's uptime is 10 days 2:59 hours, with 193846 circuits open. I've sent 6390.97 GB and received 6348.01 GB.
Oct 02 02:14:18.000 [notice] Circuit handshake stats since last time: 4007/4007 TAP, 132384/132384 NTor.
Oct 02 02:14:18.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 0 v3 connections, and 418760 v4 connections; and received 2215 v1 connections, 43847 v2 connections, 101525 v3 connections, and 746328 v4 connections.
Oct 02 02:31:24.000 [notice] Scheduler type KISTLite has been enabled.
Oct 02 02:58:02.000 [warn] tor_bug_occurred_: Bug: src/common/buffers.c:690: buf_add: Non-fatal assertion !(buf->datalen >= INT_MAX - string_len) failed. (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug: Non-fatal assertion !(buf->datalen >= INT_MAX - string_len) failed in buf_add at src/common/buffers.c:690. Stack trace: (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x11a2bf8 <log_backtrace+0x48> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x11bd82c <tor_bug_occurred_+0x10c> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x11a435b <buf_add+0x13b> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x1129572 <connection_write_to_buf_impl_+0x122> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x10f2d38 <channel_tls_update_marks+0x1068> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x10e78cb <channel_write_cell_generic_+0x23b> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x10e7ed8 <channel_write_packed_cell+0x38> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x109e20a <channel_flush_from_first_active_circuit+0x31a> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x10e8702 <channel_flush_some_cells+0xd2> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x10db0f6 <scheduler_can_use_kist+0x826> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x10d9912 <scheduler_init+0x152> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x801b37a96 <event_base_assert_ok_nolock_+0x926> at /usr/local/lib/libevent-2.1.so.6 (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x801b33cfe <event_base_loop+0x50e> at /usr/local/lib/libevent-2.1.so.6 (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x1072d15 <do_main_loop+0x575> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x1075119 <tor_main+0xe9> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x1070b69 <main+0x19> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x1070a61 <_start+0x1a1> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] write_to_buf failed on an orconn; notifying of error (fd 1935)
Oct 02 02:58:02.000 [warn] Scheduler asked to release channel 1994256 but it wasn't in channels_pending
Oct 02 02:58:02.000 [warn] tor_bug_occurred_: Bug: src/or/scheduler_kist.c:406: update_socket_written: Non-fatal assertion !((!ent)) failed. (Future instances of this warning will be silenced.) (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug: Non-fatal assertion !((!ent)) failed in update_socket_written at src/or/scheduler_kist.c:406. Stack trace: (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x11a2bf8 <log_backtrace+0x48> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x11bd82c <tor_bug_occurred_+0x10c> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x10db190 <scheduler_can_use_kist+0x8c0> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x10d9912 <scheduler_init+0x152> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x801b37a96 <event_base_assert_ok_nolock_+0x926> at /usr/local/lib/libevent-2.1.so.6 (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x801b33cfe <event_base_loop+0x50e> at /usr/local/lib/libevent-2.1.so.6 (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x1072d15 <do_main_loop+0x575> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x1075119 <tor_main+0xe9> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x1070b69 <main+0x19> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x1070a61 <_start+0x1a1> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] tor_bug_occurred_: Bug: src/or/scheduler_kist.c:374: socket_can_write: Non-fatal assertion !((!ent)) failed. (Future instances of this warning will be silenced.) (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug: Non-fatal assertion !((!ent)) failed in socket_can_write at src/or/scheduler_kist.c:374. Stack trace: (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x11a2bf8 <log_backtrace+0x48> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x11bd82c <tor_bug_occurred_+0x10c> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x10db340 <scheduler_can_use_kist+0xa70> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x10d9912 <scheduler_init+0x152> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x801b37a96 <event_base_assert_ok_nolock_+0x926> at /usr/local/lib/libevent-2.1.so.6 (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x801b33cfe <event_base_loop+0x50e> at /usr/local/lib/libevent-2.1.so.6 (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x1072d15 <do_main_loop+0x575> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x1075119 <tor_main+0xe9> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x1070b69 <main+0x19> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5)
Oct 02 02:58:02.000 [warn] Bug:     0x1070a61 <_start+0x1a1> at /usr/local/bin/tor (on Tor 0.3.2.1-alpha 290274dbb5428bc5) 

Cheers!

Child Tickets

Change History (14)

comment:1 Changed 7 weeks ago by dgoulet

Component: - Select a componentCore Tor/Tor
Keywords: tor-channel tor-sched tor-buffer added
Milestone: Tor: 0.3.2.x-final
Priority: MediumHigh

comment:2 Changed 7 weeks ago by dgoulet

From the stacktrace it seems a channel outbuf was filled up which triggered the assert when the limit was reached. It then released the channel but the scheduler couldn't find it in the pending list so it tried to free the channel and the two last BUG() are basically that. Bottom line, that channel was never in the scheduler loop.

This is very weird. The fact that the outbuf was completely full is worrying that is:

  1. Was it filled up by then the scheduler?
  2. Or we have a bug where in KISTLite mode we didn't allow somehow to let the channel write to kernel?

For (1), currently the only possible way I could find for a channel outbuf to be used without the scheduler knowing is during the connection handshake but that shouldn't fill up of INT_MAX data...

comment:3 Changed 7 weeks ago by pastly

Here's some thoughts.

  • Isn't the outbuf limited to like 32KB in size?
  • If the scheduler was what filled up the outbuf, would Vanilla done the same thing? Why or why not? Does Vanilla check a limit that KIST/KISTLite doesn't? Since Vanilla doesn't write outbuf->kernel, wouldn't it be more likely to hit this?
  • What was the state of the channel that got a full outbuf? Was it closing? Was data continuing to pour in and for some reason we were never able to write, and thus the outbuf filled up? How many scheduling rounds did it take for this to happen? Did it all happen in one?

Bah.

comment:4 Changed 7 weeks ago by nickm

Another possibility to consider is integer underflow of some kind.

comment:5 Changed 7 weeks ago by nickm

I'm a little suspicious of this stack trace. scheduler_init() shouldn't be getting called when Tor has already been running for 10 days; what is it doing there? And since when does scheduler_init call channel_flush_some_cells or update_socket_written?

comment:6 Changed 7 weeks ago by Felixix

The number of circuits for this particular tor process was pretty high (200k) for hours compared to other relays (typ <20k).
An increase in mbuf clusters usage could be observed after one day and having set the MaxMemInQueues higher.

I changed from
MaxMemInQueues 2 GB
to
MaxMemInQueues 4 GB
service tor reload

Oct 01 08:20:52.000 [notice] Received reload signal (hup). Reloading config and resetting internal state.
Oct 01 08:20:52.000 [notice] Read configuration file "/usr/local/etc/tor/torrc". 

The last 4 minutes (on minutes log base)

top -nCSjI

                      PID     JID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
2017-10-02_02-57-00:66451       5    256        6  20    0  3050M  2984M uwait   0  71.9H  11.87% tor
2017-10-02_02-58-00:66451       5    256        6  20    0  3054M  2986M uwait   3  71.9H  11.47% tor
2017-10-02_02-59-00:66451       5    256        6  20    0  3054M  2986M uwait   2  71.9H 100.00% tor
2017-10-02_03-00-00:66451       5    256        6  20    0  3054M  2986M uwait   2  71.9H 100.00% tor

netstat -m | grep "mbuf clusters in use"

2017-10-02_02-57-00: 2508/43038/45546/2035262 mbuf clusters in use (current/cache/total/max)
2017-10-02_02-58-00: 2552/42994/45546/2035262 mbuf clusters in use (current/cache/total/max)
2017-10-02_02-59-00: 19335/26211/45546/2035262 mbuf clusters in use (current/cache/total/max)
2017-10-02_03-00-00: 33636/11910/45546/2035262 mbuf clusters in use (current/cache/total/max)

comment:7 Changed 4 weeks ago by dgoulet

Sponsor: SponsorV

Tagging all scheduler (and KIST) related tickets for SponsorV

comment:8 Changed 4 weeks ago by dgoulet

Owner: set to dgoulet
Status: newaccepted

Taking the scheduler related tickets.

comment:9 Changed 4 weeks ago by dgoulet

Milestone: Tor: 0.3.2.x-finalTor: 0.3.3.x-final
Parent ID: #23993

The stacktrace is definitely not accurate.

I think the next move is to compile tor with -O0 -g and hope we can hit this again with an accurate stacktrace... Also, if we can get #23753 in 032 soon, it might help with more information.

Postponing to 033 because we can't do much. I'll parent this ticket as well with the ticket that is about refactoring the channel/scheduler code.

comment:10 Changed 2 weeks ago by Felixix

Hi everybody

It's back. Anything I can do ?

src/common/buffers.c:694: buf_add: Non-fatal assertion
src/or/scheduler_kist.c:405: update_socket_written: Non-fatal assertion
src/or/scheduler_kist.c:373: socket_can_write: Non-fatal assertion

Tor memory usage is about 5 GB according to:

MaxMemInQueues 5 GB

It's following and may-be connected to the new #24167.

Tor 0.3.2.3-alpha (git-023d756bfc04c244) running on FreeBSD with Libevent 2.1.8-stable, OpenSSL LibreSSL 2.5.4, Zlib 1.2.8, Liblzma 5.2.2, and Libzstd 1.3.2.
Nov 07 12:45:27.000 [notice] Heartbeat: Tor's uptime is 9 days 18:59 hours, with 53232 circuits open. I've sent 6686.25 GB and received 6617.58 GB.
Nov 07 12:45:27.000 [notice] Circuit handshake stats since last time: 41165/41165 TAP, 477460/477460 NTor.
Nov 07 12:45:27.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 3 v3 connections, and 309567 v4 connections; and received 2417 v1 connections, 51314 v2 connections, 124479 v3 connections, and 969907 v4 connections.
Nov 07 13:45:27.000 [notice] Heartbeat: Tor's uptime is 9 days 19:59 hours, with 345744 circuits open. I've sent 6721.03 GB and received 6653.43 GB.
Nov 07 13:45:27.000 [notice] Circuit handshake stats since last time: 47784/47784 TAP, 481668/481668 NTor.
Nov 07 13:45:27.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 3 v3 connections, and 310352 v4 connections; and received 2427 v1 connections, 51608 v2 connections, 125265 v3 connections, and 974807 v4 connections.
Nov 07 14:45:27.000 [notice] Heartbeat: Tor's uptime is 9 days 20:59 hours, with 460213 circuits open. I've sent 6756.65 GB and received 6688.85 GB.
Nov 07 14:45:27.000 [notice] Circuit handshake stats since last time: 91852/91852 TAP, 464777/464777 NTor.
Nov 07 14:45:27.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 3 v3 connections, and 310990 v4 connections; and received 2442 v1 connections, 51934 v2 connections, 126196 v3 connections, and 979302 v4 connections.
Nov 07 15:32:50.000 [warn] tor_bug_occurred_: Bug: src/common/buffers.c:694: buf_add: Non-fatal assertion !(buf->datalen >= INT_MAX - string_len) failed. (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug: Non-fatal assertion !(buf->datalen >= INT_MAX - string_len) failed in buf_add at src/common/buffers.c:694. Stack trace: (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x11a2cd8 <log_backtrace+0x48> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x11bd93c <tor_bug_occurred_+0x10c> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x11a446b <buf_add+0x13b> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x1129402 <connection_write_to_buf_impl_+0x122> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x10f2a98 <channel_tls_update_marks+0x1068> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x10e7a6b <channel_write_cell_generic_+0x23b> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x10e8078 <channel_write_packed_cell+0x38> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x109e1fa <channel_flush_from_first_active_circuit+0x31a> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x10e88a2 <channel_flush_some_cells+0xd2> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x10db2f6 <scheduler_can_use_kist+0x8e6> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x10d9a22 <scheduler_init+0x152> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x801b38a96 <event_base_assert_ok_nolock_+0x926> at /usr/local/lib/libevent-2.1.so.6 (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x801b34cfe <event_base_loop+0x50e> at /usr/local/lib/libevent-2.1.so.6 (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x1072eb5 <do_main_loop+0x575> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x10752a9 <tor_main+0xe9> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x1070d09 <main+0x19> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x1070c01 <_start+0x1a1> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] write_to_buf failed on an orconn; notifying of error (fd 3936)
Nov 07 15:32:50.000 [warn] Scheduler asked to release channel 4099239 but it wasn't in channels_pending
Nov 07 15:32:50.000 [warn] tor_bug_occurred_: Bug: src/or/scheduler_kist.c:405: update_socket_written: Non-fatal assertion !((!ent)) failed. (Future instances of this warning will be silenced.) (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug: Non-fatal assertion !((!ent)) failed in update_socket_written at src/or/scheduler_kist.c:405. Stack trace: (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x11a2cd8 <log_backtrace+0x48> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x11bd93c <tor_bug_occurred_+0x10c> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x10db390 <scheduler_can_use_kist+0x980> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x10d9a22 <scheduler_init+0x152> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x801b38a96 <event_base_assert_ok_nolock_+0x926> at /usr/local/lib/libevent-2.1.so.6 (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x801b34cfe <event_base_loop+0x50e> at /usr/local/lib/libevent-2.1.so.6 (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x1072eb5 <do_main_loop+0x575> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x10752a9 <tor_main+0xe9> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x1070d09 <main+0x19> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x1070c01 <_start+0x1a1> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] tor_bug_occurred_: Bug: src/or/scheduler_kist.c:373: socket_can_write: Non-fatal assertion !((!ent)) failed. (Future instances of this warning will be silenced.) (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug: Non-fatal assertion !((!ent)) failed in socket_can_write at src/or/scheduler_kist.c:373. Stack trace: (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x11a2cd8 <log_backtrace+0x48> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x11bd93c <tor_bug_occurred_+0x10c> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x10db5a0 <scheduler_can_use_kist+0xb90> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x10d9a22 <scheduler_init+0x152> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x801b38a96 <event_base_assert_ok_nolock_+0x926> at /usr/local/lib/libevent-2.1.so.6 (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x801b34cfe <event_base_loop+0x50e> at /usr/local/lib/libevent-2.1.so.6 (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x1072eb5 <do_main_loop+0x575> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x10752a9 <tor_main+0xe9> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x1070d09 <main+0x19> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:32:50.000 [warn] Bug:     0x1070c01 <_start+0x1a1> at /usr/local/bin/tor (on Tor 0.3.2.3-alpha 023d756bfc04c244)
Nov 07 15:45:27.000 [notice] Heartbeat: Tor's uptime is 9 days 21:59 hours, with 416645 circuits open. I've sent 6792.80 GB and received 6726.35 GB.
Nov 07 15:45:27.000 [notice] Circuit handshake stats since last time: 37107/37107 TAP, 467515/467515 NTor.
Nov 07 15:45:27.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 3 v3 connections, and 311516 v4 connections; and received 2456 v1 connections, 52297 v2 connections, 127116 v3 connections, and 983804 v4 connections.
Nov 07 16:45:27.000 [notice] Heartbeat: Tor's uptime is 9 days 23:00 hours, with 454325 circuits open. I've sent 6827.63 GB and received 6761.22 GB.
Nov 07 16:45:27.000 [notice] Circuit handshake stats since last time: 51144/51144 TAP, 453758/453758 NTor.
Nov 07 16:45:27.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 3 v3 connections, and 312143 v4 connections; and received 2465 v1 connections, 52689 v2 connections, 127881 v3 connections, and 988472 v4 connections.
Nov 07 17:45:27.000 [notice] Heartbeat: Tor's uptime is 9 days 23:59 hours, with 182463 circuits open. I've sent 6863.17 GB and received 6795.92 GB.
Nov 07 17:45:27.000 [notice] Circuit handshake stats since last time: 51863/51863 TAP, 443959/443960 NTor.
Nov 07 17:45:27.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 3 v3 connections, and 312886 v4 connections; and received 2480 v1 connections, 53077 v2 connections, 128666 v3 connections, and 993075 v4 connections.
Nov 07 18:45:27.000 [notice] Heartbeat: Tor's uptime is 10 days 1:00 hours, with 357060 circuits open. I've sent 6903.66 GB and received 6835.87 GB.
Nov 07 18:45:27.000 [notice] Circuit handshake stats since last time: 308305/308305 TAP, 451842/451843 NTor.
Nov 07 18:45:27.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 3 v3 connections, and 313518 v4 connections; and received 2492 v1 connections, 53508 v2 connections, 129458 v3 connections, and 997810 v4 connections.
Nov 07 19:45:27.000 [notice] Heartbeat: Tor's uptime is 10 days 1:59 hours, with 449990 circuits open. I've sent 6943.84 GB and received 6875.63 GB.
Nov 07 19:45:27.000 [notice] Circuit handshake stats since last time: 701582/701582 TAP, 481074/481074 NTor.
Nov 07 19:45:27.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 3 v3 connections, and 313975 v4 connections; and received 2504 v1 connections, 53900 v2 connections, 130209 v3 connections, and 1002495 v4 connections. 

Cheers !

comment:11 Changed 13 days ago by dgoulet

Milestone: Tor: 0.3.3.x-finalTor: 0.3.2.x-final

More information, putting back in 032 to investigate further.

comment:12 Changed 13 days ago by dgoulet

The above stracktrace again is not accurate. scheduler_can_use_kist() can *not* possibly call channel_flush_some_cells() so I'm not sure what to look at...

However, again, these two warnings are real and seems the outbuf was full leading to this error.

Nov 07 15:32:50.000 [warn] write_to_buf failed on an orconn; notifying of error (fd 3936)
Nov 07 15:32:50.000 [warn] Scheduler asked to release channel 4099239 but it wasn't in channels_pending

And here is what I think is happening:

channel_flush_some_cells() is called inside the scheduler loop (kist_scheduler_run()) and if a connection write fails with the assert() we see in the stacktrace above, connection_or_close_for_error() is ultimately called leading to a channel_change_state() to put the channel in CHANNEL_STATE_CLOSING which will release the scheduler_release_channel() making KIST scheduler free its socket table entry.

So when channel_flush_some_cells() returns, we try to update the socket info and this is triggered:

Bug: Non-fatal assertion !((!ent)) failed in update_socket_written

Now, we do check for a cell to be actually flushed before updating the socket table with if (flush_result > 0) { but in the case of a write error, how can that be? Good question, the answer is in channel_flush_from_first_active_circuit() which ignores the result of a write cell in the outbuf and always return that we did flush the cell:

    /* Now send the cell */
    channel_write_packed_cell(chan, cell);
    cell = NULL;

    /*
     * Don't packed_cell_free_unchecked(cell) here because the channel will
     * do so when it gets out of the channel queue (probably already did, in
     * which case that was an immediate double-free bug).
     */

    /* Update the counter */
    ++n_flushed;

Ok so the good news is that this bug is benign but annoying due to the stacktrace.

comment:13 Changed 13 days ago by dgoulet

Priority: HighVery High
Status: acceptedneeds_review

See branch: bug23751_032_01.

To be clear, the above is a bandaid that I'm not happy about but for 032 I don't think we have much choice. To propagate back up the connection write error, we would need a refactoring touching many subsystems.

This again piles on the importance of #23993. The channel subsystems needs a thorough cleanup/redesign.

Last thing, this can possibly lead to a use-after-free (I'm still uncertain if that can be triggered remotely) with the KIST scheduler so I'm bumping this fix to a higher priority.

comment:14 Changed 13 days ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Mergd to 0.3.2 and forward.

Note: See TracTickets for help on using tickets.