#25125 closed defect (fixed)

KIST: Bug: scheduler_release_channel: Non-fatal assertion !(smartlist_pos(channels_pending, chan) == -1)

Reported by: dgoulet Owned by: dgoulet
Priority: High Milestone: Tor: 0.3.3.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: 032-backport, tor-sched
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Because of commit cb5654f300312a8f and adaf3e9b89f62d68 merged last night, we now see a new bug (benign as far as I can tell).

Here is the SCHED_BUG() on one of my test relay:

Feb 02 12:56:21.930 [warn] tor_bug_occurred_(): Bug: src/or/scheduler.c:648: scheduler_release_channel: Non-fatal assertion !(smartlist_pos(channels_pending, chan) == -1) failed. (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.930 [warn] Bug: Non-fatal assertion !(smartlist_pos(channels_pending, chan) == -1) failed in scheduler_release_channel at src/or/scheduler.c:648. Stack trace: (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.930 [warn] Bug:     /root/git/tor/src/or/tor(log_backtrace+0x42) [0x5579b56bb582] (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.930 [warn] Bug:     /root/git/tor/src/or/tor(tor_bug_occurred_+0xb9) [0x5579b56d6919] (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.931 [warn] Bug:     /root/git/tor/src/or/tor(scheduler_release_channel+0x1aa) [0x5579b55edb0a] (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.931 [warn] Bug:     /root/git/tor/src/or/tor(+0x10bd94) [0x5579b563fd94] (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.931 [warn] Bug:     /root/git/tor/src/or/tor(connection_handle_write+0x2e) [0x5579b564067e] (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.931 [warn] Bug:     /root/git/tor/src/or/tor(+0xbb229) [0x5579b55ef229] (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.931 [warn] Bug:     /root/git/tor/src/or/tor(+0xb8c8f) [0x5579b55ecc8f] (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.931 [warn] Bug:     /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5(event_base_loop+0x819) [0x7fcbd0fe94c9] (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.931 [warn] Bug:     /root/git/tor/src/or/tor(do_main_loop+0x25c) [0x5579b5587f0c] (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.931 [warn] Bug:     /root/git/tor/src/or/tor(tor_run_main+0x275) [0x5579b5589475] (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.931 [warn] Bug:     /root/git/tor/src/or/tor(tor_main+0x3a) [0x5579b5582b3a] (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.931 [warn] Bug:     /root/git/tor/src/or/tor(main+0x19) [0x5579b55828a9] (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.931 [warn] Bug:     /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fcbcff01830] (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)
Feb 02 12:56:21.931 [warn] Bug:     /root/git/tor/src/or/tor(_start+0x29) [0x5579b55828f9] (on Tor 0.3.3.1-alpha-dev 0fd91772b1ece9e0)

It appears that we can release a channel from within connection_handle_write(). At that point, the channel was popped from the pending list and within the scheduler loop, we are writing the channel connection's outbuf onto the wire so when it is release, this bug is triggered in scheduler_release_channel():

  if (chan->scheduler_state == SCHED_CHAN_PENDING) {
    if (SCHED_BUG(smartlist_pos(channels_pending, chan) == -1, chan)) {
      log_warn(LD_SCHED, "Scheduler asked to release channel %" PRIu64 " "
                         "but it wasn't in channels_pending",
               chan->global_identifier);

This doesn't affect tor in any ways and the current codeflow of channels suggest that the release should accept that the channel is not in the pending list even if the scheduler state is in PENDING.

The reason is that while in the scheduler loop, once the channel is popped from the pending list, we *have* to keep it in PENDING state (see #24700) until the scheduler is done with it and decides to put it back in the list or change its state. This is because a channel can be rescheduled in while it is in the scheduler loop which is *not* ideal for KIST but this is the way it is for now.

We have to do a quick fix for 032 and 033 but this should be resolved in 034 with #24554.

Child Tickets

Change History (4)

comment:1 Changed 16 months ago by dgoulet

Keywords: tor-sched added
Status: assignedneeds_review

Branch: bug25125_032_01

Been on my fast test relay for more than 30 min now, no bug no nothing. While before, this was hit much faster. I'll keep an eye on it but I do think that is the fix and we have a unit test.

comment:2 Changed 16 months ago by nickm

[should this / could this] look at sched_heap_idx [instead of / in addition to] smartlist_pos()?

comment:3 Changed 16 months ago by dgoulet

Indeed and it is also faster!

See fixup commit: 6b415382651d5a22

comment:4 Changed 16 months ago by nickm

Resolution: fixed
Status: needs_reviewclosed

squashed and merged to 0.3.2 and forward!

Note: See TracTickets for help on using tickets.