Bug such as #23751 (moved) and #23676 (moved) are very difficult to diagnose or reproduce so once we hit a BUG(), we should try to extract as much information as we can from the channel and state of the scheduler.
Thus I propose SCHED_BUG().
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items
0
Show closed items
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Linked items
0
Link issues together to show that they're related.
Learn more.
Oh and this is what the output looks like (I've intentionally trigger one):
Oct 03 14:34:41.153 [warn] tor_bug_occurred_(): Bug: src/or/scheduler_kist.c:119: channel_outbuf_length: Non-fatal assertion !((channel_tls_from_base((chan)))->conn != ((void *)0)) failed. (on Tor 0.3.2.2-alpha-dev 9733102b55884039)Oct 03 14:34:41.165 [warn] Bug: Non-fatal assertion !((channel_tls_from_base((chan)))->conn != ((void *)0)) failed in channel_outbuf_length at src/or/scheduler_kist.c:119. Stack trace: (on Tor 0.3.2.2-alpha-dev 9733102b55884039)Oct 03 14:34:41.165 [warn] Bug: /usr/lib/x86_64-linux-gnu/libasan.so.4(+0x55810) [0x7f0a3ea65810] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)Oct 03 14:34:41.165 [warn] Bug: /home/dgoulet/Documents/git/tor/src/or/tor(log_backtrace+0x45) [0x5644509cdd05] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)Oct 03 14:34:41.165 [warn] Bug: /home/dgoulet/Documents/git/tor/src/or/tor(tor_bug_occurred_+0x12c) [0x564450a2cbfc] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)Oct 03 14:34:41.165 [warn] Bug: /home/dgoulet/Documents/git/tor/src/or/tor(+0x805d8d) [0x564450708d8d] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)Oct 03 14:34:41.165 [warn] Bug: /home/dgoulet/Documents/git/tor/src/or/tor(+0x7fa39d) [0x5644506fd39d] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)Oct 03 14:34:41.165 [warn] Bug: /usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(+0x1f8f5) [0x7f0a3e26a8f5] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)Oct 03 14:34:41.165 [warn] Bug: /usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(event_base_loop+0x5a7) [0x7f0a3e26b227] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)Oct 03 14:34:41.165 [warn] Bug: /home/dgoulet/Documents/git/tor/src/or/tor(do_main_loop+0x410) [0x5644505a3680] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)Oct 03 14:34:41.165 [warn] Bug: /home/dgoulet/Documents/git/tor/src/or/tor(tor_main+0x143d) [0x5644505a8efd] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)Oct 03 14:34:41.165 [warn] Bug: /home/dgoulet/Documents/git/tor/src/or/tor(main+0x1c) [0x5644505948ac] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)Oct 03 14:34:41.165 [warn] Bug: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7f0a3c2301c1] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)Oct 03 14:34:41.165 [warn] Bug: /home/dgoulet/Documents/git/tor/src/or/tor(_start+0x2a) [0x5644505972ba] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)Oct 03 14:34:41.165 [warn] scheduler_bug_occurred(): Bug: Channel 2 in state open and scheduler state 1. Num cells on cmux: 0. Connection outbuf len: 2727. Num pending channels: 0. Channel in pending list: no (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Also, are we going to regret the transition away from IF_BUG_ONCE? Will this flood people's logs now?
(First, fixup commit added for comment:5)
I don't think so. All of them are highly unlikely to happen but if they do, the channel shouldn't get stuck in the KIST scheduler thus not being called over and over again.
Actually, that made me think about it and I think the IF_BUG_ONCE() is not that helpful that is if we hit it once, it is a problem but if we hit a bug non stop, we learn something more, which that something is stuck and that gives us a hint.
So that being said, I've added a ratelimit to the log_warn in scheduler_bug_occurred(). Let me know what you think.