Opened 5 months ago

Closed 4 months ago

#23753 closed defect (implemented)

sched: Implement a SCHED_BUG() that prints more information

Reported by: dgoulet Owned by: dgoulet
Priority: Medium Milestone: Tor: 0.3.2.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-sched
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor: SponsorV

Description

Bug such as #23751 and #23676 are very difficult to diagnose or reproduce so once we hit a BUG(), we should try to extract as much information as we can from the channel and state of the scheduler.

Thus I propose SCHED_BUG().

Child Tickets

Change History (12)

comment:1 Changed 5 months ago by dgoulet

Status: newneeds_review

See branch: ticket23753_032_01.

Only 5 callsites of SCHED_BUG() for now but there are probably more places we could benefit from those.

comment:2 Changed 5 months ago by dgoulet

Oh and this is what the output looks like (I've intentionally trigger one):

Oct 03 14:34:41.153 [warn] tor_bug_occurred_(): Bug: src/or/scheduler_kist.c:119: channel_outbuf_length: Non-fatal assertion !((channel_tls_from_base((chan)))->conn != ((void *)0)) failed. (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Oct 03 14:34:41.165 [warn] Bug: Non-fatal assertion !((channel_tls_from_base((chan)))->conn != ((void *)0)) failed in channel_outbuf_length at src/or/scheduler_kist.c:119. Stack trace: (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Oct 03 14:34:41.165 [warn] Bug:     /usr/lib/x86_64-linux-gnu/libasan.so.4(+0x55810) [0x7f0a3ea65810] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Oct 03 14:34:41.165 [warn] Bug:     /home/dgoulet/Documents/git/tor/src/or/tor(log_backtrace+0x45) [0x5644509cdd05] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Oct 03 14:34:41.165 [warn] Bug:     /home/dgoulet/Documents/git/tor/src/or/tor(tor_bug_occurred_+0x12c) [0x564450a2cbfc] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Oct 03 14:34:41.165 [warn] Bug:     /home/dgoulet/Documents/git/tor/src/or/tor(+0x805d8d) [0x564450708d8d] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Oct 03 14:34:41.165 [warn] Bug:     /home/dgoulet/Documents/git/tor/src/or/tor(+0x7fa39d) [0x5644506fd39d] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Oct 03 14:34:41.165 [warn] Bug:     /usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(+0x1f8f5) [0x7f0a3e26a8f5] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Oct 03 14:34:41.165 [warn] Bug:     /usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(event_base_loop+0x5a7) [0x7f0a3e26b227] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Oct 03 14:34:41.165 [warn] Bug:     /home/dgoulet/Documents/git/tor/src/or/tor(do_main_loop+0x410) [0x5644505a3680] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Oct 03 14:34:41.165 [warn] Bug:     /home/dgoulet/Documents/git/tor/src/or/tor(tor_main+0x143d) [0x5644505a8efd] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Oct 03 14:34:41.165 [warn] Bug:     /home/dgoulet/Documents/git/tor/src/or/tor(main+0x1c) [0x5644505948ac] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Oct 03 14:34:41.165 [warn] Bug:     /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7f0a3c2301c1] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Oct 03 14:34:41.165 [warn] Bug:     /home/dgoulet/Documents/git/tor/src/or/tor(_start+0x2a) [0x5644505972ba] (on Tor 0.3.2.2-alpha-dev 9733102b55884039)
Oct 03 14:34:41.165 [warn] scheduler_bug_occurred(): Bug: Channel 2 in state open and scheduler state 1. Num cells on cmux: 0. Connection outbuf len: 2727. Num pending channels: 0. Channel in pending list: no (on Tor 0.3.2.2-alpha-dev 9733102b55884039)

comment:3 Changed 5 months ago by pastly

Looks good to me. I added a callsite in my ticket23753_032_01 (on gitweb this time).

Last edited 5 months ago by pastly (previous) (diff)

comment:4 Changed 5 months ago by dgoulet

Status: needs_reviewmerge_ready

ack lgtm.

pastly's branch ticket23753_032_01 is what should be considered for merge. Thanks!

comment:5 Changed 5 months ago by nickm

bcaa6b5159d67e4420e8db17a9fbfde760a69529 looks like it needs an if?

comment:6 Changed 5 months ago by nickm

Also, are we going to regret the transition away from IF_BUG_ONCE? Will this flood people's logs now?

comment:7 Changed 5 months ago by nickm

Status: merge_readyneeds_revision

comment:8 Changed 4 months ago by dgoulet

Sponsor: SponsorV

Tagging all scheduler (and KIST) related tickets for SponsorV

comment:9 Changed 4 months ago by dgoulet

Owner: set to dgoulet
Status: needs_revisionaccepted

comment:10 Changed 4 months ago by dgoulet

Status: acceptedneeds_revision

comment:11 in reply to:  6 Changed 4 months ago by dgoulet

Status: needs_revisionneeds_review

Replying to nickm:

Also, are we going to regret the transition away from IF_BUG_ONCE? Will this flood people's logs now?

(First, fixup commit added for comment:5)

I don't think so. All of them are highly unlikely to happen but if they do, the channel shouldn't get stuck in the KIST scheduler thus not being called over and over again.

Actually, that made me think about it and I think the IF_BUG_ONCE() is not that helpful that is if we hit it once, it is a problem but if we hit a bug non stop, we learn something more, which that something is stuck and that gives us a hint.

So that being said, I've added a ratelimit to the log_warn in scheduler_bug_occurred(). Let me know what you think.

Branch: ticket23753_032_02

comment:12 Changed 4 months ago by nickm

Resolution: implemented
Status: needs_reviewclosed

squashed and merging!

Note: See TracTickets for help on using tickets.