Opened 3 weeks ago

Last modified 5 days ago

#32868 needs_information defect

crash: Assertion node->rs->is_possible_guard failed in compute_weighted_bandwidths at

Reported by: toralf
Owned by: nickm
Priority: High
Milestone: Tor: 0.4.3.x-final
Component: Core Tor/Tor
Version: Tor: 0.4.2.5
Severity: Normal
Keywords: regression, crash, 035-backport, 040-backport, 041-backport, 042-backport, BugSmashFund, 043-can
Cc: mikeperry
Actual Points: 0.6
Parent ID:
Points: 1
Reviewer: dgoulet
Sponsor:

Description

I monitor the status of my 2 Tor relays with https://github.com/toralf/torutils/blob/f307866d2149bcd3c95df5269b3b48d051f871e9/info.py. A day earlier I had commented out this section in torrc:

#FetchDirInfoEarly 1
#FetchDirInfoExtraEarly 1
#FetchUselessDescriptors 1
#UseMicrodescriptors 0
#DownloadExtraInfo 1

Now the script takes 7 minutes to finish (it was much faster when those config values were still set).

More interestingly, I realized today that one of the two Tor relays had crashed:

Jan 01 11:12:38.000 [warn] compute_weighted_bandwidths(): Bug: Consensus is missing some bandwidths. Using a naive router selection algorithm (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] tor_assertion_failed_(): Bug: src/feature/nodelist/node_select.c:723: compute_weighted_bandwidths: Assertion node->rs->is_possible_guard failed; aborting. (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug: Tor 0.4.2.5: Assertion node->rs->is_possible_guard failed in compute_weighted_bandwidths at src/feature/nodelist/node_select.c:723: . Stack trace: (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/bin/tor(log_backtrace_impl+0x59) [0x55850995e1a9] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/bin/tor(tor_assertion_failed_+0x150) [0x5585099593e0] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/bin/tor(+0x12c474) [0x558509896474] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/bin/tor(frac_nodes_with_descriptors+0x73) [0x558509897373] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/bin/tor(+0x12afb2) [0x558509894fb2] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/bin/tor(count_loading_descriptors_progress+0x6c) [0x558509895b2c] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/bin/tor(connection_dir_reached_eof+0x20e0) [0x558509856f30] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/bin/tor(connection_handle_read+0x980) [0x5585097d2760] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/bin/tor(+0x6d97c) [0x5585097d797c] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/lib64/libevent-2.1.so.6(+0x219ca) [0x7ffbba3cb9ca] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/lib64/libevent-2.1.so.6(event_base_loop+0x4ef) [0x7ffbba3cc54f] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/bin/tor(do_main_loop+0xdd) [0x5585097d8b2d] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/bin/tor(tor_run_main+0x10c5) [0x5585097c6855] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/bin/tor(tor_main+0x46) [0x5585097c4076] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/bin/tor(main+0x19) [0x5585097c3c49] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /lib64/libc.so.6(__libc_start_main+0xeb) [0x7ffbb9d19eab] (on Tor 0.4.2.5 )
Jan 01 11:12:38.000 [err] Bug:     /usr/bin/tor(_start+0x2a) [0x5585097c3c9a] (on Tor 0.4.2.5 )

The system is a hardened Gentoo Linux running 2 relays; the second relay is not affected.

Child Tickets

Ticket   Status       Owner  Summary                                                    Component
#32961   merge_ready  teor   Backport the diagnostic logs for is_possible_guard crash   Core Tor/Tor

Change History (16)

comment:1 Changed 3 weeks ago by nickm

Keywords: regression crash added
Milestone: Tor: 0.4.2.x-final
Priority: Medium → High

comment:2 Changed 2 weeks ago by nickm

Cc: mikeperry added

Did the crash happen in response to running the info.py script, or did it happen a little after?

Can you include the rest of your configuration file as well, in case it has something that would help reproduce or explain this bug?

The simplest solution here would be to use IF_BUG_ONCE instead of tor_assert(), so that the error message wouldn't come with a crash. But really, I'd rather find out what the bug is and fix it directly.
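
For illustration, a minimal sketch of that option, assuming the check sits in the guardfraction branch of compute_weighted_bandwidths(); the fallback (ignoring the guardfraction data for that node) and the local variable are assumptions for this sketch, not a proposed patch:

/* Sketch only: warn once instead of aborting. Falling back to the
 * node's unadjusted bandwidth is an assumed recovery path, not the
 * actual fix. */
int use_guardfraction = 1; /* hypothetical local, for illustration */
IF_BUG_ONCE(! node->rs->is_possible_guard) {
  use_guardfraction = 0;
}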

Adding mikeperry to the ccs, since I believe he knows this code well.

comment:3 in reply to:  2 Changed 2 weeks ago by toralf

Replying to nickm:

Did the crash happen in response to running the info.py script, or did it happen a little after?

A little after. I just mentioned it because I do not know whether pressing Ctrl-C on a running info.py script could have affected any cached files in Tor's data directory.

Can you include the rest of your configuration file as well, in case it has something that would help reproduce or explain this bug?

Sure, here you go:

mr-fox /etc/tor # cat torrc
# torrc
#
PIDFile /var/run/tor/tor.pid
DataDirectory /var/lib/tor/data

Nickname zwiebeltoralf
Address 5.9.158.75
OutboundBindAddress 5.9.158.75
OutboundBindAddress [2a01:4f8:190:514a::2]
DirPort 80
ORPort  443
DirPort [2a01:4f8:190:514a::2]:80   NoAdvertise
ORPort  [2a01:4f8:190:514a::2]:443

ControlPort 9051

Log notice file /tmp/notice.log
Log warn   file /var/log/tor/warn.log

%include /etc/tor/torrc.d/

mr-fox /etc/tor # cat torrc.d/*
User tor

SandBox 1

OfflineMasterKey 1

MyFamily 63BF46A63F9C21FD315CD061B3EAA3EB05283A0A,509EAB4C5D10C9A9A24B4EA0CE402C047A2D64E6

ContactInfo replace k with c : kontakt @ zwiebeltoralf . de
DirPortFrontPage /etc/tor/tor-exit-notice_DE.html

# used by Nyx
#
CookieAuthentication 1

# https://stem.torproject.org/tutorials/mirror_mirror_on_the_wall.html
#
FetchDirInfoEarly 1
FetchDirInfoExtraEarly 1
FetchUselessDescriptors 1
UseMicrodescriptors 0
DownloadExtraInfo 1

SocksPort 0

AvoidDiskWrites 1

ExitRelay 1
IPv6Exit 1

ExitPolicy reject     1.1.1.1:53
ExitPolicy reject     4.2.2.2:53
ExitPolicy reject     8.8.8.8:53
ExitPolicy reject     9.9.9.9:53
ExitPolicy reject     9.9.9.10:53

ExitPolicy reject     108.166.183.219/32:53
ExitPolicy reject     168.156.8.90/24:53
ExitPolicy reject     184.105.193.73/32:53

ExitPolicy reject     193.70.85.1:7777

# https://trac.torproject.org/projects/tor/wiki/doc/ReducedExitPolicy
ExitPolicy accept *:53        # DNS
ExitPolicy accept *:706       # SILC
ExitPolicy accept *:853       # DNS over TLS
ExitPolicy accept *:1194      # OpenVPN
ExitPolicy accept *:1533      # Sametime
ExitPolicy accept *:5222      # XMPP over SSL
ExitPolicy accept *:5223      # XMPP
ExitPolicy accept *:5269      # ejabberd s2s_in
ExitPolicy accept *:5280      # ejabberd admin
ExitPolicy accept *:6660-6669 # IRC
ExitPolicy accept *:6679      # IRC SSL
ExitPolicy accept *:6697      # IRC SSL
ExitPolicy accept *:7777      # ejabberd file service
ExitPolicy accept *:8074      # Gadu-Gadu
ExitPolicy accept *:8232-8233 # Zcash
ExitPolicy accept *:8332-8333 # Bitcoin
ExitPolicy accept *:9418      # git
ExitPolicy accept *:11371     # OpenPGP hkp (http keyserver protocol)
ExitPolicy accept *:50002     # Electrum Bitcoin SSL
ExitPolicy accept *:64738     # Mumble

ExitPolicy reject *:*

and

mr-fox /etc/tor # cat /etc/conf.d/tor
#
# Set the file limit
rc_ulimit="-n 50000"

torrc=/etc/tor/torrc

comment:4 Changed 6 days ago by nickm

Okay, after a few hours of searching through the code I still don't know what's going on.

We can only reach the assertion failure when a routerstatus entry in the consensus we're using has a guardfraction set on a node that is not in fact is_possible_guard.

But whenever we're parsing a consensus, we only set has_guardfraction when is_possible_guard is true.
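
To illustrate the invariant being described, here is a sketch only, not the actual consensus-parsing code; the local variable name is made up for this example:

/* Sketch of the parse-time invariant: guardfraction is only recorded
 * for routerstatuses that carry the Guard flag, so has_guardfraction
 * should always imply is_possible_guard. */
if (rs->is_possible_guard) {
  rs->has_guardfraction = 1;
  rs->guardfraction_percentage = parsed_percentage; /* hypothetical local */
}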

At first I thought that the issue here might be that we were looking at a vote instead of a consensus, and I searched for a long time for a way that this non-authority relay might be downloading votes and getting confused about how to treat them. I couldn't find such a way.

Alternatively, we could be clearing routerstatus_t.is_possible_guard after it was first set, but I can't find a place where we do that on a live routerstatus either.

There is probably something I'm missing here. I'll keep looking.

comment:5 Changed 6 days ago by nickm

Is it possible that AuthoritativeDir or V3AuthoritativeDir was set on this relay?

comment:6 Changed 6 days ago by nickm

Actual Points: 0.5
Points: 1

comment:7 in reply to:  5 Changed 6 days ago by toralf

Replying to nickm:

Is it possible that AuthoritativeDir or V3AuthoritativeDir was set on this relay?

No, not at all; the config hasn't been changed for a long time (except maybe the exit policy).

comment:8 Changed 6 days ago by nickm

Actual Points: 0.5 → 0.6
Owner: set to nickm
Status: new → accepted

Okay, I'm pretty well stymied. I'm not sure where this RS is coming from, or how it's getting in this state. I think my best bet for now is to write a diagnostic patch to disable the assertion here, and instead log what we can find out about the routerstatus.

Branch is: diagnose_32868_035
0.3.5 PR: https://github.com/torproject/tor/pull/1661
master PR: (for CI) https://github.com/torproject/tor/pull/1662

This won't fix the underlying problem, but I think we should consider merging it to help track the problem down.
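
For reference, a rough sketch of what such a diagnostic might look like (the real patch is in the diagnose_32868_035 branch linked above; the exact fields, helper, and wording logged here are assumptions for illustration):

/* Sketch only: demote the assertion to a logged bug and record what we
 * know about the offending routerstatus. */
if (BUG(! node->rs->is_possible_guard)) {
  log_warn(LD_BUG, "Routerstatus %s had guardfraction set (%u%%) "
           "but was not marked as a possible guard.",
           routerstatus_describe(node->rs),
           (unsigned) node->rs->guardfraction_percentage);
}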

comment:9 Changed 6 days ago by nickm

Status: accepted → needs_review

comment:10 Changed 6 days ago by dgoulet

Reviewer: dgoulet
Status: needs_review → merge_ready

lgtm; CI still running...

comment:11 in reply to:  8 Changed 6 days ago by toralf

Replying to nickm:

0.3.5 PR: https://github.com/torproject/tor/pull/1661
This won't fix the underlying problem, but I think we should consider merging it to help track the problem down.

I patched the 0.4.2.5 Tor sources on this Gentoo Linux system and restarted both Tor relays.
I observed a jump in the value "cswch/s" (total number of context switches per second), collected by sysstat, from 10,000 to 20,000 on this system. I wonder whether this is due to the debug code or completely unrelated.

comment:12 Changed 6 days ago by nickm

Almost certainly it's unrelated.

comment:13 Changed 5 days ago by teor

Keywords: 035-backport 040-backport 041-backport 042-backport added
Milestone: Tor: 0.4.2.x-final → Tor: 0.4.3.x-final
Status: merge_ready → needs_information

CI passed, merging the diagnostic to master.

Nick, how far did you want to backport the diagnostic, and how soon?
I'll open a child ticket to track the diagnostic backport.

comment:14 Changed 5 days ago by teor

Keywords: BugSmashFund added

comment:15 Changed 5 days ago by nickm

Let's try it in 042 if it doesn't break anything in master.

comment:16 Changed 5 days ago by nickm

Keywords: 043-can added