Opened 8 years ago

Closed 8 years ago

Last modified 7 years ago

#5635 closed defect (fixed)

Hundreds of thousands of "While fetching directory info, no running dirservers known"

Reported by: ln5 Owned by:
Priority: High Milestone: Tor: 0.2.3.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-auth
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

maatuska has seen 191680 lines with "[notice] While fetching directory
info, no running dirservers known. Will try again later. (purpose 5)"
so far today (in about 15h).

Child Tickets

Change History (13)

comment:1 Changed 8 years ago by nickm

Milestone: Tor: 0.2.3.x-final
Priority: normalmajor

Hm. The purpose is "fetch v2 networkstatus" and the function is directory_get_from_dirserver. That's getting invoked via update_v2_networkstatus_cache_downloads, or possibly via connection_dir_download_v2_networkstatus_failed.

I wonder if it's crucial here that maatuska is an authority, but not a v2 authority.

Looking at the code for update_v2_networkstatus_cache_downloads , it seems to be correctly rate-limiting itself with the "if (last_networkstatus_download_attempted + interval >= now)" check.

connection_dir_download_v2_networkstatus_failed can only get called when a download that makes it as far as getting a connection fails. And so far as I can tell, in the case where this warning appears, there is no actual connection made.

More investigation needed; a stack trace would help a lot.

comment:2 Changed 8 years ago by ln5

I'm unable to produce a backtrace at this machine, at least now.

An extra data point: Sending USR1 to tor reveals 2.7k "Directory
connections". Sorting by number of connections, the top 10 looks like
this:

63 [ip8]:9030.
86 [ip7]:9030.
91 [ip6]:9030.

132 [ip5]:9030.
205 [ip4]:9030.
228 [ip3]:80.
386 [ip2]:0.
386 [ip2]:1.
464 [ip1]:80.
674 [ip0]:9030.

Note the port numbers 0 and 1. They are to the same IP address.

comment:3 Changed 8 years ago by nickm

Are any of those IPs directory authorities? (That includes the old ides IP.) Does the USR1 dump tell you connection purpose?

comment:4 Changed 8 years ago by ln5

One each to dizum and gabelmoo.

Yes, it says things like

[notice] Conn 3572 (socket 2789) type 9 (Directory), state 3 (client reading), created 63 secs ago

and

[notice] Conn 3581 (socket 949) type 4 (OR), state 8 (open), created 10 secs ago

comment:5 Changed 8 years ago by nickm

Actually, that's type and state, not purpose. Fooey.

Sebastian reports some more interesting log messages at info:

10:58 < nickm> So, something is calling directory_get_from_dirserver over and over and over.
10:58 < nickm> And passing purpose == DIR_PURPOSE_FETCH_V2_NETWORKSTATUS to it
10:58 < nickm> there are only a couple of places that could happen, I think.
10:58 < nickm> (Please double-check me on all of this if you can, btw)
11:02 < nickm> The first place is indeed update_v2_networkstatus_cache_downloads, but unless I am missing some obvious sign error, that one should get rate-limited to happen not too often.
11:02 < nickm> the second possible place seems to be connection_dir_download_v2_networkstatus_failed.
11:03 < nickm> Assuming that I'm right about everything above, it has to be  connection_dir_download_v2_networkstatus_failed that's going crazy.
11:05 < nickm> That one can get called when a v2 networkstatus fetch gives us a non-200 status, or when a directory connection for a v2 networkstatus "fails".
11:07 < Sebastian> I have info logs
11:07 < Sebastian> in case they might help
11:07 < nickm> (IOW, when it's about to close without having got an answer (state < DIR_CONN_STATE_CLIENT_FINISHED), or when connection_connect() says -1.
11:07 < Sebastian> (I am in a lecture so I cannot really follow your explanations or the code in too close detail. I will do that when I'm out of here)
11:08 < nickm> Sebastian: ok
11:09 < nickm> Sebastian: it would help to know if there are 2-5 things in the log immediately preceding each "while fetching directory info..." message
11:19 < Sebastian> Apr 17 11:53:31.000 [warn] Weighted bandwidth is 0.000000 in node selection for rule weight as directory
11:20 < Sebastian> repeated 5 times
11:20 < Sebastian> nickm: ^
11:20 < nickm> Sebastian: seeing any "No roter found for "...
11:20 < nickm> *router
11:20 < nickm> or any "Deferring serverdesc fetch: all authorities are in use"
11:20 < Sebastian> karsten@tor-dirauth:~$ tail -n 1000000 gabelmoo/notice.log| grep "No router found for" | head
11:21 < Sebastian> karsten@tor-dirauth:~$ tail -n 1000000 gabelmoo/notice.log| grep "Deferring" | head
11:21 < Sebastian> both no output
11:21 < Sebastian> oh let me use info log
11:21 < Sebastian> sorry
11:21 < Sebastian> Apr 17 13:12:36.000 [info] directory_get_from_dirserver(): No router found for network-status fetch; falling back to dirserver list.
11:21 < nickm> hmmmmm
11:21 < Sebastian> here"
11:22 < Sebastian> :
11:22 < Sebastian> Apr 17 13:12:36.000 [info] router_pick_published_address(): Success: chose address '212.112.245.170'.
11:22 < Sebastian> Apr 17 13:12:36.000 [info] connection_dir_request_failed(): Giving up on directory server at '193.23.244.244'; retrying
11:22 < Sebastian> Apr 17 13:12:36.000 [info] connection_dir_client_reached_eof(): Received networkstatus objects (size 0) from server '85.24.215.13:9030'
11:22 < nickm> Is that happening all the time (or a lot of the time) before the repeated-message-in-question ?
11:22 < Sebastian> Apr 17 13:12:36.000 [info] router_pick_published_address(): Success: chose address '212.112.245.170'.
11:22 < Sebastian> Apr 17 13:12:36.000 [info] directory_get_from_dirserver(): No router found for network-status fetch; falling back to dirserver list.
11:22 < Sebastian> Apr 17 13:12:36.000 [notice] While fetching directory info, no running dirservers known. Will try again later. (purpose 5)
11:23 < Sebastian> yes
11:23 < Sebastian> $ tail -n 1000000 gabelmoo/info.log| grep -B 5 While | grep "No router found for" | wc -l
11:23 < Sebastian> 50943
11:23 < Sebastian> $ tail -n 1000000 gabelmoo/info.log| grep  While| wc -l50915
11:23 < nickm> oh heck.  There's the bug then.
11:24 < nickm> I think.
11:24 < nickm> Maybe
11:24 < nickm> or wait.
11:25 < nickm> yeah.  It launches a request, the requests don't get fulfilled for a few times; it marks all the v2_ns_dl_status thingies as failed so it should retry later ...
11:26 < nickm> no, that still doesn't really make sense.
11:27 < nickm> connection_dir_download_v2_networkstatus_failed just shouldn't be getting called unless there is some connection getting launched.
11:32 < nickm> Sebastian: Is the "received networkstatus objects (size 0)" also repeated a lot?  (The "size 0" specifically feels important)
11:34 < nickm> I note that 85.24.215.13 doesn't appear to be an authority; we shouldn't be asking them for v2 networkstatuses!
11:34 < Sebastian> $ tail -n 1000000 gabelmoo/info.log| grep  'received networkstatus' | wc -l
11:34 < Sebastian> 0
11:34 < nickm> capital R ?
11:35 < Sebastian> yup
11:35 < Sebastian> $ tail -n 1000000 gabelmoo/info.log| grep  'Received networkstatus' | grep "size 0" | wc -l
11:35 < Sebastian> 38867
11:35 < Sebastian> so, not as often
11:35 < Sebastian> but often
11:36 < nickm> Mind if I copy and paste our conversation onto the bug?
11:37 < Sebastian> go right ahead!

So my current guess is that we seem to be in a loop where after we have failed to get a v2 networkstatus from any of the authorities, we ask a non-authority for the v2 networkstatus. That fails, and we ask again. Over and over.

comment:6 Changed 8 years ago by nickm

Status: newneeds_review

Possible fix in branch bug5635. Needs review! Probably needs testing too, but review is also crucial. Somebody should also walk through whether my analysis of this bug is correct.

Note to self: the

            get_via_tor = 1; /* last resort: try routing it via Tor */

later in that function seems vestigial; I don't think it can ever result in anything being fetched. I should open another bug for that.

comment:7 Changed 8 years ago by Sebastian

First comment is that this branch is built on top of a branch for #5095, that's probably a mistake, yes? Especially because that branch contains an unsquashed fixup! commit.

I buy your logic here, but I'll want to look it over once again tomorrow to see if we can end up calling the function over and over here. I'll try the patch in the meantime regardless, because I'm convinced it's correct currently.

comment:8 Changed 8 years ago by Sebastian

(And by try the patch I mean use master with the top commit of your branch cherry-picked)

comment:9 Changed 8 years ago by arma

Patch looks plausible.

comment:10 Changed 8 years ago by nickm

Pushed a less crazy version (same patch, cherry-picked to master) at bug5635_v2.

comment:11 in reply to:  9 Changed 8 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Replying to arma:

Patch looks plausible.

Great; merging it.

comment:12 Changed 7 years ago by nickm

Keywords: tor-auth added

comment:13 Changed 7 years ago by nickm

Component: Tor Directory AuthorityTor
Note: See TracTickets for help on using tickets.