Hundreds of thousands of "While fetching directory info, no running dirservers known"

changed milestone to %Tor: 0.2.3.x-final

added component::core tor/tor milestone::Tor: 0.2.3.x-final priority::high resolution::fixed status::closed tor-auth type::defect labels

Hm. The purpose is "fetch v2 networkstatus" and the function is directory_get_from_dirserver. That's getting invoked via update_v2_networkstatus_cache_downloads, or possibly via connection_dir_download_v2_networkstatus_failed.

I wonder if it's crucial here that maatuska is an authority, but not a v2 authority.

Looking at the code for update_v2_networkstatus_cache_downloads , it seems to be correctly rate-limiting itself with the "if (last_networkstatus_download_attempted + interval >= now)" check.

connection_dir_download_v2_networkstatus_failed can only get called when a download that makes it as far as getting a connection fails. And so far as I can tell, in the case where this warning appears, there is no actual connection made.

More investigation needed; a stack trace would help a lot.

Trac:
Milestone: N/A to Tor: 0.2.3.x-final
Priority: normal to major

I'm unable to produce a backtrace at this machine, at least now.

An extra data point: Sending USR1 to tor reveals 2.7k "Directory connections". Sorting by number of connections, the top 10 looks like this:

63 [ip8]:9030. 86 [ip7]:9030. 91 [ip6]:9030. 132 [ip5]:9030. 205 [ip4]:9030. 228 [ip3]:80. 386 [ip2]:0. 386 [ip2]:1. 464 [ip1]:80. 674 [ip0]:9030.

Note the port numbers 0 and 1. They are to the same IP address.

Are any of those IPs directory authorities? (That includes the old ides IP.) Does the USR1 dump tell you connection purpose?

One each to dizum and gabelmoo.

Yes, it says things like

[notice] Conn 3572 (socket 2789) type 9 (Directory), state 3 (client reading), created 63 secs ago and [notice] Conn 3581 (socket 949) type 4 (OR), state 8 (open), created 10 secs ago

Actually, that's type and state, not purpose. Fooey.

Sebastian reports some more interesting log messages at info:

10:58 < nickm> So, something is calling directory_get_from_dirserver over and over and over.
10:58 < nickm> And passing purpose == DIR_PURPOSE_FETCH_V2_NETWORKSTATUS to it
10:58 < nickm> there are only a couple of places that could happen, I think.
10:58 < nickm> (Please double-check me on all of this if you can, btw)
11:02 < nickm> The first place is indeed update_v2_networkstatus_cache_downloads, but unless I am missing some obvious sign error, that one should get rate-limited to happen not too often.
11:02 < nickm> the second possible place seems to be connection_dir_download_v2_networkstatus_failed.
11:03 < nickm> Assuming that I'm right about everything above, it has to be  connection_dir_download_v2_networkstatus_failed that's going crazy.
11:05 < nickm> That one can get called when a v2 networkstatus fetch gives us a non-200 status, or when a directory connection for a v2 networkstatus "fails".
11:07 < Sebastian> I have info logs
11:07 < Sebastian> in case they might help
11:07 < nickm> (IOW, when it's about to close without having got an answer (state < DIR_CONN_STATE_CLIENT_FINISHED), or when connection_connect() says -1.
11:07 < Sebastian> (I am in a lecture so I cannot really follow your explanations or the code in too close detail. I will do that when I'm out of here)
11:08 < nickm> Sebastian: ok
11:09 < nickm> Sebastian: it would help to know if there are 2-5 things in the log immediately preceding each "while fetching directory info..." message
11:19 < Sebastian> Apr 17 11:53:31.000 [warn] Weighted bandwidth is 0.000000 in node selection for rule weight as directory
11:20 < Sebastian> repeated 5 times
11:20 < Sebastian> nickm: ^
11:20 < nickm> Sebastian: seeing any "No roter found for "...
11:20 < nickm> *router
11:20 < nickm> or any "Deferring serverdesc fetch: all authorities are in use"
11:20 < Sebastian> karsten@tor-dirauth:~$ tail -n 1000000 gabelmoo/notice.log| grep "No router found for" | head
11:21 < Sebastian> karsten@tor-dirauth:~$ tail -n 1000000 gabelmoo/notice.log| grep "Deferring" | head
11:21 < Sebastian> both no output
11:21 < Sebastian> oh let me use info log
11:21 < Sebastian> sorry
11:21 < Sebastian> Apr 17 13:12:36.000 [info] directory_get_from_dirserver(): No router found for network-status fetch; falling back to dirserver list.
11:21 < nickm> hmmmmm
11:21 < Sebastian> here"
11:22 < Sebastian> :
11:22 < Sebastian> Apr 17 13:12:36.000 [info] router_pick_published_address(): Success: chose address '212.112.245.170'.
11:22 < Sebastian> Apr 17 13:12:36.000 [info] connection_dir_request_failed(): Giving up on directory server at '193.23.244.244'; retrying
11:22 < Sebastian> Apr 17 13:12:36.000 [info] connection_dir_client_reached_eof(): Received networkstatus objects (size 0) from server '85.24.215.13:9030'
11:22 < nickm> Is that happening all the time (or a lot of the time) before the repeated-message-in-question ?
11:22 < Sebastian> Apr 17 13:12:36.000 [info] router_pick_published_address(): Success: chose address '212.112.245.170'.
11:22 < Sebastian> Apr 17 13:12:36.000 [info] directory_get_from_dirserver(): No router found for network-status fetch; falling back to dirserver list.
11:22 < Sebastian> Apr 17 13:12:36.000 [notice] While fetching directory info, no running dirservers known. Will try again later. (purpose 5)
11:23 < Sebastian> yes
11:23 < Sebastian> $ tail -n 1000000 gabelmoo/info.log| grep -B 5 While | grep "No router found for" | wc -l
11:23 < Sebastian> 50943
11:23 < Sebastian> $ tail -n 1000000 gabelmoo/info.log| grep  While| wc -l50915
11:23 < nickm> oh heck.  There's the bug then.
11:24 < nickm> I think.
11:24 < nickm> Maybe
11:24 < nickm> or wait.
11:25 < nickm> yeah.  It launches a request, the requests don't get fulfilled for a few times; it marks all the v2_ns_dl_status thingies as failed so it should retry later ...
11:26 < nickm> no, that still doesn't really make sense.
11:27 < nickm> connection_dir_download_v2_networkstatus_failed just shouldn't be getting called unless there is some connection getting launched.
11:32 < nickm> Sebastian: Is the "received networkstatus objects (size 0)" also repeated a lot?  (The "size 0" specifically feels important)
11:34 < nickm> I note that 85.24.215.13 doesn't appear to be an authority; we shouldn't be asking them for v2 networkstatuses!
11:34 < Sebastian> $ tail -n 1000000 gabelmoo/info.log| grep  'received networkstatus' | wc -l
11:34 < Sebastian> 0
11:34 < nickm> capital R ?
11:35 < Sebastian> yup
11:35 < Sebastian> $ tail -n 1000000 gabelmoo/info.log| grep  'Received networkstatus' | grep "size 0" | wc -l
11:35 < Sebastian> 38867
11:35 < Sebastian> so, not as often
11:35 < Sebastian> but often
11:36 < nickm> Mind if I copy and paste our conversation onto the bug?
11:37 < Sebastian> go right ahead!

So my current guess is that we seem to be in a loop where after we have failed to get a v2 networkstatus from any of the authorities, we ask a non-authority for the v2 networkstatus. That fails, and we ask again. Over and over.

Possible fix in branch bug5635. Needs review! Probably needs testing too, but review is also crucial. Somebody should also walk through whether my analysis of this bug is correct.

Note to self: the

            get_via_tor = 1; /* last resort: try routing it via Tor */

later in that function seems vestigial; I don't think it can ever result in anything being fetched. I should open another bug for that.

Trac:
Status: new to needs_review

First comment is that this branch is built on top of a branch for #5095 (moved), that's probably a mistake, yes? Especially because that branch contains an unsquashed fixup! commit.

I buy your logic here, but I'll want to look it over once again tomorrow to see if we can end up calling the function over and over here. I'll try the patch in the meantime regardless, because I'm convinced it's correct currently.

(And by try the patch I mean use master with the top commit of your branch cherry-picked)

Patch looks plausible.

Pushed a less crazy version (same patch, cherry-picked to master) at bug5635_v2.

Replying to arma:

Patch looks plausible.

Great; merging it.

Trac:
Status: needs_review to closed
Resolution: N/A to fixed

Trac:
Keywords: N/A deleted, tor-auth added

Trac:
Component: Tor Directory Authority to Tor

closed

moved to tpo/core/tor#5635 (closed)

Hundreds of thousands of "While fetching directory info, no running dirservers known"

Child items 0

Activity