maatuska has seen 191680 lines with "[notice] While fetching directory
info, no running dirservers known. Will try again later. (purpose 5)"
so far today (in about 15h).
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items
0
Show closed items
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Linked items
0
Link issues together to show that they're related.
Learn more.
Hm. The purpose is "fetch v2 networkstatus" and the function is directory_get_from_dirserver. That's getting invoked via update_v2_networkstatus_cache_downloads, or possibly via connection_dir_download_v2_networkstatus_failed.
I wonder if it's crucial here that maatuska is an authority, but not a v2 authority.
Looking at the code for update_v2_networkstatus_cache_downloads , it seems to be correctly rate-limiting itself with the "if (last_networkstatus_download_attempted + interval >= now)" check.
connection_dir_download_v2_networkstatus_failed can only get called when a download that makes it as far as getting a connection fails. And so far as I can tell, in the case where this warning appears, there is no actual connection made.
More investigation needed; a stack trace would help a lot.
Trac: Milestone: N/Ato Tor: 0.2.3.x-final Priority: normal to major
[notice] Conn 3572 (socket 2789) type 9 (Directory), state 3 (client reading), created 63 secs ago
and
[notice] Conn 3581 (socket 949) type 4 (OR), state 8 (open), created 10 secs ago
Actually, that's type and state, not purpose. Fooey.
Sebastian reports some more interesting log messages at info:
10:58 < nickm> So, something is calling directory_get_from_dirserver over and over and over.10:58 < nickm> And passing purpose == DIR_PURPOSE_FETCH_V2_NETWORKSTATUS to it10:58 < nickm> there are only a couple of places that could happen, I think.10:58 < nickm> (Please double-check me on all of this if you can, btw)11:02 < nickm> The first place is indeed update_v2_networkstatus_cache_downloads, but unless I am missing some obvious sign error, that one should get rate-limited to happen not too often.11:02 < nickm> the second possible place seems to be connection_dir_download_v2_networkstatus_failed.11:03 < nickm> Assuming that I'm right about everything above, it has to be connection_dir_download_v2_networkstatus_failed that's going crazy.11:05 < nickm> That one can get called when a v2 networkstatus fetch gives us a non-200 status, or when a directory connection for a v2 networkstatus "fails".11:07 < Sebastian> I have info logs11:07 < Sebastian> in case they might help11:07 < nickm> (IOW, when it's about to close without having got an answer (state < DIR_CONN_STATE_CLIENT_FINISHED), or when connection_connect() says -1.11:07 < Sebastian> (I am in a lecture so I cannot really follow your explanations or the code in too close detail. I will do that when I'm out of here)11:08 < nickm> Sebastian: ok11:09 < nickm> Sebastian: it would help to know if there are 2-5 things in the log immediately preceding each "while fetching directory info..." message11:19 < Sebastian> Apr 17 11:53:31.000 [warn] Weighted bandwidth is 0.000000 in node selection for rule weight as directory11:20 < Sebastian> repeated 5 times11:20 < Sebastian> nickm: ^11:20 < nickm> Sebastian: seeing any "No roter found for "...11:20 < nickm> *router11:20 < nickm> or any "Deferring serverdesc fetch: all authorities are in use"11:20 < Sebastian> karsten@tor-dirauth:~$ tail -n 1000000 gabelmoo/notice.log| grep "No router found for" | head11:21 < Sebastian> karsten@tor-dirauth:~$ tail -n 1000000 gabelmoo/notice.log| grep "Deferring" | head11:21 < Sebastian> both no output11:21 < Sebastian> oh let me use info log11:21 < Sebastian> sorry11:21 < Sebastian> Apr 17 13:12:36.000 [info] directory_get_from_dirserver(): No router found for network-status fetch; falling back to dirserver list.11:21 < nickm> hmmmmm11:21 < Sebastian> here"11:22 < Sebastian> :11:22 < Sebastian> Apr 17 13:12:36.000 [info] router_pick_published_address(): Success: chose address '212.112.245.170'.11:22 < Sebastian> Apr 17 13:12:36.000 [info] connection_dir_request_failed(): Giving up on directory server at '193.23.244.244'; retrying11:22 < Sebastian> Apr 17 13:12:36.000 [info] connection_dir_client_reached_eof(): Received networkstatus objects (size 0) from server '85.24.215.13:9030'11:22 < nickm> Is that happening all the time (or a lot of the time) before the repeated-message-in-question ?11:22 < Sebastian> Apr 17 13:12:36.000 [info] router_pick_published_address(): Success: chose address '212.112.245.170'.11:22 < Sebastian> Apr 17 13:12:36.000 [info] directory_get_from_dirserver(): No router found for network-status fetch; falling back to dirserver list.11:22 < Sebastian> Apr 17 13:12:36.000 [notice] While fetching directory info, no running dirservers known. Will try again later. (purpose 5)11:23 < Sebastian> yes11:23 < Sebastian> $ tail -n 1000000 gabelmoo/info.log| grep -B 5 While | grep "No router found for" | wc -l11:23 < Sebastian> 5094311:23 < Sebastian> $ tail -n 1000000 gabelmoo/info.log| grep While| wc -l5091511:23 < nickm> oh heck. There's the bug then.11:24 < nickm> I think.11:24 < nickm> Maybe11:24 < nickm> or wait.11:25 < nickm> yeah. It launches a request, the requests don't get fulfilled for a few times; it marks all the v2_ns_dl_status thingies as failed so it should retry later ...11:26 < nickm> no, that still doesn't really make sense.11:27 < nickm> connection_dir_download_v2_networkstatus_failed just shouldn't be getting called unless there is some connection getting launched.11:32 < nickm> Sebastian: Is the "received networkstatus objects (size 0)" also repeated a lot? (The "size 0" specifically feels important)11:34 < nickm> I note that 85.24.215.13 doesn't appear to be an authority; we shouldn't be asking them for v2 networkstatuses!11:34 < Sebastian> $ tail -n 1000000 gabelmoo/info.log| grep 'received networkstatus' | wc -l11:34 < Sebastian> 011:34 < nickm> capital R ?11:35 < Sebastian> yup11:35 < Sebastian> $ tail -n 1000000 gabelmoo/info.log| grep 'Received networkstatus' | grep "size 0" | wc -l11:35 < Sebastian> 3886711:35 < Sebastian> so, not as often11:35 < Sebastian> but often11:36 < nickm> Mind if I copy and paste our conversation onto the bug?11:37 < Sebastian> go right ahead!
So my current guess is that we seem to be in a loop where after we have failed to get a v2 networkstatus from any of the authorities, we ask a non-authority for the v2 networkstatus. That fails, and we ask again. Over and over.
Possible fix in branch bug5635. Needs review! Probably needs testing too, but review is also crucial. Somebody should also walk through whether my analysis of this bug is correct.
Note to self: the
get_via_tor = 1; /* last resort: try routing it via Tor */
later in that function seems vestigial; I don't think it can ever result in anything being fetched. I should open another bug for that.
First comment is that this branch is built on top of a branch for #5095 (moved), that's probably a mistake, yes? Especially because that branch contains an unsquashed fixup! commit.
I buy your logic here, but I'll want to look it over once again tomorrow to see if we can end up calling the function over and over here. I'll try the patch in the meantime regardless, because I'm convinced it's correct currently.