Rate limit router_pick_published_address log message

changed milestone to %Tor: 0.2.9.x-final

added component::core tor/tor milestone::Tor: 0.2.9.x-final priority::medium regression resolution::fixed severity::normal status::closed type::defect version::tor 0.2.8.9 labels

We should probably fix this in 0.2.9?

Trac:
Keywords: N/A deleted, regression added
Milestone: N/A to Tor: 0.2.9.x-final

Trac:
Cc: N/A to sirmatt@ksu.edu

I think we should track down when it got much more commonly called, and refactor or cache or whatever so it stops needing to do that.

For example, we ask ourselves if we know our address when making a decision about whether to go to authorities or other caches when fetching dir info.

as a kludge, you can just throw in a tor_assert_nonfatal_unreached() in your local copy of router_pick_published_address() and see what the stack traces say...

Trac:
Username: Knight
Cc: sirmatt@ksu.edu to sirmatt@ksu.edu, Knight

Actually, it looks like this is already cached, in config.c. So there's no harm in just moving the message. What do you think of just moving this log message from router_pick_published_address() to the end of resolve_my_address() ?

Seems sensible to me.

#20423 (moved) contains a report of this message being logged excessively, along with other issues.

I think the architecture fix that we will want is to recognize that for most of the calls to router_pick_published_address(), and especially the ones that we call often, what we really mean to ask is "did we figure out our address yet?", not "please go figure out our address if we haven't done it yet". So if we pass an "only look at the cached variables" option to the function, which asks it not to call the heavyweight functions, we'll be a lot happier. We'll want to make sure that we call the heavyweight functions once near startup. Those steps will make things better for the future.

But first! There is a mystery: It looks like in some circumstances we started calling this function a crazy number of times, starting in 0.2.8.x. And we didn't do that in 0.2.7.x. What changed?

I would indeed like to get some call graphs or stack traces or debug lines the like, for when it happens a whole lot.

Fortunately, on https://lists.torproject.org/pipermail/tor-relays/2016-November/010926.html, pastly said

I run Tor network simulations and see this all the time. I haven't taken
the time to look into it, and always have assumed it's just a minor
misconfiguration in the simulator or something that only exists
_because_ it's a simulation.

so I'm going to take that as a "yes, this crazy number of calls to router_pick_published_address is repeatable, and also I can help debug it." Is that a fair hope, pastly? :)

router_pick_published_address() is called in maint-0.2.9 from five places:

In control.c's getinfo_helper_misc(), in response to a 'getinfo address' request. I think it's fair to say that's not our culprit.
In dirserv.c's directory_fetches_from_authorities(). This is very likely the problem.
In router.c's router_rebuild_descriptor() when the current descriptor is dirty, to see if we don't know our address so we should short-circuit the rest of this function until we do.
In router.c's router_build_fresh_descriptor() to make sure we didn't try to build a descriptor mistakenly. I don't think rebuilding the descriptor happens too often, but hey I could be wrong.
In transports.c's pt_get_extra_info_descriptor_string(), to figure out what address to write in our extrainfo descriptor when we're offering a pluggable transport. I'm also not worried about this one.

directory_fetches_from_authorities() is called in maint-0.2.9 from six places:

directory.c's directory_get_from_dirserver(), when we launch a new directory fetch/post of any kind.
directory.c's dir_routerdesc_download_failed() and dir_microdesc_download_failed(), which are called when we didn't get a descriptor we wanted, to decide how to proceed.
dirserv.c's directory_fetches_dir_info_early(), which is used to decide schedules for fetching the new consensus and for fetching new descriptors we just learned about.
main.c's directory_info_has_arrived() which we call whenever we finish receiving any directory object.
main.c's fetch_networkstatus_callback(), which is called once a second during bootstrap and once a minute otherwise.
networkstatus.c's networkstatus_consensus_can_use_extra_fallbacks(), which is new and called from a variety of places so I'm suspicious of it.

Trac:
stdout-tor-1000.log.tgz

Log file from "never" used relay in a Shadow simulation. Decompresses to ~5MiB

Preliminary instrumentation makes me think the bursts are download_status_reset() calling find_dl_schedule() calling networkstatus_consensus_can_use_extra_fallbacks() calling directory_fetches_from_authorities() calling router_pick_published_address().

Oh hey, is download_status_reset() being called for every single descriptor?

For posterity, the commit in my 20610-instrument branch is what led me to the above statement.

You can run it as a relay, and grep your info-level logs for "router_pick_published_address" and "find_dl_schedule".

I posted a patch for this issue to #20423 (moved).

Patch on #20423 (moved) works to reduce spam.

Cool. Since the patch on #20423 (moved) was was merged, I'll close this ticket too. Please let me know if it should still be open.

Trac:
Status: new to closed
Resolution: N/A to fixed

closed

moved to tpo/core/tor#20610 (closed)

Rate limit router_pick_published_address log message

Child items 0

Activity