as a kludge, you can just throw in a tor_assert_nonfatal_unreached() in your local copy of router_pick_published_address() and see what the stack traces say...
Actually, it looks like this is already cached, in config.c. So there's no harm in just moving the message. What do you think of just moving this log message from router_pick_published_address() to the end of resolve_my_address() ?
I think the architecture fix that we will want is to recognize that for most of the calls to router_pick_published_address(), and especially the ones that we call often, what we really mean to ask is "did we figure out our address yet?", not "please go figure out our address if we haven't done it yet". So if we pass an "only look at the cached variables" option to the function, which asks it not to call the heavyweight functions, we'll be a lot happier. We'll want to make sure that we call the heavyweight functions once near startup. Those steps will make things better for the future.
But first! There is a mystery: It looks like in some circumstances we started calling this function a crazy number of times, starting in 0.2.8.x. And we didn't do that in 0.2.7.x. What changed?
I would indeed like to get some call graphs or stack traces or debug lines the like, for when it happens a whole lot.
I run Tor network simulations and see this all the time. I haven't takenthe time to look into it, and always have assumed it's just a minormisconfiguration in the simulator or something that only exists_because_ it's a simulation.
so I'm going to take that as a "yes, this crazy number of calls to router_pick_published_address is repeatable, and also I can help debug it." Is that a fair hope, pastly? :)
router_pick_published_address() is called in maint-0.2.9 from five places:
In control.c's getinfo_helper_misc(), in response to a 'getinfo address' request. I think it's fair to say that's not our culprit.
In dirserv.c's directory_fetches_from_authorities(). This is very likely the problem.
In router.c's router_rebuild_descriptor() when the current descriptor is dirty, to see if we don't know our address so we should short-circuit the rest of this function until we do.
In router.c's router_build_fresh_descriptor() to make sure we didn't try to build a descriptor mistakenly. I don't think rebuilding the descriptor happens too often, but hey I could be wrong.
In transports.c's pt_get_extra_info_descriptor_string(), to figure out what address to write in our extrainfo descriptor when we're offering a pluggable transport. I'm also not worried about this one.
directory_fetches_from_authorities() is called in maint-0.2.9 from six places:
directory.c's directory_get_from_dirserver(), when we launch a new directory fetch/post of any kind.
directory.c's dir_routerdesc_download_failed() and dir_microdesc_download_failed(), which are called when we didn't get a descriptor we wanted, to decide how to proceed.
dirserv.c's directory_fetches_dir_info_early(), which is used to decide schedules for fetching the new consensus and for fetching new descriptors we just learned about.
main.c's directory_info_has_arrived() which we call whenever we finish receiving any directory object.
main.c's fetch_networkstatus_callback(), which is called once a second during bootstrap and once a minute otherwise.
networkstatus.c's networkstatus_consensus_can_use_extra_fallbacks(), which is new and called from a variety of places so I'm suspicious of it.
Preliminary instrumentation makes me think the bursts are download_status_reset() calling find_dl_schedule() calling networkstatus_consensus_can_use_extra_fallbacks() calling directory_fetches_from_authorities() calling router_pick_published_address().
Oh hey, is download_status_reset() being called for every single descriptor?