I had 64 of those messages on my exit node, and three on my middle node.
As rransom explained to me, certificates are fetched, not received (as the message might indicate), so both of those nodes did something wrong at some point. Could the first one being an exit node explain the different behaviour? (I don't know the Tor specs well enough to comment on that.)
Tor version on both my middle and exit nodes is Tor 0.2.3.13-alpha (git-e783afd37ecddc1b)
Can somebody check their logs to see whether their Tor relays, running a similar version, have a similar entry?
Also, those 64 entries are almost exactly one minute apart from each other, and always about "maatuska".
They happened between Apr 10 04:06:57.000 and Apr 10 05:12:01.000, and have been silent since then.
On the middle node, the three entries were also about "maatuska", also one minute apart from each other, and ran from Apr 10 04:14:51.000 to Apr 10 04:16:53.000, since when they have remained silent.
Both of these nodes had an uptime of 4d21h when it happened. They were launched less than 10m apart from each other, so it might very well be uptime-related somehow, as the difference in uptime closely matches the difference in the times at which those messages began.
Just in case it might be relevant: the middle node sees about 10 times more traffic on average than the exit node.
I had 64 of those messages on my exit node, and three on my middle node.
As rransom explained to me, certificates are fetched, not received (as the message might indicate), so both of those nodes did something wrong at some point. Could the first one being an exit node explain the different behaviour? (I don't know the Tor specs well enough to comment on that.)
Yes. Exit nodes fetch directory information earlier than most relays.
Tor version on both my middle and exit nodes is Tor 0.2.3.13-alpha (git-e783afd37ecddc1b)
Can somebody check their logs to see whether their Tor relays, running a similar version, have a similar entry?
This is not needed -- 64 repeated fetches of the same directory document are unacceptable, and must be fixed.
We would like to know whether this bad behaviour happens in Tor 0.2.2.x, though.
Both of these nodes had an uptime of 4d21h when it happened. They were launched less than 10m apart from each other, so it might very well be uptime-related somehow, as the difference in uptime closely matches the difference in the times at which those messages began.
The relays' uptime is not relevant -- this happened because maatuska generated a new directory-signing certificate.
The right solution here will probably involve looking very hard at the code that currently makes certificate downloads get attempted and/or reattempted, and seeing what, if anything, rate-limits attempts to re-download a certificate that has failed, then figuring out why this is happening. Insights welcome.
The download_status_t object that gets used for certificates is the one in cert_list_t, which is a per-authority structure. We really need a download_status_t that works on a per-certificate basis to avoid this kind of bug.
Fortunately, cert_list_t is only used in routerlist.c, so in theory the changes should be easy to isolate.
(Only 4 functions touch the dl_status_t element of cert_list_t: get_cert_list(), authority_cert_dl_failed(), authority_cert_dl_looks_uncertain(), and authority_certs_fetch_missing().)
At present, authority_certs_fetch_missing() generates a list of identity digests of trusted dir servers and signers in the provided consensus. This is wrong when it encounters an object in the consensus signed with a certificate other than the newest one for that authority: it will repeatedly try to download the newest certificate for that authority, emit this warning when it sees it already has that one, and never fetch the certificate it actually needs, so it never stops re-requesting.
The solution is to modify authority_certs_fetch_missing() to assemble two lists of missing certificates: one, by identity digest, for authorities we have no certificate for in trusted_dir_servers, fetched with /tor/keys/fp/ requests just as the current implementation does; and another of (identity digest, signing key digest) pairs for the signing certificates needed to verify a consensus, fetched with a /tor/keys/fp-sk/... request. Since nothing actually uses /tor/keys/fp-sk at the moment, implementation will proceed once we've verified that these requests actually work.
The fp-sk handler was added in tor-0.2.1.9-alpha. So, assuming that it works, and has worked since at least 0.2.2, we can safely assume that every directory supports it, since 0.2.1 is no longer supported on the network IIRC.
There's a fix for this on maint-0.2.3 in my bug5595 branch; it'll need a squash and forward-porting to maint-0.2.4 and master, and more testing. So far it hasn't exploded or done anything weird in early testing in client and middle-relay configurations.
Woopsie, jumped the gun a bit there - what didn't explode turned out to be a vanilla 0.2.3 binary. That's what I get for testing on so little sleep.
Anyway, more testing, a bit of tweaking and several bugfixes later, this really and truly looks to work just fine. See the squashed and rebased against latest maint-0.2.3 version in my bug5595-rebased-and-squashed-0.2.3 branch.
I'd like to request basic unit tests on the fp_pair_t map code.
fp_pair_map_{set,get}_by_digests could save some duplicated code by being wrappers around fp_pair_map_{set,get}.
Should we still be calling authority_cert_dl_failed() from trusted_dirs_load_certs_from_string()? Even if so, I'm not sure the comment in front of that point, starting with "a duplicate on a download", is accurate any more.
Should the log_warn in authority_cert_dl_failed be LD_BUG?
Does authority_cert_dl_looks_uncertain need a variant that looks at id/sk failures? Or should it look at the number of id/sk failures itself?
When constructing the fp_pair string, I would be much more comfortable with something tor_asprintf()-based. I don't believe there are any bugs in what you have now, but I want to get us out of the habit of doing string construction like this.
I'd like to request basic unit tests on the fp_pair_t map code.
fp_pair_map_{set,get}_by_digests could save some duplicated code by being wrappers around fp_pair_map_{set,get}.
Okay, I'll do those.
Should we still be calling authority_cert_dl_failed() from trusted_dirs_load_certs_from_string()? Even if so, I'm not sure the comment in front of that point, starting with "a duplicate on a download", is accurate any more.
I don't think we should remove the check; there are probably still possible edge cases where a duplicate cert might get downloaded. The comment could stand rewording in light of fixing this bug, though.
Should the log_warn in authority_cert_dl_failed be LD_BUG?
Changed.
Does authority_cert_dl_looks_uncertain need a variant that looks at id/sk failures? Or should it look at the number of id/sk failures itself?
No it doesn't; it's only called from networkstatus_check_consensus_signature() which is concerned with the trusted auth cert downloads.
When constructing the fp_pair string, I would be much more comfortable with something tor_asprintf()-based. I don't believe there are any bugs in what you have now, but I want to get us out of the habit of doing string construction like this.
Seems like it would be wise to try a certificate rotation with this in place in 0.2.4, before backporting it to 0.2.3. Then we'd at least explode the Tors of people who know they're running an alpha. :)
Seems like it would be wise to try a certificate rotation with this in place in 0.2.4, before backporting it to 0.2.3. Then we'd at least explode the Tors of people who know they're running an alpha. :)
Yeah, surviving a cert rotation without exploding or repeatedly refetching is the real test of this; since it'd already have a cert for every authority in that case, it'd end up using the new fp-sk queries rather than the fp ones.
Since an earlier draft of this didn't include the test of whether we already knew any certificate at all for an authority, on bootstrapping it would launch both the fp requests for the authority certs and the fp-sk requests for all the specific signing keys seen in the consensus, get identical results from both, and then whichever finished later would emit a warning about duplicates. That prompted the revision in question, but it also makes me confident that the fp-sk requests do work correctly. Testing bootstrapping doesn't exercise any scenario in which only fp-sk requests are used, as in a certificate rotation, though.
It's probably worth looking into how difficult it is to simulate a certificate rotation with Chutney, too. If we ever manage to turn Chutney into a real integration test suite, that should probably be one of the scenarios it runs.