Opened 7 years ago

Closed 5 years ago

#5595 closed defect (fixed)

Some relays tried to refetch maatuska's new certificate repeatedly

Reported by: rransom Owned by: andrea
Priority: Very High Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-relay 023-backport
Cc: koolfy Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

koolfy reports the following log message on one of his relays:

05:12:01 [WARN] Got a certificate for maatuska, but we already have it. Maybe they haven't updated it. Waiting for
a while. [64 duplicates hidden]

“64 duplicates hidden” is bad.

Child Tickets

Change History (27)

comment:1 Changed 7 years ago by koolfy

I had 64 of those message on my exit node, and three on my middle node.

As rransom explained to me, certificates are fetched, not received (as the message might indicate), so both of those nodes did something wrong at some point. Might the first one being an exit node be the cause of a different behaviour ? (I don't know tor specs well enough to comment on that.)

Tor version on both my middle and exit nodes is Tor 0.2.3.13-alpha (git-e783afd37ecddc1b)

Can somebody check on his logs that a similar version of his tor relays might have a similar entry ?

Also, those 64 entries are almost exactly one minute away from eachother, and always with "maatuska".
They happenned between Apr 10 04:06:57.000 and Apr 10 05:12:01.000 and have been silent since then.

On the middle node, the three entries were also about "maatuska", also one minute away from each other, and also started around Apr 10 04:14:51.000 to and at Apr 10 04:16:53.000 since when they remained silent.

Both of these nodes had an uptime of 4d21h when it happenned. They were launched at less than 10m appart from eachother, so it might very well be uptime-related somehow, as the difference in uptime matches closely the difference in the time at which those messages began.

Just in case it might be relevant : the middle node is seeing about 10 times more traffic on average than the exit node.

comment:2 in reply to:  1 Changed 7 years ago by rransom

Replying to koolfy:

I had 64 of those message on my exit node, and three on my middle node.

As rransom explained to me, certificates are fetched, not received (as the message might indicate), so both of those nodes did something wrong at some point. Might the first one being an exit node be the cause of a different behaviour ? (I don't know tor specs well enough to comment on that.)

Yes. Exit nodes fetch directory information earlier than most relays.

Tor version on both my middle and exit nodes is Tor 0.2.3.13-alpha (git-e783afd37ecddc1b)

Can somebody check on his logs that a similar version of his tor relays might have a similar entry ?

This is not needed -- 64 repeated fetches of the same directory document are unacceptable, and must be fixed.

We would like to know whether this bad behaviour happens in Tor 0.2.2.x, though.

Both of these nodes had an uptime of 4d21h when it happenned. They were launched at less than 10m appart from eachother, so it might very well be uptime-related somehow, as the difference in uptime matches closely the difference in the time at which those messages began.

The relays' uptime is not relevant -- this happened because maatuska generated a new directory-signing certificate.

comment:3 Changed 7 years ago by nickm

Milestone: Tor: 0.2.3.x-final

Marking for 0.2.3.x, but it's backportable if it's clean

comment:4 Changed 7 years ago by nickm

The right solution here will probably involve looking very hard at the code that currently makes certificate downloads get attempted and/or reattempted, and seeing what, if anything, rate-limits attempts to re-download a certificate that has failed, then figuring out why this is happening. Insights welcome.

comment:5 Changed 7 years ago by rransom

See also #2991.

comment:6 Changed 7 years ago by nickm

Milestone: Tor: 0.2.3.x-finalTor: 0.2.4.x-final

Marking for 0.2.4.x, but it's backportable if it's clean

comment:7 Changed 7 years ago by nickm

Keywords: tor-relay added

comment:8 Changed 7 years ago by nickm

Component: Tor RelayTor

comment:9 Changed 7 years ago by nickm

Well, this is a pile of junkiness.

The download_status_t object that gets used for certificates is one in cert_list_t , which is a per-authority structure. We really need a download_status_t that works on a per-certificate basis to avoid this kind of bug.

Fortunately, cert_list_t is only used in routerlist.c, so in theory the changes should be easy to isolate.

comment:10 Changed 7 years ago by nickm

(Only 4 functions touch the dl_status_t element of cert_list_t: get_cert_list(), authority_cert_dl_failed(), authority_cert_dl_looks_uncertain(), and authority_certs_fetch_missing().)

comment:11 Changed 7 years ago by nickm

Keywords: 023-backport added

comment:12 Changed 7 years ago by andrea

Owner: set to andrea
Status: newassigned

comment:13 Changed 7 years ago by andrea

At present, authority_certs_fetch_missing() generates a list of identity digests of trusted dir servers and signers in the provided consensus; this is wrong in the case that it encounters a signed object in the consensus with a certificate other than the newest for that authority, and will cause it to repeatedly try to download the newest certificate for that authority, emit this warning when it sees it already has that one, and never get the one it actually needs to stop re-requesting it.

The solution is to modify authority_certs_fetch_missing() to assemble two lists of missing certificates, one by identity digest for any we don't have in trusted_dir_servers, and use /tor/keys/fp/<identity-digest> requests just as the current implementation does, but also assemble a list of (identity digest, signing key digest) pairs for signing certificates needed to verify a consensus and launch another request using /tor/keys/fp-sk/... for those certificates. Since nothing actually uses /tor/keys/fp-sk at the moment, implementation of this will occur pending verification that these requests actually work.

comment:14 Changed 7 years ago by nickm

The fp-sk handler was added in tor-0.2.1.9-alpha. So, assuming that it *works*, and has worked since at least 0.2.2, we can safely assume that every directory supports it, since 0.2.1 is no longer supported on the network IIRC.

comment:16 Changed 6 years ago by andrea

Status: assignedneeds_review

There's a fix for this on maint-0.2.3 in my bug5595 branch; it'll need a squash and porting-forward to maint-0.2.4 and master, and more testing. So far it hasn't exploded or done anything weird in early testing in client and middle relay configurations.

comment:17 Changed 6 years ago by andrea

Woopsie, jumped the gun a bit there - what didn't explode turned out to be a vanilla 0.2.3 binary. That's what I get testing on so little sleep.

Anyway, more testing, a bit of tweaking and several bugfixes later, this really and truly looks to work just fine. See the squashed and rebased against latest maint-0.2.3 version in my bug5595-rebased-and-squashed-0.2.3 branch.

comment:18 Changed 6 years ago by nickm

Status: needs_reviewneeds_revision

Looks good! Here's what I saw while reading it.

I'd like to request basic unit tests on the fp_pair_t map code.

fp_pair_map_{set,get}_by_digests could save some duplicated code by being a wrapper on fp_pair_map_{set,get}.

Should we still be calling "authority_cert_dl_failed" from trusted_dirs_load_certs_from_string? Even if so, I 'm not sure the comment in front of that point starting with "a duplicate on a download" any more.

Should the log_warn in authority_cert_dl_failed be LD_BUG?

Does authority_cert_dl_looks_uncertain need a variant that looks at id/sk failures? Or should it look at the number of id/sk failures itself?

When constructing the fp_pair string, I would be much more comfortable with something tor_asprintf()-based. I don't believe there are any bugs in what you have now, but I want to get us out of the habit of doing string construction like this.

comment:19 in reply to:  18 Changed 6 years ago by andrea

Replying to nickm:

Looks good! Here's what I saw while reading it.

I'd like to request basic unit tests on the fp_pair_t map code.

fp_pair_map_{set,get}_by_digests could save some duplicated code by being a wrapper on fp_pair_map_{set,get}.

Okay, I'll do those.

Should we still be calling "authority_cert_dl_failed" from trusted_dirs_load_certs_from_string? Even if so, I 'm not sure the comment in front of that point starting with "a duplicate on a download" any more.

I don't think we should remove the check; there are probably still possible edge cases where a duplicate cert might get downloaded. The comment could stand rewording in light of fixing this bug, though.

Should the log_warn in authority_cert_dl_failed be LD_BUG?

Changed.

Does authority_cert_dl_looks_uncertain need a variant that looks at id/sk failures? Or should it look at the number of id/sk failures itself?

No it doesn't; it's only called from networkstatus_check_consensus_signature() which is concerned with the trusted auth cert downloads.

When constructing the fp_pair string, I would be much more comfortable with something tor_asprintf()-based. I don't believe there are any bugs in what you have now, but I want to get us out of the habit of doing string construction like this.

Okay, changed.

comment:20 Changed 6 years ago by nickm

I think that using smartlist_join_strings() with "+" won't work, since the first element of the list is "fp-sk/"

Other than that, the changes look fine.

comment:21 Changed 6 years ago by andrea

Status: needs_revisionneeds_review

See my bug5595-v2-squashed branch for second draft.

comment:22 Changed 6 years ago by nickm

Looks okay to me!

comment:23 Changed 6 years ago by andrea

Milestone: Tor: 0.2.4.x-finalTor: 0.2.3.x-final
Status: needs_reviewnew

This is merged into maint-0.2.4 and master now; resetting to new and changing milestone to 0.2.3.x-final pending decision on 0.2.3 merge.

comment:24 Changed 6 years ago by arma

Seems like it would be wise to try a certificate rotation with this in place in 0.2.4, before backporting it to 0.2.3. Then we'd at least explode the Tors of people who know they're running an alpha. :)

comment:25 in reply to:  24 Changed 6 years ago by andrea

Replying to arma:

Seems like it would be wise to try a certificate rotation with this in place in 0.2.4, before backporting it to 0.2.3. Then we'd at least explode the Tors of people who know they're running an alpha. :)

Yeah, surviving a cert rotation without exploding or repeatedly refetching is the real test of this; since it'd already have a cert for every authority in that case, it'd end up using the new fp-sk queries rather than the fp ones.

Since an earlier draft of this didn't include the test of whether we already knew any certificate at all for an authority, on bootstrapping it would launch both the fp requests for the authority certs and fp-sk requests for all the specific signing keys seen in the consensus, and then get identical results for both, and whichever finished later would give a warning about duplicates. That prompted the revision in question, but also makes me confident the fp-sk requests do work correctly. Testing bootstrapping doesn't provide any scenario in which *only fp-sk* requests are used as in a certificate rotation, though.

It's probably worth looking into how difficult it is to simulate a certificate rotation with Chutney, too. If we ever manage to turn Chutney into a real integration test suite, that should probably be one of the scenarios it runs.

comment:26 Changed 6 years ago by nickm

Status: newneeds_review

comment:27 Changed 5 years ago by nickm

Milestone: Tor: 0.2.3.x-finalTor: 0.2.4.x-final
Resolution: fixed
Status: needs_reviewclosed

Marking a batch of tickets that had been under consideration for 0.2.3 backport as fixed-in-0.2.4. There is no plan for further 0.2.3 releases.

Note: See TracTickets for help on using tickets.