Opened 2 years ago

Closed 21 months ago

Last modified 14 months ago

#16381 closed defect (fixed)

Bad timestamp check when storing an HS descriptor on the client

Reported by: dgoulet Owned by: dgoulet
Priority: High Milestone: Tor: 0.2.7.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-hs, TorCore201508, PostFreeze027, 2016-bug-retrospective
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor: SponsorR

Description

Fortunately, this bug "should" happen rarely in the real Tor network but still could lead to reachability issues for high traffic hidden service.

Let's assume that all introduction points (IP) of an hidden service need to be changed (intro2 max count or expiring) in a single RendPostPeriod (default is 1 hour), we do the intro dance with our newly picked introduction points, close the old ones and upload the new descriptor. The usual.

Now, let's say we have a client that did connect to the service before in the same hour so the client has the descriptor in its cache. Then, the client tries to connect again, the IPs located in the cached descriptor are used and the client will eventually get a NACK from all of them since the service just cycled them all. This will eventually trigger a refetch of the descriptor since we have no more IP that we can use. So far so good.

Here is when it goes wrong, when receiving the new descriptor just published by the HS, we do a series of check in rend_cache_store_v2_desc_as_client() which all passes correctly in our case except this one:

  e = (rend_cache_entry_t*) strmap_get_lc(rend_cache, key);
  if (e && e->parsed->timestamp >= parsed->timestamp) {
    log_info(LD_REND, "We already have a new enough service descriptor for "
                      "service ID %s with the same desc ID and version.",
             safe_str_client(service_id));
    goto okay;
  }

The HS protocol speficies that the timestamp in the descriptor MUST be rounded down to the nearest hour. For instance, if it's 16h54, it will be set to 16h00. See rend_service_update_descriptor():

  d->timestamp = time(NULL);
  d->timestamp -= d->timestamp % 3600; /* Round down to nearest hour */

The side effect of this is that if an HS uploads a new descriptor with a new set of IPs, they won't be usable until the next hour because the tor client doesn't keep the new descriptor even though the intro points did actually change. Thus this means that anything that changes in a time period of one hour will not be seen by the client if the descriptor was already cached before.

Keep in mind that a client reuses a rendez-vous circuit but stops after MaxCircuitDirtiness (default: 10 minutes) so after 10 minutes it will redo the introduction and rendezvous circuits.

(I was able to trigger this in a chutney network by setting a lifetime of 30 seconds for introduction point which creates a lot of rotation.)

I think the fix here should look like this (fun fact, it's what rend_cache_store_v2_desc_as_dir() does):

- if (e && e->parsed->timestamp >= parsed->timestamp) {
+ if (e && e->parsed->timestamp > parsed->timestamp) {

Child Tickets

TicketSummaryOwner
#16389Redesign the HS client descriptor cache

Change History (22)

comment:1 Changed 2 years ago by arma

Exciting. I remember writing that = not too long ago, and I remember I had a reason for it.

comment:2 follow-up: Changed 2 years ago by arma

Yes, see #14219 for when this bug went in. git show 9407040c5921

I think we can't just change >= to >, because then when we get the same descriptor, we will not trigger the

+    log_info(LD_REND, "We already have a new enough service descriptor for "
                       "service ID %s with the same desc ID and version.",

part.

It looks like we want to actually see if the descriptor is different.

Also, if the timestamps really are only at the hour granularity, I think we can no longer use them for time ordering of hs descs. Now might be a great time to brainstorm totally different approaches, like putting a counter in the hs desc.

comment:3 Changed 2 years ago by arma

  • Priority changed from normal to major

comment:4 follow-up: Changed 2 years ago by arma

Another option is to redesign the logic to be cleaner, and keep an intro point failure cache. It seems like every time we encounter this "we keep state about the previous hs desc by keeping a copy of it, and then seeing if the new one is different, get it?" logic, somebody finds it confusing. Maybe now is a good time for it to die? :)

comment:5 in reply to: ↑ 2 Changed 2 years ago by arma

Replying to arma:

Yes, see #14219 for when this bug went in. git show 9407040c5921

Regression introduced in Tor 0.2.6.3-alpha I think.

comment:6 Changed 2 years ago by dgoulet

FYI, this is another way to reproduce it which makes this bug much more serious (thanks arma to set it to major)

  1. Start HS
  2. Connect to it with client
  3. Restart HS
  4. Client is unable to connect until the next hour.

Thanks to special for pointing it out:

16:24 < special> so is this reproducible with: 1) start HS; 2) access from a tor client; 3) restart HS; 4) access from the same client

comment:7 Changed 2 years ago by arma

To make it clearer, the reason we can't just strcmp the old desc with the new one is that each hidden service actually makes *two* descriptors per publish -- one per replica. That's exactly what bug #14219 was about.

comment:8 in reply to: ↑ 4 Changed 2 years ago by special

Replying to arma:

Another option is to redesign the logic to be cleaner, and keep an intro point failure cache.

+1 to this approach. This is essentially the same as how we handle descriptors via last_hid_serv_requests.

comment:9 follow-up: Changed 2 years ago by arma

I think we should try to do a straight revert of 9407040c5921 in 0.2.6.x (the old behavior wasn't so bad), and then think about fixing this more thoroughly after that.

comment:10 in reply to: ↑ 9 Changed 2 years ago by dgoulet

  • Status changed from new to needs_review

Replying to arma:

I think we should try to do a straight revert of 9407040c5921 in 0.2.6.x (the old behavior wasn't so bad), and then think about fixing this more thoroughly after that.

I agree here, this regression is worst so let's go back to the original behavior that at least allows the client to connect and figure out a better solution in the short term (#16389).

Here is the revert branch: bug16381_026_01-revert

comment:11 Changed 2 years ago by nickm

Okay, merged to 0.2.6 and forward. Please change status as appropriate?

comment:12 Changed 2 years ago by dgoulet

  • Status changed from needs_review to new

Ok thanks nickm! Switching back to "new" so we can actually address this bug now by resolving the child ticket.

comment:13 follow-up: Changed 23 months ago by Alan

2.6.9 (the most recent release) contains the regression. Will the reverted behavior be released as 2.6.10, or will a fix for the regression not appear until the new code is released in 2.7?

comment:14 in reply to: ↑ 13 ; follow-up: Changed 23 months ago by dgoulet

Replying to Alan:

2.6.9 (the most recent release) contains the regression. Will the reverted behavior be released as 2.6.10, or will a fix for the regression not appear until the new code is released in 2.7?

Yes, it's already in the 0.2.6.x git branch so next 0.2.6 release, it will have it.

See commit 8acf5255c20c667f32313ee672c85f6ae00a4f87

comment:15 in reply to: ↑ 14 ; follow-up: Changed 23 months ago by Alan

it's already in the 0.2.6.x git branch so next 0.2.6 release, it will have it.

That part I understood from the previous comments. My question was, will there be another 0.2.6.x release? Is there an anticipated timeframe?

Last edited 23 months ago by Alan (previous) (diff)

comment:16 in reply to: ↑ 15 ; follow-up: Changed 23 months ago by nickm

Replying to Alan:

it's already in the 0.2.6.x git branch so next 0.2.6 release, it will have it.

That part I understood from the previous comments. My question was, will there be another 0.2.6.x release?

Yes.

Is there an anticipated timeframe?

Within the next week or so, I think (but that's an estimate, not a promise).

comment:17 in reply to: ↑ 16 Changed 23 months ago by Alan

Thanks

comment:18 Changed 22 months ago by dgoulet

  • Keywords TorCore201508 added
  • Owner set to dgoulet
  • Status changed from new to accepted

We expect the child ticket to get merged in 027 so setting this one for august and 027 final.

comment:19 Changed 21 months ago by nickm

  • Keywords PostFreeze027 added

I'd merge patches for these for 0.2.7 if they come in on time. In some cases, that will require figuring out an as-yet-unsolved bugs.

comment:20 Changed 21 months ago by dgoulet

  • Resolution set to fixed
  • Status changed from accepted to closed

Bad commit has been reverted and child ticket resolved. Closing.

comment:21 Changed 20 months ago by dgoulet

  • Keywords SponsorR removed
  • Sponsor set to SponsorR

comment:22 Changed 14 months ago by nickm

  • Keywords 2016-bug-retrospective added

Mark more tickets for bug retrospective based on hand-review of changelogs from 0.2.5 onwards.

Note: See TracTickets for help on using tickets.