Opened 5 years ago

Closed 2 years ago

#13207 closed enhancement (wontfix)

Is rend_cache_clean_v2_descs_as_dir cutoff crazy high?

Reported by: arma Owned by: dgoulet
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-hs
Cc: Actual Points:
Parent ID: Points: 3
Reviewer: Sponsor: SponsorR-can

Description

  time_t cutoff = now - REND_CACHE_MAX_AGE - REND_CACHE_MAX_SKEW;

That's currently 3 days.

Yet

config.c:  V(RendPostPeriod,              INTERVAL, "1 hour"),

So we expect our current one to get overwritten once an hour, yet if the hidden service goes away (stops publishing) we continue to serve its old (presumably no longer working) descriptor for 71 more hours.

One downside to this poor tuning is that we waste the time of clients who try to access the hidden service -- if we instead tell them there is no descriptor, they could go to another hsdir or give up more promptly.

One upside is that if we say we don't have one, clients *will* go visit five other hsdirs before giving up, which could result in a lot of thrashing. Whereas providing an old one will keep the client distracted for a while. But that's not really a good reason.

Another downside is that it's easier to stuff our cache full of things, i.e. use hsdirs for arbitrary storage of blobs for three days.

Child Tickets

Change History (23)

comment:1 Changed 5 years ago by special

Yes.

First of all, this value is used by clients *and* hsdirs. Unless I've missed something, this means that a client which accesses a HS, waits two days without restarting, then tries again will reach out to all old intro points before fetching a new descriptor.

I think it's worth looking at this as two separate cases:

There is no value in serving descriptors older than the descriptor-id time period of 24 hours. A client with a clock this inaccurate can't be expected to function. These descriptors are unlikely to be queried and even less likely to be useful. The best implementation is to clean these when the descriptor-id for that service changes, with some fuzz for clock skew.

The current value is too high, considering that RendPostPeriod is usually one hour; a 4-hour-old descriptor on a HSDir is almost certainly useless. But, I don't see a very compelling reason to reduce it below 24 hours either. The value used here is effectively a maximum value for RendPostPeriod.

comment:2 Changed 5 years ago by nickm

Milestone: Tor: 0.2.???Tor: 0.2.6.x-final

comment:3 Changed 4 years ago by arma

The relevant code is

  /* Is descriptor too old? */
  if (parsed->timestamp < now - REND_CACHE_MAX_AGE-REND_CACHE_MAX_SKEW) {
    log_warn(LD_REND, "Service descriptor with service ID %s is too old.",
             safe_str_client(service_id));
    goto err;
  }
  /* Is descriptor too far in the future? */
  if (parsed->timestamp > now + REND_CACHE_MAX_SKEW) {
    log_warn(LD_REND, "Service descriptor with service ID %s is too far in "
                      "the future.", safe_str_client(service_id));
    goto err;
  }

I bet if we set MAX_SKEW to 18 hours, and MAX_AGE to 6 hours, then we'd get a lot of what we want here.

That is, in normal circumstances we'll keep the descriptor for 23 or 24 hours after it's generated (it's not exactly 24 because check out the d->timestamp -= d->timestamp % 3600; /* Round down to nearest hour */ line in rend_service_update_descriptor()), and accept descriptors from hidden services whose time is up to 17 or 18 hours in the future.

If we wanted to maintain the "up to 24 hours in the future" feature, then we would set MAX_AGE to 0 hours, and MAX_SKEW to 24 hours. But that sure is unintuitive.

Actually, I think our best bet here would be to remove REND_CACHE_MAX_AGE from existence (i.e. make it 0 and take it out), and come up with a better name for REND_CACHE_MAX_SKEW to indicate that it has to do with descriptor age too. Any good names?

comment:4 Changed 4 years ago by dgoulet

Milestone: Tor: 0.2.6.x-finalTor: 0.2.7.x-final

This one needs more testing but should be worked on soon since this affects HS reachability, a key aspect of SponsorR thus changing the milestone.

comment:5 Changed 4 years ago by nickm

Status: newassigned

comment:6 Changed 4 years ago by nickm

Keywords: 027-triaged-1-in added

Marking more tickets as triaged-in for 0.2.7

comment:7 Changed 4 years ago by isabela

Points: medium
Priority: normalmajor
Version: Tor: 0.2.7

comment:8 Changed 4 years ago by dgoulet

Parent ID: #13209

comment:9 Changed 4 years ago by sysrqb

Handling clock skew is always tricky. 24 hours seems reasonable. In this case:

1) HSDirs should only keep descriptors which were published most 24 hours, and fewer hours if the current descriptor is overwritten with a newer version or to prevent running out of memory
2) It doesn't seem like a client's cache needs this 24 hour cutoff (it could be less), but it doesn't seem like it will hurt.

Assuming the client's clock is not skewed significantly, the descriptor that the client possesses is deleted from the cache around the same time the descriptor becomes invalid. If the clock is skewed by fewer than 3 hours then this probably, maybe, likely won't impact onion service reachability. Any client running with a clock positively skewed by more than 3 hours will never have a usable recent consensus, and a client whose clock is negatively skewed by more than 3 hours will never have a usable consensus.

Are there objections to setting the cutoff at 26 or 27 hours? Then we have two options:

1) REND_CACHE_MAX_AGE = 24 hours, REND_CACHE_MAX_SKEW = 3 hours
2) REND_CACHE_DESC_LIFETIME = 27 hours

There may yet be an even better name.

Interestingly, rend-spec agrees with this sentiment:

Alice should cache the descriptor locally, but should not use
descriptors that are more than 24 hours older than their timestamp.

comment:10 in reply to:  3 ; Changed 4 years ago by sysrqb

Status: assignedneeds_revision

Replying to arma:

Actually, I think our best bet here would be to remove REND_CACHE_MAX_AGE from existence (i.e. make it 0 and take it out), and come up with a better name for REND_CACHE_MAX_SKEW to indicate that it has to do with descriptor age too. Any good names?

Ok. Branch bug13207 now does this, after writing it and tweaking it a few times and eventually getting rid of MAX_AGE. I don't have a better name for MAX_SKEW, so I left it.

Please review branch bug13207.

comment:11 in reply to:  10 Changed 4 years ago by sysrqb

Status: needs_revisionnew

Replying to sysrqb:

Please review branch bug13207.

Ok, that was indeed the wrong approach. It decided that we should assume the current descriptor is valid until the end of the current period (based on the local clock) plus 24 hours, accounting for clock skew. That really is a bad idea.

New approach: HSDirs should hold descriptors which have a publication time of less than 25 hours ago. We shouldn't care if the HSDir thinks it's responsible for the descriptor ID because this overrides the the logic for keeping descriptors for clients with skewed clocks.

If we set the cutoff at 25 hours then:

  • If a client's clock is 24 hours behind the service's clock, and the HSDir is in between them, then the HSDir should serve the descriptor (if it was given one in the past) despite the service publishing new descriptors on a new HSDir
    • Unfortunately, the intro points may be invalid
  • If the client's clock is 24 hours ahead of the service's clock, and the HSDir's clock where the client sends its query is in between, then the HSDir won't have the descriptor. The HSDir where the service publishes its descriptor should happily serve it, if anyone asks for it.
    • The service is still publishing its descriptors on the previous period's HSDirs
      • Should clients ask HSDirs from the previous period for a descriptor? At face value, this seems like a bad idea
  • If the client's clock is 24 hours behind the HSDir's clock, and the service is in between them, then the HSDir should serve the descriptor and it should still be valid. At some point the service will enter the next period and publish new descriptors there, but its old descriptor should remain for some time.
    • As above, this may make clients sad when the service changes its intro points or goes offline
  • If the client's clock is 24 hours ahead of the HSDir's clock, and the service is in between them, then the client will query the current HSDir until it enters the next period. There's an amount of time where the client queries the next set of HSDirs while the service is publishing to the current set.
  • If the HSDir's clock is 24 hours behind the service's clock, and the client is in between them, then the HSDir should have a descriptor which was published approximately 23 hours earlier (before the service moved to the next period)
  • If the HSDir's clock is 24 hours ahead of the service's clock, and the client is in between them, then the HSDir will serve the descriptor for 1 hour before it considered the descriptor expired.

comment:12 Changed 4 years ago by dgoulet

Keywords: TorCore201508 added
Owner: set to dgoulet
Status: newaccepted

comment:13 Changed 4 years ago by nickm

Milestone: Tor: 0.2.7.x-finalTor: 0.2.8.x-final

comment:14 Changed 4 years ago by dgoulet

Keywords: 027-triaged-1-in TorCore201508 removed

comment:15 Changed 4 years ago by nickm

Keywords: SponsorR removed
Sponsor: SponsorR

Bulk-replace SponsorR keyword with SponsorR sponsor field in Tor component.

comment:16 Changed 3 years ago by dgoulet

Milestone: Tor: 0.2.8.x-finalTor: 0.2.9.x-final
Priority: HighMedium
Severity: Normal
Version: Tor: 0.2.7

Moving this to 029. IMO, there are still unanswered question here on the HSDir behavior if we want to make that change. See below.

New approach: HSDirs should hold descriptors which have a publication time of less than 25 hours ago. We shouldn't care if the HSDir thinks it's responsible for the descriptor ID because this overrides the the logic for keeping descriptors for clients with skewed clocks.

Basically, drop the "are we responsible for desc ID?" check, and keep any descriptors that matches this timestamp heuristic? This is scaring me a bit that is descriptor can be used to store arbitrary data on _all_ HSDirs instead of 6...

I fear that without a statistics of how much clock skew on average we have in the network, the skew limit is a bit arbitrary. Tbh, 25 hours clock skew is massive for a client and even more for a service... I'm not even sure that service will be able to operate correctly... But let's assume we use that because worst case scenario, it seems there is no way we can use a crazy cutoff efficiently and still use the check on the descriptor ID, right?

Maybe it's the start of a new era where HSDir stores all the things but expires them quicker?...

comment:17 Changed 3 years ago by dgoulet

Sponsor: SponsorRSponsorR-can
Type: defectenhancement

Movement here. #18332 will remove the logic on the HSDir side of "am I responsible for that descriptor ID" so we can go back to actually considering the New approach: :)

comment:18 Changed 3 years ago by isabela

Milestone: Tor: 0.2.9.x-finalTor: 0.2.???

tickets market to be removed from milestone 029

comment:19 Changed 2 years ago by teor

Milestone: Tor: 0.2.???Tor: 0.3.???

Milestone renamed

comment:20 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.???Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:21 Changed 2 years ago by dgoulet

Parent ID: #13209
Points: medium3

Parent ID doesn't make sense.

comment:22 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:23 Changed 2 years ago by dgoulet

Resolution: wontfix
Status: acceptedclosed

Ok, I'm closing this one for a couple of reasons. First, it seems unclear on what is a sane value for a cutoff and anything we change might either introduce instability or need lots of testing.

Second, prop224 work (#12424) is fixing that by adding a expire time field in the descriptor (currently 3 hours).

Note: See TracTickets for help on using tickets.