Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#16389 closed defect (fixed)

Redesign the HS client descriptor cache

Reported by: dgoulet Owned by:
Priority: High Milestone: Tor: 0.2.8.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-hs
Cc: asn Actual Points:
Parent ID: #16381 Points:
Reviewer: Sponsor: SponsorR

Description (last modified by nickm)

We merged a patch here, but see arma's comment below.

Bug #16381 vs #14219 demonstarted an important issue with the current design of the HS client descriptor cache.

In a nutshell, we can't rely on the timestamp to order descriptors because the timestamp is rounded down to the nearest hour thus any change in that time period will never be seen by the client (#16381). Furthermore, we can't rely on comparing the descriptor content because we have two replicas with the same content but have different descriptor ID (#14219).

One solution proposed by arma (and +1 by special) is to redesign the logic to be cleaner, and keep an intro point failure cache. It seems like every time we encounter this "we keep state about the previous hs desc by keeping a copy of it, and then seeing if the new one is different, get it?" logic, somebody finds it confusing. Maybe now is a good time for it to die? :)

Child Tickets

Change History (20)

comment:1 Changed 4 years ago by dgoulet

Description: modified (diff)

comment:2 Changed 4 years ago by dgoulet

Ok I took a stab at it so we have a starting point!

https://people.torproject.org/~dgoulet/volatile/client-hs-desc-cache.txt

I'm unsure how best to proceed thus I put it on people.tpo for now, maybe a pad at some point if the above is just off track :). It just felt a bit too big/much for a ticket and discussion.

Once we start the implementation, we MUST add the cache behavior as a fat comment, I would actually be incline to have seperate files with proper documentation on top for this to stop adding to the massive C files we currently have :).

comment:3 Changed 4 years ago by asn

Cc: asn added

comment:4 Changed 4 years ago by dgoulet

Status: newneeds_review

Here is an attempt at this new cache. Feedback is very welcome on both the code and cache behavior.

See branch: bug16389_027_01

comment:5 Changed 4 years ago by nickm

Status: needs_reviewneeds_revision

Suggestions:

  • I would be more cormfortable if cache_failure_intro_lookup set *entry to NULL on lookup failure. Leaving it unset seems error-prone to me.
  • use uint8_t* for pointers to bytes, and char* for pointers to text. So, if you're going to pass it as a key for a strmap, have it be char*; if you're going to pass it as a key for a digestmap, have it be uint8_t. And document what format each argument is in in the documentation, how long it should be, etc.
  • Make a rend_cache_failure_entry_new() and rend_cache_failure_entry_free(), even if they're just wrappers around tor_malloc and tor_free. This will make the structure easier to maintain if you add anything else to it.
  • As above but for rend_cache_failure_intro_t
  • In cache_failure_intro_add, maybe only add fail_entry to rend_cache_failure if it's new?
  • should "unsigned int failure" be an enum?
  • Does the new "goto_err" in rend_cache_store_v2_desc_as_client() correctly make us close all the pending connections?


Bigger issues:

  • Should "signal NEWNYM" affect this cache? Guessing yes.
  • Should the entries in this cache ever expire? I think probably yes.
  • Removing the timestamp check in rend_cache_store_v2_desc_as_client() makes me worry about replacing a newer descriptor with an older one. Should I be worried about that?

comment:6 in reply to:  5 ; Changed 4 years ago by dgoulet

Replying to nickm:

Suggestions:

All suggestions are great! I will work on those!

  • should "unsigned int failure" be an enum?

It should! but currently the failures are #define hence the current unsigned int. I had a plan (volatile) to make a patch to change those to an enum eventually.

  • Does the new "goto_err" in rend_cache_store_v2_desc_as_client() correctly make us close all the pending connections?

There shouldn't be any pending connections to those intro points because we are parsing the fetched descriptor and making sure the IP entries in there are in the failure cache or not. After validation, if the intro node list is empty it means we can't use any thus we goto err. At that point, no new connections have been established to those intro points nor are already established.


Bigger issues:

  • Should "signal NEWNYM" affect this cache? Guessing yes.

I thought about clearing this cache on a NEWNYM but I'm unsure... Failures are quite linked to the consensus and current "state" of the network that is if IPs aren't usuable, it goes beyond the scope of the cache and circuit state. A NEWNYM will clean the descriptor cache but I think we need to keep the IP failure so if we download again that same old descriptor, we won't initiate anything. I can be convinced that clearing this cache would be desirable like for example if we decide to link the IP expiry time to the descriptor expiry time. (see next question)

  • Should the entries in this cache ever expire? I think probably yes.

That was also a concern of mine. So this patch has a "natural rotation" for the cache entries meaning that when we download a new descriptor, it removes every IP that are not listed in the descriptor (for a service ID). But if an IP is re-used we can end up not using it even if it's suppose to work with that new descriptor.

So, what would be a wise expiry time? Here are some I can think of:

  1. Remove the service ID entry (meaning all the IP associated with an onion address) when the descriptor expires (rend_cache_clean). Currently a valid period of 24 hours.
  2. When we get a new consensus, we could consider that it indicates a "new era" of connectivity :)
  3. Expire them at each RendPostPeriod default value (what most HS will use?). (1 hour)

We could also expire entries at a different rate based on their failure type. A NACK is much more different than a UNREACHABLE for instance but that's maybe a future enhancement.

In terms of simplicity for now, I'm incline to use 1. especially since the branch in #4862 makes it that the HS only uploads its descriptor when all IPs have been established. We avoid the race of fetching the descriptor before an IP is ready and thus not using it for 24 hours.

  • Removing the timestamp check in rend_cache_store_v2_desc_as_client() makes me worry about replacing a newer descriptor with an older one. Should I be worried about that?

Hrm, I thought it was harmless to remove it because an HSDir doesn't accept an older descriptor but then that does fix the fact that an evil HSDir could simply serve older descriptor non stop and break clients for a long time making them add lots of IPs to their failure cache. Good catch, I'll put it back!

New branch coming soon.

comment:7 in reply to:  6 ; Changed 4 years ago by nickm

Replying to dgoulet:

Replying to nickm:

Suggestions:

All suggestions are great! I will work on those!

  • should "unsigned int failure" be an enum?

It should! but currently the failures are #define hence the current unsigned int. I had a plan (volatile) to make a patch to change those to an enum eventually.

Okay. Let's try to get that in with this?

  • Does the new "goto_err" in rend_cache_store_v2_desc_as_client() correctly make us close all the pending connections?

There shouldn't be any pending connections to those intro points because we are parsing the fetched descriptor and making sure the IP entries in there are in the failure cache or not. After validation, if the intro node list is empty it means we can't use any thus we goto err. At that point, no new connections have been established to those intro points nor are already established.

Hm. I think I meant connection_edge connections from the client?

Bigger issues:

  • Should "signal NEWNYM" affect this cache? Guessing yes.

I thought about clearing this cache on a NEWNYM but I'm unsure... Failures are quite linked to the consensus and current "state" of the network that is if IPs aren't usuable, it goes beyond the scope of the cache and circuit state. A NEWNYM will clean the descriptor cache but I think we need to keep the IP failure so if we download again that same old descriptor, we won't initiate anything. I can be convinced that clearing this cache would be desirable like for example if we decide to link the IP expiry time to the descriptor expiry time. (see next question)

I think that clearing the cache is probably a good idea; otherwise there's a way to correlate access across a NEWNYM, right?

  • Should the entries in this cache ever expire? I think probably yes.

That was also a concern of mine. So this patch has a "natural rotation" for the cache entries meaning that when we download a new descriptor, it removes every IP that are not listed in the descriptor (for a service ID). But if an IP is re-used we can end up not using it even if it's suppose to work with that new descriptor.

So, what would be a wise expiry time? Here are some I can think of:

  1. Remove the service ID entry (meaning all the IP associated with an onion address) when the descriptor expires (rend_cache_clean). Currently a valid period of 24 hours.

I think that seems right. When we remove the descriptor entirely, we shouldn't remember the IP statuses in it.

  1. When we get a new consensus, we could consider that it indicates a "new era" of connectivity :)
  2. Expire them at each RendPostPeriod default value (what most HS will use?). (1 hour)

We could also expire entries at a different rate based on their failure type. A NACK is much more different than a UNREACHABLE for instance but that's maybe a future enhancement.

In terms of simplicity for now, I'm incline to use 1. especially since the branch in #4862 makes it that the HS only uploads its descriptor when all IPs have been established. We avoid the race of fetching the descriptor before an IP is ready and thus not using it for 24 hours.

  • Removing the timestamp check in rend_cache_store_v2_desc_as_client() makes me worry about replacing a newer descriptor with an older one. Should I be worried about that?

Hrm, I thought it was harmless to remove it because an HSDir doesn't accept an older descriptor but then that does fix the fact that an evil HSDir could simply serve older descriptor non stop and break clients for a long time making them add lots of IPs to their failure cache. Good catch, I'll put it back!

New branch coming soon.

Yaaaay.

comment:8 in reply to:  7 Changed 4 years ago by dgoulet

Status: needs_revisionneeds_review

Replying to nickm:

Replying to dgoulet:

Replying to nickm:

Suggestions:

[snip]

  • Does the new "goto_err" in rend_cache_store_v2_desc_as_client() correctly make us close all the pending connections?

There shouldn't be any pending connections to those intro points because we are parsing the fetched descriptor and making sure the IP entries in there are in the failure cache or not. After validation, if the intro node list is empty it means we can't use any thus we goto err. At that point, no new connections have been established to those intro points nor are already established.

Hm. I think I meant connection_edge connections from the client?

On error, if the store function fails, the edge connection is closed so that extra new goto err is fine. RCS_BADDESC is sent back.

Ok I fixed everything else in branch: bug16389_027_02.

comment:9 Changed 4 years ago by arma

Roger's status: I think we need to ignore past intro point failures if they're from more than 5 minutes ago. Only very recent intro point failures should be remembered.

This is especially important if we merge #4862 too, since we're going to have onion services that go offline for a little while (during which Alice fails to reach it), and then it comes back and re-establishes its intro points, and then Alice tries to visit it again and preemptively gives up because her intro point failure cache says they will all fail

comment:10 Changed 4 years ago by nickm

Keywords: TorCoreTeam201507 added

comment:11 Changed 4 years ago by nickm

dgoulet: do you agree with Roger's comments? Does that make this needs_revision?

comment:12 in reply to:  11 Changed 4 years ago by dgoulet

Keywords: TorCoreTeam201508 added; TorCoreTeam201507 removed
Status: needs_reviewneeds_revision

Replying to nickm:

dgoulet: do you agree with Roger's comments? Does that make this needs_revision?

Yup I do agree, moving this back to needs_revision and will start working on a fix right now.

comment:13 Changed 4 years ago by dgoulet

Status: needs_revisionneeds_review

New branch that expires cache failure entry after 5 minutes. See top two commits for this feature.

Branch: bug16389_027_03

comment:14 Changed 4 years ago by nickm

Resolution: implemented
Status: needs_reviewclosed

Seems okay. Merging!

comment:15 Changed 4 years ago by nickm

Resolution: implemented
Status: closedreopened

Merged, but needs a changes file. I can try to write one later, but it would be cool if dgoulet could give it a try.

comment:16 Changed 4 years ago by arma

Did everybody just assume I picked the perfect number "5 minutes", or has anybody thought through the time period required here?

Basically, any time interval is suboptimal, in that waiting a long time will result in more instances of unhappy users who can't reach their HS, and waiting a short time will result in duplicate attempts.

It seems like we want a nonce in the hidden service descriptor, and in the cache we tag which descriptor we had when we tried it, and then expire -- no, that wouldn't work either, since the hidden service doesn't need to publish a new version of its descriptor if it's just reusing all its old intro points. So there isn't any explicit signal by the HS that it was gone for a while and has now returned so maybe you want to retry.

In sum: with this patch are we still going to have cases where Alice tries to visit a hidden service, finds it unavailable, then the HS goes back online and tells her to try again, and for five minutes she fails?

Maybe this argues for hidden services putting the nonce into their HS descs anyway, and publishing an updated HS desc every time they lose contact with their intro points, to give clients a way to recognize when the HS has acknowledged that things changed? Yuck.

comment:17 Changed 4 years ago by nickm

(added a changes file from dgoulet. Still pondering arma's question on and off)

comment:18 Changed 4 years ago by nickm

Description: modified (diff)
Keywords: TorCoreTeam201508 removed
Milestone: Tor: 0.2.7.x-finalTor: 0.2.8.x-final

comment:19 Changed 4 years ago by dgoulet

Resolution: fixed
Status: reopenedclosed

arma's comment has been moved to ticket #16966 so we can close this one and the parent ticket as well. I see that has an improvement to the current cache design so we can move it to 028 milestone.

comment:20 Changed 4 years ago by dgoulet

Keywords: SponsorR removed
Sponsor: SponsorR
Note: See TracTickets for help on using tickets.