#23466 closed defect (fixed)

hs: v3 client descriptor cache entry should be indexed by blinded key

Reported by: dgoulet Owned by:
Priority: High Milestone: Tor: 0.3.2.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-hs, tor-client, prop224
Cc: Actual Points:
Parent ID: Points:
Reviewer: dgoulet Sponsor:

Description

We currently index hs_cache_client_descriptor_t by service identity key in the hs_cache_v3_client but imagine this scenario (that our mad bug hunter armadev stumbled upon):

Day 0: You fetch the descriptor for service S with rev counter 8 (blinded key: BK0)

Day 1: Computer in suspend mode to tor is sleepy-sleepy.

Day 2: Wake up, and try to access the descriptor of blinded key BK2 so you lookup the cache for service S and get the descriptor you fetched on Day 0 (yes because lifetime is ridiculously long that is 72h). It happens that the revision counter at that time is 5. Because 8 > 5, you won't keep the descriptor and trigger this log info:

    if (cache_entry->desc->plaintext_data.revision_counter >
        client_desc->desc->plaintext_data.revision_counter) {
      log_info(LD_REND, "We already have fresher descriptor. Ignoring.");

... and end up using the descriptor of Day 0 which is just unusable.

There are several issues here:

  1. The lifetime of a descriptor in our cache is just insanely too big. In reality, a descriptor blinded key has a lifetime of 24h maximum so it really doesn't make sense for a client to keep a descriptor more than that in its cache. Furthermore, because the hashring is computed using the consensus valid_after time, we don't need to add some buffer for clock skewed client. Thus, let bring it down to 24h only. We could be clever and put its expire time up to the next time period where the client will fetch a descriptor using a different blinded key.
  1. We need to index the client descriptor cache entries by blinded key which is what the HSDir do.
  1. I kind of want to turn this log_info() into a BUG() because we should never be triggering directory fetches for a blinded key for which we already have a valid descriptor. Rev counter for a specific blinded key can NOT go backward directory side nor client side. And can't be reused either over time.

Child Tickets

Change History (12)

comment:1 Changed 20 months ago by asn

Status: newneeds_review

Hello,

please see branch bug23466 in my repo which implements (a) from above. FWIW, I don't think setting the expiration date to 24 hours will work, since it can still leave the descriptor hanging after the time period is over (and cause roger's bug). So instead, I'm now using the consensus time to set the expiration time to be at the start of the next time period, as you suggested. I also did (c) as part of it since it seemed like a good idea. See the commit msg for more details, please.

I also started implementing approach (b) but it's actually not that easy. Using the blinded key when storing the descriptor is easy. However, using the blinded key to lookup the descriptor is not as easy, since we lookup descriptors in many many situations (see callers of hs_cache_lookup_as_client()) and in many of those situations the blinded key is not readily available which means we have to compute it on the spot (which could cause race conditions if we previously computed it using a different TP/SRV), or carry it in all sorts of ident objects (and others) which is not trivial to implement. Let's start looking into this if (a) fails.

comment:2 Changed 20 months ago by dgoulet

Reviewer: asndgoulet
Status: needs_reviewneeds_revision

Ok, I was able to map your a,b,c notation with my 1,2,3 notation so good :)

I agree with your (a) approach or (1).

Yeah, I had some doubts about the feasibility of using the blinded key client side but it should work out well with a good expiry time.

As for (b), great.

  • We might want to remove this because we have a BUG() now just before:
       log_info(LD_REND, "We already have fresher descriptor. Ignoring.");
  • Why remove "now" from the parameter list? hs_cache_clean_as_client() is called from the main loop with a now() already set. Furthermore, I'm curious how the test can work with a consensus time in 1985 and the clean cache function using approx_time() ? The cache entry will simply get removed all the time no?

comment:3 Changed 20 months ago by asn

Status: needs_revisionmerge_ready

Addressed David's suggestions and pushed new branch at bug23466 in my repo. This is ready for Nick review now. Let me know if you have any questions!

comment:4 Changed 20 months ago by nickm

I'm not sure about the part where we clean the whole cache on every call to hs_cache_lookup_as_client(). Wouldn't it make more sense to do this purge periodically instead, or upon receiving a new networkstatus? Otherwise, the lookup becomes linear. What do you think?

Otherwise this looks okay.

comment:5 Changed 20 months ago by dgoulet

hmmm... cache_clean_v3_as_client() only cleans up expired entries, not the entire cache.

We also do it periodically every 30 minutes through the clean_caches_callback().

The reason to also do it on lookup is because 30 minutes is pretty large so if you have an entry expiring at 13:00 and the next callback is at 13:29, you have this 29 minutes gap in between :S.

I see your point of making our lookup linear now... It's either that or we make our callback periodically kind of faster than that.

Thoughts?

Last edited 20 months ago by dgoulet (previous) (diff)

comment:6 Changed 20 months ago by nickm

I think it's okay to do the callback more frequently, or even just every 15-30 minutes. Just make sure that the lookup function checks for expired entries and doesn't return them?

comment:7 Changed 20 months ago by nickm

Status: merge_readyneeds_revision

comment:8 in reply to:  6 Changed 20 months ago by dgoulet

Replying to nickm:

I think it's okay to do the callback more frequently, or even just every 15-30 minutes. Just make sure that the lookup function checks for expired entries and doesn't return them?

@asn: so basically, it seems that we could simply remove the clean up function from the lookup() so we keep the lookup O(1) and a gap of 30 minutes after the new time period is fine because service will still be reachable with that previous descriptor so it is very acceptable.

And we make the lookup() function not return expired entries.

comment:9 Changed 20 months ago by asn

Status: needs_revisionmerge_ready

ACK guys. Makes sense.

Pushed a fixup on my bug23466 branch that should implement the behavior that dgoulet specified in comment:8.

Putting it on needs_review for now. David merge_ready it when you've ACKed.

comment:10 Changed 20 months ago by asn

Status: merge_readyneeds_review

comment:11 Changed 20 months ago by dgoulet

Status: needs_reviewmerge_ready

ack.

comment:12 Changed 20 months ago by nickm

Resolution: fixed
Status: merge_readyclosed

merging!

Note: See TracTickets for help on using tickets.