Opened 4 years ago

Closed 4 years ago

#13664 closed defect (fixed)

Potential issue with rend cache object when intro points fall to 0.

Reported by: dgoulet Owned by:
Priority: Medium Milestone: Tor: 0.2.6.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-hs 025-backport
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

(Reproduced on Tor v0.2.6.1-alpha-dev (git-a142fc29aff4b476))

Here is the use case I was testing. I set up an HS on a remote server for perf analysis. On my client, I made a small script that uses torsocks to open 10 connections to that HS, each on a different circuit (relying on the fact that a unique SOCKS5 user/pass pair gets its own circuit).

With the above, one time out of 10 I get all 10 connections to connect and work successfully. The rest of the time I get an arbitrary number of connections failing with "Host unreachable". I suspect this is a combination of luck some of the time and the real issue the rest of the time.

I analyzed this, and my understanding is that the rend cache contains the v2 descriptor with its stored intro points (the "intro_nodes" list). However, over the cycle of connection attempts, some intro points may be unreachable and thus get removed from that list. It also appears that we can remove nodes from that list when closing circuits that were built in "parallel":

Nov 04 15:36:08.000 [info] rend_client_close_other_intros(): Closing introduction circuit 25 that we built in parallel (Purpose 7).
Nov 04 15:36:08.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f6f1a171190 for circ_id 2434373038, channel ID 0 (0x7f6f1a0425e0)
Nov 04 15:36:08.000 [info] circuit_mark_for_close_(): Failed intro circ rejxmpqgho5vqdl4 to $EBE718E1A49EE229071702964F8DB1F318075FF8 (awaiting ack). Removing from descriptor.

circuit_mark_for_close_() triggers an INTRO_POINT_FAILURE_GENERIC failure that removes the intro point from the list. I might be misinterpreting the "we built in parallel" feature, but what I can observe is that the intro node list becomes empty at some point, which triggers a "let's re-fetch that v2 descriptor!" behaviour.

Nov 04 15:36:08.000 [info] rend_client_report_intro_point_failure(): Unknown service "rejxmpqgho5vqdl4". Re-fetching descriptor.
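
Roughly, the failure path as I read it is the following (paraphrased sketch, not the actual rendclient.c code; the function name is mine):

  /* Paraphrased sketch of the failure path -- not the actual Tor code. */
  static void
  report_intro_failure_sketch(rend_cache_entry_t *ent, const char *ip_digest)
  {
    /* 1. Drop the failed intro point from the cached descriptor. */
    SMARTLIST_FOREACH_BEGIN(ent->parsed->intro_nodes,
                            rend_intro_point_t *, ip) {
      if (tor_memeq(ip->extend_info->identity_digest, ip_digest, DIGEST_LEN)) {
        rend_intro_point_free(ip);
        SMARTLIST_DEL_CURRENT(ent->parsed->intro_nodes, ip);
        break;
      }
    } SMARTLIST_FOREACH_END(ip);

    /* 2. Once the list is empty, a descriptor re-fetch is triggered
     *    ("Unknown service ... Re-fetching descriptor."), but the stale
     *    cache entry itself is never removed. */
    if (smartlist_len(ent->parsed->intro_nodes) == 0) {
      /* re-fetch of the v2 descriptor happens here */
    }
  }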

However, the rend cache is not cleared of the old entry before fetching that new descriptor. So once the new v2 descriptor is received, we store it in the cache using "rend_cache_store_v2_desc_as_client()", which prints this:

Nov 04 15:36:09.000 [info] rend_cache_store_v2_desc_as_client(): We already have this service descriptor rejxmpqgho5vqdl4. [rendezvous-service-descriptor i7hkcux5dghqv6ahstewyccltr6aud2x

So since we "have it" in the cache, we call "rend_client_desc_trynow()", and it completely fails because all intro points in the cache object are gone, so this closes all pending connections.

Now, I think this happens because the heuristic for deciding whether "we already have the cache object" is just a comparison of the "desc" string, here in rendcommon.c +1156:

  /* Do we already have this descriptor? */
  if (e && !strcmp(desc, e->desc)) {
    log_info(LD_REND,"We already have this service descriptor %s. [%s]",
             safe_str_client(service_id), desc);
    e->received = time(NULL);
    goto okay;
  }

I think that when the intro point list drops to 0 nodes, we should remove the entry from the cache and trigger the "fetch it again" behaviour.
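
Something along these lines is what I have in mind (untested sketch, not a patch; the eviction helper name is made up):

  /* Untested sketch: once an entry's intro point list goes empty, evict it
   * so the next fetched descriptor is stored as a fresh entry.
   * rend_cache_remove_entry() is a hypothetical helper name. */
  static void
  maybe_evict_empty_entry(rend_cache_entry_t *ent, rend_data_t *rend_query,
                          const char *service_id)
  {
    if (smartlist_len(ent->parsed->intro_nodes) > 0)
      return;
    log_info(LD_REND, "All intro points for %s are gone; dropping the stale "
             "cache entry before re-fetching the descriptor.",
             safe_str_client(service_id));
    rend_cache_remove_entry(service_id);          /* hypothetical helper */
    rend_client_refetch_v2_renddesc(rend_query);  /* existing re-fetch path */
  }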

Child Tickets

Change History (11)

comment:1 Changed 4 years ago by nickm

Keywords: 025-backport added
Milestone: Tor: 0.2.6.x-final

Is this something you think you can write the patch for?

comment:2 Changed 4 years ago by dgoulet

Sure, on my list for tomorrow! I just don't have any more brain cells today after analyzing this issue in depth. :)

comment:3 Changed 4 years ago by nickm

Okay! If this bug exists in 0.2.4 or 0.2.5, and it has a major effect on hidden service reachability, maybe write the patch against maint-0.2.4 or maint-0.2.5 so we can do the backport more easily?

comment:4 Changed 4 years ago by dgoulet

Ok! I'll make sure to provide you branches for the backport if any issues arise when merging it. I'll update the ticket if this needs to go into an earlier version.

comment:5 Changed 4 years ago by nickm

Okay. To clarify, the usual procedure would be something like: create a branch based on maint-0.2.4 or maint-0.2.5 called something like "bug13664_025". Then fix the bug there, and only make an extra branch if that branch doesn't merge cleanly into master.

(Not sure if this is what you're saying or not.)

comment:6 Changed 4 years ago by dgoulet

Patch in my user repo. Branch: bug13664_025_v1

Notice that I made the rend cache lookup function use the specific version number given as an argument, since every call site was using -1 and only working on v2 descriptors. It seems appropriate to do that for better integration and a possible v3 version. If you feel uncomfortable with that, I can remove it for now and just keep the rend_cache_build_key helper that is needed.
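
For illustration, the shape of the lookup change is roughly this (paraphrased, not the literal diff from the branch):

  /* Paraphrased sketch, not the literal diff from bug13664_025_v1.
   * Callers now pass the descriptor version they actually want (2 for
   * current hidden service descriptors) instead of the magic -1. */
  int rend_cache_lookup_entry(const char *query, int version,
                              rend_cache_entry_t **e);

  /* A call site then does something like (onion_address being whatever
   * query string the caller already has; return value checked as before): */
  rend_cache_entry_t *entry = NULL;
  rend_cache_lookup_entry(onion_address, 2, &entry);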

comment:7 in reply to:  description Changed 4 years ago by asn

Replying to dgoulet:

(Reproduced on Tor v0.2.6.1-alpha-dev (git-a142fc29aff4b476))

<snip>

However, the rend cache is not cleared of the old entry before fetching that new descriptor. So once the new v2 descriptor is received, we store it in the cache using "rend_cache_store_v2_desc_as_client()", which prints this:

Nov 04 15:36:09.000 [info] rend_cache_store_v2_desc_as_client(): We already have this service descriptor rejxmpqgho5vqdl4. [rendezvous-service-descriptor i7hkcux5dghqv6ahstewyccltr6aud2x

So since we "have it" in the cache, we call "rend_client_desc_trynow()", and it completely fails because all intro points in the cache object are gone, so this closes all pending connections.

Hello,

three questions to better understand the issue:
a) This behavior only triggers when all the IPs are down or found unreachable, right?

b) How does Tor finally escape this loop in the current codebase? Since no IPs are reachable and no new descriptor can be fetched?

c) IIUC, the desc in if (e && !strcmp(desc, e->desc)) { is the whole descriptor as a string. So if that check succeeds, we already have the exact same descriptor? How would it help if we kept that same descriptor, since all the IPs listed in there are apparently unreachable? Would it just try to reconnect to one of those IPs, in case the connection succeeds?

Three stupid comments on code style:
a) NULL byte is usually referenced as NUL in Tor code. AFAIK, that's the official ASCII name of \0. NULL is the C macro.

b) Function arguments that are supposed to be filled with stuff are usually named something_out in Tor code. Maybe rend_cache_build_key should take key_out/key_len_out instead of key/len?
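
i.e. something like this (hypothetical signature, only to illustrate the naming convention, not the code from the branch):

  void
  rend_cache_build_key(int version, const char *service_id,
                       char *key_out, size_t *key_len_out);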

comment:8 Changed 4 years ago by nickm

Status: new → needs_review

comment:9 Changed 4 years ago by special

I agree with asn; I don't understand this solution. If the intro points are actually down, it looks like we'd try each intro point once for each directory mirror, re-fetching exactly the same descriptor every time.

Re-reading your original description of the problem:

However, over the cycle of connection attempts, some intro points may be unreachable and thus get removed from that list. It also appears that we can remove nodes from that list when closing circuits that were built in "parallel":

circuit_mark_for_close_() triggers an INTRO_POINT_FAILURE_GENERIC failure that removes the intro point from the list. I might be misinterpreting the "we built in parallel" feature, but what I can observe is that the intro node list becomes empty at some point, which triggers a "let's re-fetch that v2 descriptor!" behaviour.

It sounds like the problem is intro points being removed from the list when they're not actually down. The descriptor re-fetching is meant to handle a different case, where our descriptor might be outdated.

I think the solution is to retry introduction points that fail "non-authoritatively" (i.e. as a result of network conditions other than the IP itself). Have I understood correctly?
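
Something in this direction is what I mean (rough pseudocode, not a patch; INTRO_POINT_FAILURE_GENERIC is the value from the log above, while the other failure kinds and the retry counter field are made up for illustration):

  /* Rough pseudocode -- not a patch. */
  switch (failure_type) {
    case INTRO_POINT_FAILURE_GENERIC:
      /* The circuit died for some local or network reason ("non-
       * authoritatively"): keep the intro point in intro_nodes so a later
       * attempt can retry it. */
      break;
    default:
      /* The intro point itself looks bad: count the failure and only drop
       * it from the descriptor after a few strikes. */
      if (++ip->unreachable_count >= 3) {   /* hypothetical field */
        rend_intro_point_free(ip);
        smartlist_remove(ent->parsed->intro_nodes, ip);
      }
      break;
  }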

comment:10 Changed 4 years ago by dgoulet

Looking more in depth, this is caused by multiple issues, tracked here: #13698, #13699

I'm still going to answer asn's questions, but this bug will probably be resolved by the above.

a) This behavior only triggers when all the IPs are down or found unreachable, right?

Bug #13698 explains how the intro points are flagged unreachable, but that logic is actually really wrong.

b) How does Tor finally escape this loop in the current codebase? Since no IPs are reachable and no new descriptor can be fetched?

That I haven't quite fully figured out yet. One possibility is that the rend cache object times out after REND_CACHE_MAX_AGE (2*24*60*60 = 172800 seconds, i.e. 48 hours), but that's way too long for a client to be unable to reach an HS without a lot of people screaming on IRC/email.

The second thing I observed is that before all the intro points are removed from the cache, some connections actually succeed! So the next connection looks like it uses that circuit.

However, when that circuit closes, all bets are off as far as I can tell, meaning that specific HS will not be reachable until the cache object is removed and a new one is fetched.

Nov 07 10:45:40.000 [info] pathbias_count_use_attempt(): Used circuit 48 is already in path state use succeeded. Circuit is a Hidden service client: Active rendezvous point currently open.

c) IIUC, the desc in if (e && !strcmp(desc, e->desc)) { is the whole descriptor as a string. So if that check succeeds, we already have the exact same descriptor? How would it help if we kept that same descriptor, since all the IPs listed in there are apparently unreachable? Would it just try to reconnect to one of those IPs, in case the connection succeeds?

The problem is that we re-fetch the descriptor, parse all the intro points, and at the end we just check whether the desc string is the same and bail out if it is. Thus the cache object stays useless, even though the current code could notice that its intro point list is empty. You can see this behaviour in rend_cache_store_v2_desc_as_client().

FYI, that's now bug #13699.
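
For illustration, the kind of check I mean at store time would be something like this (sketch only, not the actual #13699 patch):

  /* Sketch only: don't bail out on a matching desc string if the cached
   * copy has lost all of its intro points; fall through and overwrite the
   * stale entry with the freshly parsed one instead. */
  if (e && !strcmp(desc, e->desc) &&
      smartlist_len(e->parsed->intro_nodes) > 0) {
    log_info(LD_REND, "We already have this service descriptor %s.",
             safe_str_client(service_id));
    e->received = time(NULL);
    goto okay;
  }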

Also, thanks for the code style pointers!

comment:11 Changed 4 years ago by nickm

Resolution: fixed
Status: needs_review → closed

Closing this because #13698 is marked for backport and merged into 0.2.6, whereas #13699 is closed as invalid.
