Opened 4 years ago
Closed 4 years ago
#13664 closed defect (fixed)
Potential issue with rend cache object when intro points fall to 0.
| Reported by: | dgoulet | Owned by: | |
|---|---|---|---|
| Priority: | Medium | Milestone: | Tor: 0.2.6.x-final |
| Component: | Core Tor/Tor | Version: | |
| Severity: | | Keywords: | tor-hs 025-backport |
| Cc: | | Actual Points: | |
| Parent ID: | | Points: | |
| Reviewer: | | Sponsor: | |
Description
(Reproduced on Tor v0.2.6.1-alpha-dev (git-a142fc29aff4b476))
Here is the use case I was testing. I set up an HS on a remote server for perf analysis. On my client, I made a small script that uses torsocks to open 10 connections to that HS, each on a different circuit (relying on the fact that a unique SOCKS5 user/pass pair gives you a unique circuit).
With the above, one time out of 10 I get all 10 connections to successfully connect and work. The rest of the time, an arbitrary amount of connections fail with "Host unreachable". I feel this is a combination of sometimes luck and sometimes the real issue.
I analyzed this, and my understanding is that the rend cache contains the v2 descriptor with its stored intro points (the "intro_nodes" variable). However, over the course of trying to connect, some intro points may be found unreachable and are thus removed from that list. It also appears that we can remove nodes from that list when closing circuits that were built in "parallel":
    Nov 04 15:36:08.000 [info] rend_client_close_other_intros(): Closing introduction circuit 25 that we built in parallel (Purpose 7).
    Nov 04 15:36:08.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f6f1a171190 for circ_id 2434373038, channel ID 0 (0x7f6f1a0425e0)
    Nov 04 15:36:08.000 [info] circuit_mark_for_close_(): Failed intro circ rejxmpqgho5vqdl4 to $EBE718E1A49EE229071702964F8DB1F318075FF8 (awaiting ack). Removing from descriptor.
circuit_mark_for_close_() triggers an INTRO_POINT_FAILURE_GENERIC failure that removes the intro point from the list. I might be misinterpreting the "we built in parallel" feature, but what I observed is that the intro node list becomes empty at some point, which triggers a "let's refetch that v2 descriptor!" behaviour.
    Nov 04 15:36:08.000 [info] rend_client_report_intro_point_failure(): Unknown service "rejxmpqgho5vqdl4". Re-fetching descriptor.
However, the rend cache is not cleared of the old entry before fetching that new descriptor. So once the v2 descriptor is received, we store it in the cache using "rend_cache_store_v2_desc_as_client()", which prints this:
    Nov 04 15:36:09.000 [info] rend_cache_store_v2_desc_as_client(): We already have this service descriptor rejxmpqgho5vqdl4. [rendezvous-service-descriptor i7hkcux5dghqv6ahstewyccltr6aud2x
So since we "have it" in the cache, we call "rend_client_desc_trynow()", and it completely fails because all intro points in the cache object are gone; this closes all pending connections.
Now, I think this happens because the heuristic for deciding that "we already have the cache object" is simply a comparison of the "desc" string, here in rendcommon.c +1156:
    /* Do we already have this descriptor? */
    if (e && !strcmp(desc, e->desc)) {
      log_info(LD_REND, "We already have this service descriptor %s. [%s]",
               safe_str_client(service_id), desc);
      e->received = time(NULL);
      goto okay;
    }
I think that when the intro point list drops to 0 nodes, we should remove the entry from the cache and trigger the "fetch it again" behaviour.
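As a rough sketch of that idea (not the actual patch): "rend_cache_remove_entry()" below is a hypothetical helper, while "rend_cache_entry_t", "intro_nodes" and "rend_client_refetch_v2_renddesc()" exist in the tree:

    /* Sketch only: when the cached v2 descriptor has no intro points
     * left, evict the entry so the next lookup misses and a fresh
     * descriptor fetch is triggered.  rend_cache_remove_entry() is a
     * hypothetical helper. */
    static void
    evict_and_refetch_if_empty(rend_data_t *rend_query, rend_cache_entry_t *e)
    {
      if (e && e->parsed && smartlist_len(e->parsed->intro_nodes) == 0) {
        log_info(LD_REND, "All intro points for %s are gone; evicting the "
                          "cache entry and re-fetching the descriptor.",
                 safe_str_client(rend_query->onion_address));
        rend_cache_remove_entry(rend_query->onion_address); /* hypothetical */
        rend_client_refetch_v2_renddesc(rend_query);
      }
    }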
Child Tickets
Change History (11)
comment:1 Changed 4 years ago by

| Keywords: | 025-backport added |
|---|---|
| Milestone: | → Tor: 0.2.6.x-final |

Is this something you think you can write the patch for?
comment:2 Changed 4 years ago by
Sure, on my list for tomorrow! I just don't have any more brain cells today after analyzing this issue in depth. :)
comment:3 Changed 4 years ago by
Okay! If this bug exists in 0.2.4 or 0.2.5, and it has a major effect on hidden service reachability, maybe write the patch against maint-0.2.4 or maint-0.2.5 so we can do the backport more easily?
comment:4 Changed 4 years ago by
Ok! I'll make sure to provide you with branches for the backport if any issues arise when merging it. I'll update the ticket if this needs to go into an earlier version.
comment:5 Changed 4 years ago by
Okay. To clarify, the usual procedure would be something like: create a branch based on maint-0.2.4 or maint-0.2.5 called something like "bug13664_025". Then fix the bug there, and only make an extra branch if that branch doesn't merge cleanly into master.
(Not sure if this is what you're saying or not.)
comment:6 Changed 4 years ago by
Patch in my user repo. Branch: bug13664_025_v1
Notice that I made the rend cache lookup function use the specific version number given as an argument, since every call site was using -1 and only working on v2 descriptors. It seems appropriate for better integration and a possible v3 version. If you feel uncomfortable with that, I can remove it for now and just keep the rend_cache_build_key change that is needed.
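For illustration, a hedged sketch of what a version-aware lookup could look like; the exact names and signatures in the bug13664_025_v1 branch may differ, and only the "<version><service-id>" key format is taken from the existing cache code:

    /* Sketch, not the actual branch: look up a cached rendezvous
     * descriptor for an explicit descriptor version instead of the old
     * "-1 means any" convention.  Only v2 descriptors are cached today;
     * a future v3 cache could reuse the same keying scheme. */
    static int
    rend_cache_lookup_versioned(const char *service_id, int version,
                                rend_cache_entry_t **entry_out)
    {
      char key[REND_SERVICE_ID_LEN_BASE32 + 2]; /* version digit + id + NUL */

      if (version != 2)
        return -1;
      tor_snprintf(key, sizeof(key), "%d%s", version, service_id);
      *entry_out = strmap_get_lc(rend_cache, key);
      return *entry_out ? 0 : -1;
    }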
comment:7 Changed 4 years ago by
Replying to dgoulet:
(Reproduced on Tor v0.2.6.1-alpha-dev (git-a142fc29aff4b476))
<snip>
However, the rend cache is not cleared of the old entry before fetching that new descriptor. So once the v2 descriptor is received, we store it in the cache using "rend_cache_store_v2_desc_as_client()" that prints this:
    Nov 04 15:36:09.000 [info] rend_cache_store_v2_desc_as_client(): We already have this service descriptor rejxmpqgho5vqdl4. [rendezvous-service-descriptor i7hkcux5dghqv6ahstewyccltr6aud2x

So since we "have it" in the cache, we call "rend_client_desc_trynow()", and it completely fails because all intro points in the cache object are gone; this closes all pending connections.
Hello, three questions to better understand the issue:
a) This behavior only triggers when all the IPs are down or found unreachable, right?
b) How does Tor finally escape this loop in the current codebase? Since no IPs are reachable and no new descriptor can be fetched?
c) IIUC, the "desc" in "if (e && !strcmp(desc, e->desc)) {" is the whole descriptor as a string. So if that check succeeds, we already have the exact same descriptor? How would it help if we kept that same descriptor, since all the IPs listed in there are apparently unreachable? Would it just try to reconnect to one of those IPs, in case the connection succeeds?
Two stupid comments on code style:

a) "NULL byte" is usually referenced as "NUL" in Tor code. AFAIK, that's the official ASCII name of "\0"; "NULL" is the C macro.

b) Function arguments that are supposed to be filled with stuff are usually named "something_out" in Tor code. Maybe "rend_cache_build_key" should take "key_out"/"key_len_out" instead of "key"/"len"?
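To make the suggested convention concrete, a hypothetical signature for "rend_cache_build_key" could look like this (the body is illustrative, not the patch's actual code):

    /* Sketch of the "_out" naming convention for out-parameters: builds
     * a cache key of the form "<version><service-id>" and hands it back
     * through key_out/key_len_out; the caller frees *key_out.
     * Illustrative only. */
    static void
    rend_cache_build_key(const char *service_id, int version,
                         char **key_out, size_t *key_len_out)
    {
      tor_asprintf(key_out, "%d%s", version, service_id);
      *key_len_out = strlen(*key_out);
    }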
comment:8 Changed 4 years ago by
| Status: | new → needs_review |
|---|---|
comment:9 Changed 4 years ago by
I agree with asn; I don't understand this solution. If the intro points are actually down, it looks like we'd try each intro point once for each directory mirror, re-fetching exactly the same descriptor every time.
Re-reading your original description of the problem:
However, over the course of trying to connect, some intro points may be found unreachable and are thus removed from that list. It also appears that we can remove nodes from that list when closing circuits that were built in "parallel":

circuit_mark_for_close_() triggers an INTRO_POINT_FAILURE_GENERIC failure that removes the intro point from the list. I might be misinterpreting the "we built in parallel" feature, but what I observed is that the intro node list becomes empty at some point, which triggers a "let's refetch that v2 descriptor!" behaviour.
It sounds like the problem is intro points being removed from the list when they're not actually down. The descriptor re-fetching is meant to handle a different case, where our descriptor might be outdated.
I think the solution is to retry introduction points that fail "non-authoritatively" (i.e. as a result of network conditions other than the IP itself). Have I understood correctly?
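A minimal sketch of that idea, using a hypothetical failure-reason enum (the real fix eventually landed via #13698/#13699 and may differ):

    /* Hypothetical sketch: classify intro point failures so that only
     * an authoritative failure (e.g. a NACK from the intro point itself)
     * removes the node from the cached descriptor; transient failures,
     * such as a parallel circuit being closed, lead to a retry instead. */
    typedef enum {
      INTRO_FAILURE_NACK,     /* intro point explicitly refused us */
      INTRO_FAILURE_TIMEOUT,  /* circuit build or introduction timed out */
      INTRO_FAILURE_GENERIC,  /* anything else, e.g. a parallel close */
    } intro_failure_reason_t;

    static int
    should_remove_intro_point(intro_failure_reason_t reason)
    {
      /* Only a NACK is authoritative evidence that the IP does not serve
       * this service; everything else is worth retrying. */
      return reason == INTRO_FAILURE_NACK;
    }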
comment:10 Changed 4 years ago by
Looking more in depth, this is caused by multiple issues, tracked here: #13698, #13699.
I'm still going to answer @asn's questions, but this bug will probably be resolved by the above.
a) This behavior only triggers when all the IPs are down or found unreachable, right?
Bug #13698 explains how the intro points are flagged unreachable, but that logic is actually really wrong.
b) How does Tor finally escape this loop in the current codebase? Since no IPs are reachable and no new descriptor can be fetched?
That I haven't quite fully figured out yet. One possibility is that the rend cache object times out after REND_CACHE_MAX_AGE (2*24*60*60 = 172800 seconds, i.e. two days), but that's way too long for a client to be unable to reach an HS without a lot of people screaming on IRC/email.
The second thing I observed is that before all the intro points are removed from the cache, some connections actually succeed! The next connection then appears to reuse that circuit. However, once that circuit closes, all bets are off as far as I can tell, meaning the specific HS will not be reachable until the cache object is removed and a new one is fetched.
    Nov 07 10:45:40.000 [info] pathbias_count_use_attempt(): Used circuit 48 is already in path state use succeeded. Circuit is a Hidden service client: Active rendezvous point currently open.
c) IIUC, the "desc" in "if (e && !strcmp(desc, e->desc)) {" is the whole descriptor as a string. So if that check succeeds, we already have the exact same descriptor? How would it help if we kept that same descriptor, since all the IPs listed in there are apparently unreachable? Would it just try to reconnect to one of those IPs, in case the connection succeeds?
The problem is that we refetch the descriptor, parse all the intro points, and at the end we just check whether the "desc" string is the same and bail out if it is. Thus the cache object remains useless, even though the current code could notice that the intro point list is empty. You can see this behaviour in "rend_cache_store_v2_desc_as_client()".
FYI, that's now bug #13699.
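A hedged sketch of what a fix in that store path could look like (illustrative only; the actual #13699 fix may differ):

    /* Sketch: in rend_cache_store_v2_desc_as_client(), only treat a
     * byte-identical descriptor as "already have it" while the cached
     * copy still has usable intro points; otherwise fall through and
     * overwrite the entry so the parsed intro_nodes list is repopulated.
     * Illustrative only -- not the actual #13699 fix. */
    if (e && !strcmp(desc, e->desc) &&
        smartlist_len(e->parsed->intro_nodes) > 0) {
      log_info(LD_REND, "We already have this service descriptor %s.",
               safe_str_client(service_id));
      e->received = time(NULL);
      goto okay;
    }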
Also, thanks for the code style pointers!
comment:11 Changed 4 years ago by
| Resolution: | → fixed |
|---|---|
| Status: | needs_review → closed |