Opened 2 years ago

Closed 2 years ago

#23621 closed defect (duplicate)

prop224: Missing tons of mds over time with a lurking client

Reported by: asn
Owned by: dgoulet
Priority: Medium
Milestone: Tor: 0.3.3.x-final
Component: Core Tor/Tor
Version: Tor: 0.3.2.1-alpha
Severity: Normal
Keywords: prop224
Cc: mikeperry, arma
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description (last modified by dgoulet)

As Roger described in comment:1:ticket:23543, Tor will slowly expire its cached microdescriptors (mds). Tor will start fetching mds again once it sees some client activity.

In the case of a prop224 HS client with a single rend circ open (e.g. to an ircd, or downloading something) Tor will consider itself inactive and will not fetch any mds during that time.

After 1-2 days of HS client operation, we will be missing about 2000 mds. If at that point we need to do another HS operation we will either stall if we are missing too many mds:

hs_client_refetch_hsdesc(): Can't fetch descriptor for service MUvEXhuYFWdElDqoHOeXPqi8TcKfLxeDKkYUejwLgt0 because we dont have enough descriptors. Stalling connection.

or will try to fetch a descriptor with some missing mds. This can result in delays and/or reachability issues.

That's not a problem for v2, since all the info v2 needs is in the consensus and it does not need any mds to connect to HSes.

Perhaps it's good for us to periodically keep fetching mds if we have long-lasting active rend circs. (Sorta related to #23543 but not exactly).

Attachments (1)

fetching_descs.log.gz (1.8 KB) - added by asn 2 years ago.

Change History (17)

comment:1 Changed 2 years ago by dgoulet

Description: modified (diff)

comment:2 Changed 2 years ago by nickm

Status: new → needs_review

Have a look at bug23621_032 -- does it make sense to you as a fix for this?

comment:3 Changed 2 years ago by dgoulet

Cc: mikeperry arma added

comment:4 Changed 2 years ago by dgoulet

Status: needs_review → needs_revision

I'll take over the patch to add a torrc option to control the client usage delay.

comment:5 Changed 2 years ago by nickm

Owner: set to dgoulet
Status: needs_revision → assigned

setting owner

comment:6 Changed 2 years ago by nickm

Status: assigned → needs_revision

comment:7 Changed 2 years ago by dgoulet

Status: needs_revision → needs_review

See branch: bug23621_032_01

Notice the change in rep_hist_client_dormant(), which returns true if the client's last-used timestamp is unset.
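
A minimal sketch of the dormancy check described above, written in Python rather than Tor's C for brevity. The names client_is_dormant and usage_delay, and the one-hour default, are illustrative assumptions and not the code in the branch.

```python
from typing import Optional

ONE_HOUR = 60 * 60  # assumed default delay, in seconds


def client_is_dormant(last_used: Optional[float], now: float,
                      usage_delay: float = ONE_HOUR) -> bool:
    """Return True if the client should be treated as dormant."""
    if last_used is None:
        # No recorded client activity at all: treat as dormant.
        return True
    # Dormant once the last activity is older than the allowed delay.
    return (now - last_used) > usage_delay
```

With this shape, a client that has never recorded any activity is dormant immediately, and an active client becomes dormant only after usage_delay seconds without activity.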

comment:8 Changed 2 years ago by arma

I'm not yet convinced that this is a good idea.

The circuit-building-dormancy thing makes us not fetch descriptors if we haven't needed descriptors for a long time. That's a feature.

I guess the theory here is that if we have an open conn, then we are likely to still need descriptors at any moment.

But the counter to that theory is that if we haven't needed descriptors for hours, it could well be more hours until we need them.

comment:9 in reply to: description Changed 2 years ago by arma

Replying to asn:

After 1-2 days of HS client operation, we will be missing about 2000 mds. If at that point we need to do another HS operation [...] This can result in delays and/or reachability issues.

This sounds like a bug worth exploring and fixing. It will be a bug -- even if we do the fix proposed in this ticket -- for an actually idle Tor client that then tries to access a v3 onion service as its first action.

In theory in this situation it should notice that it needs to fetch microdescs, and do it, and then let the various circuits proceed. If that process is failing somehow, we should find out where and fix it.

comment:10 Changed 2 years ago by arma

The ClientUsageDelay feature here, when set to exactly 1 hour, will have interesting side-channel leaks. For example, if you go to a directory guard to fetch new microdescs, it knows that there's a good chance you had been idle but you just sent something on an existing circuit to your primary guard.

Mike worked hard to remove those side channels in the #17592 fix (see e.g. commit d5a151a0).

comment:11 in reply to: 9 Changed 2 years ago by asn

Status: needs_review → needs_information

Replying to arma:

Replying to asn:

After 1-2 days of HS client operation, we will be missing about 2000 mds. If at that point we need to do another HS operation [...] This can result in delays and/or reachability issues.

This sounds like a bug worth exploring and fixing. It will be a bug -- even if we do the fix proposed in this ticket -- for an actually idle Tor client that then tries to access a v3 onion service as its first action.

Based on Roger's comments, let's put this in needs_info until we learn more about this issue or figure out a better approach.

In theory in this situation it should notice that it needs to fetch microdescs, and do it, and then let the various circuits proceed. If that process is failing somehow, we should find out where and fix it.

BTW, I think prop224 clients will actually do the above suggestion since we require router_have_minimum_dir_info() in hs_client_refetch_hsdesc(). That requires about 80% of the dirinfo in my testing. Tor clients will block until they get those mds if this behavior occurs (this can take up to 10 mins based on our IRC logs).
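
The gating described here can be sketched roughly as follows. This is a simplified Python illustration, not Tor's implementation: it assumes the check is a plain fraction of microdescriptors present, uses the roughly 80% figure observed above as the threshold, and the function names are hypothetical stand-ins for router_have_minimum_dir_info() and the decision in hs_client_refetch_hsdesc().

```python
def have_minimum_dir_info(mds_present: int, mds_in_consensus: int,
                          required_fraction: float = 0.8) -> bool:
    """Assumed simplification: compare the share of mds we hold to a threshold."""
    if mds_in_consensus <= 0:
        # Without a consensus there is nothing to measure against.
        return False
    return mds_present / mds_in_consensus >= required_fraction


def refetch_hsdesc_action(mds_present: int, mds_in_consensus: int) -> str:
    """Return "stall" when dirinfo is insufficient, otherwise "fetch"."""
    if not have_minimum_dir_info(mds_present, mds_in_consensus):
        # Corresponds to the "Stalling connection." log message.
        return "stall"
    return "fetch"
```

Under these assumptions, a client below the threshold blocks the descriptor fetch until enough mds arrive, which is the stall seen in the description.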

comment:12 Changed 2 years ago by dgoulet

Status: needs_information → needs_revision

After discussing this with armadev on IRC, a possible avenue for fixing this is to investigate why it can take us up to 10 minutes to reach min dirinfo.

This ties into the global problem that prop224 needs a huge chunk of the total descriptors to function properly...

comment:13 Changed 2 years ago by arma

Yes, I'm even more convinced that the fix here is not to scale back the dormancy feature.

The design problem here is that v3 onion service connections are brittle when it comes to missing some microdescs. And this brittleness appears to be exposing bugs in how quickly we fetch all the missing microdescs when we want to (re)bootstrap. Identifying and fixing those bugs is a fine direction to explore.

comment:14 in reply to: 12 Changed 2 years ago by asn

Replying to dgoulet:

After discussing this with armadev on IRC, a possible avenue for fixing this is to investigate why it can take us up to 10 minutes to reach min dirinfo.

I started this investigation to see how fast we receive mds when we ask for them. The results are very good! When we fetch mds, we always receive them super-fast (within seconds). I attach a log from my prop224 HS client which demonstrates this.

Snippet:

Sep 26 01:25:56.000 [info] launch_descriptor_downloads(): Launching 3 requests for 994 microdescs, 332 at a time
Sep 26 01:25:57.000 [info] handle_response_fetch_microdesc(): Received answer to microdescriptor request (status 200, body size 156215) from server '137.205.124.35:1720'
Sep 26 01:25:57.000 [info] handle_response_fetch_microdesc(): Received answer to microdescriptor request (status 200, body size 174579) from server '213.136.81.89:9001'
Sep 26 01:25:57.000 [info] handle_response_fetch_microdesc(): Received answer to microdescriptor request (status 200, body size 155635) from server '213.136.81.89:9001'
Sep 26 07:17:57.000 [info] launch_descriptor_downloads(): Launching 3 requests for 179 microdescs, 60 at a time
Sep 26 07:17:57.000 [info] handle_response_fetch_microdesc(): Received answer to microdescriptor request (status 200, body size 25721) from server '188.165.220.21:9001'
Sep 26 07:17:57.000 [info] handle_response_fetch_microdesc(): Received answer to microdescriptor request (status 200, body size 25758) from server '188.165.220.21:9001'
Sep 26 07:17:57.000 [info] handle_response_fetch_microdesc(): Received answer to microdescriptor request (status 200, body size 29796) from server '188.165.220.21:9001'
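
The batch sizes in the log lines above (994 mds as 3 requests of 332, and 179 mds as 3 requests of 60) are consistent with a ceiling division of the missing count across the requests. A minimal Python sketch of that arithmetic, assuming a fixed count of three requests; the function name is hypothetical and any per-request cap Tor may apply is not modelled:

```python
def per_request_batch(n_missing: int, n_requests: int = 3) -> int:
    """Ceiling-divide the missing microdescs across n_requests requests."""
    if n_requests <= 0:
        raise ValueError("n_requests must be positive")
    # Integer ceiling division without floating point.
    return -(-n_missing // n_requests)


print(per_request_batch(994))  # 332
print(per_request_batch(179))  # 60
```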

I guess the next step here is to investigate whether our triggers for fetching mds are correct, and how they interact with prop224.

Changed 2 years ago by asn

Attachment: fetching_descs.log.gz added

comment:15 Changed 2 years ago by asn

Status: needs_revision → needs_information

comment:16 Changed 2 years ago by dgoulet

Milestone: Tor: 0.3.2.x-final → Tor: 0.3.3.x-final
Resolution: duplicate
Status: needs_information → closed

I think this is being investigated and hopefully fixed with #21969 and friends. I'm going to close this one because we know it is a problem now and this ticket doesn't do us much good since we agree that it is a tor problem and not an HS problem (if any).

Feel free to re-open if needed.
