Opened 3 months ago

Closed 7 days ago

#25705 closed defect (implemented)

Refactor circuit_build_failed to separate build vs path failures

Reported by: mikeperry Owned by:
Priority: Medium Milestone: Tor: 0.3.3.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: 033-backport
Cc: arma, asn Actual Points:
Parent ID: #25546 Points:
Reviewer: asn Sponsor: SponsorV-can

Description

We should not give up on the network, our TLS conn, or our guard in the event of path failures (which can happen if we're low on mds, and/or if the user set a bunch of path-restricting torrc options).

I think this might want to be a backport. It should also handle #25347.

Child Tickets

Change History (13)

comment:1 Changed 3 months ago by mikeperry

Status: new → needs_review

Ok, I made the smallest possible change here. I just moved the path length check up from the guard failure block to the beginning of the function. The only thing that wasn't build failure accounting or node/network blaming was the HS RP retry code, which I also moved up.
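Roughly, the reordering looks like this. This is a minimal, self-contained sketch with made-up fields (desired_path_len, built_path_len) and simplified logic, not the actual diff:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical, simplified circuit state; the real
     * origin_circuit_t in Tor carries much more. */
    typedef struct {
      int desired_path_len;  /* hops the path was supposed to have */
      int built_path_len;    /* hops actually chosen so far */
    } circuit_t;

    static int n_circuit_failures = 0;

    /* Path failures bail out at the top, before any of the
     * guard/network blame accounting can run. */
    static void
    circuit_build_failed(circuit_t *circ)
    {
      /* Moved up: a short path means we never finished *choosing*
       * the path (too few microdescriptors, or restrictive torrc
       * options), so neither the guard nor the network is at
       * fault. */
      if (circ->built_path_len < circ->desired_path_len) {
        printf("Path selection failed; not blaming anyone.\n");
        return;
      }

      /* Only genuine build failures reach the accounting below. */
      n_circuit_failures++;
      printf("Build failure #%d; guard/network blame would follow.\n",
             n_circuit_failures);
    }

    int main(void) {
      circuit_t path_fail = { 3, 1 };
      circuit_t build_fail = { 3, 3 };
      circuit_build_failed(&path_fail);   /* bails early */
      circuit_build_failed(&build_fail);  /* counted as a failure */
      return 0;
    }

Because the early return happens before n_circuit_failures is touched, path failures also never count toward the circuit-failure throttle discussed later in this ticket.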

I also removed the destroy check, as per #25347, as a separate commit. We can decide which one we like better. On the one hand, asn's branch has more checks and might be safer. On the other hand, this is a smaller change, and keeps everything related to counting circuit failures in the same place.

mikeperry/bug25705

comment:2 in reply to:  1 Changed 3 months ago by asn

Status: needs_review → needs_revision

Replying to mikeperry:

    Ok, I made the smallest possible change here. I just moved the path length check up from the guard failure block to the beginning of the function. The only thing that wasn't build failure accounting or node/network blaming was the HS RP retry code, which I also moved up.

    I also removed the destroy check, as per #25347, as a separate commit. We can decide which one we like better. On the one hand, asn's branch has more checks and might be safer. On the other hand, this is a smaller change, and keeps everything related to counting circuit failures in the same place.

    mikeperry/bug25705

Thanks for the patch here, Mike! A GitHub PR would be helpful so that I could comment inline, but I'll just do it here for now:

  • 4c64811d0b: Looks reasonable to me. I find it kinda naughty that we special-case S_CONNECT_REND twice in that function, but I didn't find a straightforward way to improve this.
  • a2d44546c2: A few things here:
      • Shouldn't we do the already_marked computation even when circ->base_.received_destroy is set, so that we don't double-mark? IIUC, the only reason we still distinguish between receiving a DESTROY or not is so that we can do better logging. That also has the potential to simplify the code a bit; ideally that whole block should go into its own function (see the sketch after this list).
      • Do we want to mark the guard as bad for any DESTROY reason? Is there a reason we wouldn't do that? Do we remember our old rationale here? It seems the "don't do anything if DESTROY" logic was introduced by Roger in 5de88dda0acda6914bacd1e14c2e037798c5d9d9.
      • Do we want to merge this patch without taking the precautions of Roger's points G and E from #25347?
      • The code needed a parenthesis to compile. We should test this new code; it seems ultra crucial!
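To illustrate the refactor suggested in the first sub-bullet, here is a minimal, self-contained sketch with hypothetical types and names (guard_t, blame_guard_for_failure); it is not Tor's actual code, just the shape of computing already_marked on both paths and keeping the DESTROY distinction for logging only:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical, simplified types; the real structs in Tor carry
     * far more state. */
    typedef struct { bool is_marked_bad; } guard_t;
    typedef struct {
      bool received_destroy;
      guard_t *guard;
    } circuit_t;

    /* Sketch: pull the whole blame block into one helper, compute
     * already_marked on both paths so a guard is never double-marked,
     * and keep the DESTROY distinction only for logging. */
    static void
    blame_guard_for_failure(circuit_t *circ)
    {
      bool already_marked = circ->guard->is_marked_bad;
      if (!already_marked)
        circ->guard->is_marked_bad = true;

      if (circ->received_destroy)
        printf("Guard relayed a DESTROY cell%s.\n",
               already_marked ? " (already marked)" : "");
      else
        printf("Circuit died at the first hop%s.\n",
               already_marked ? " (already marked)" : "");
    }

    int main(void) {
      guard_t guard = { false };
      circuit_t c1 = { true, &guard }, c2 = { false, &guard };
      blame_guard_for_failure(&c1); /* marks the guard */
      blame_guard_for_failure(&c2); /* already_marked; no double-mark */
      return 0;
    }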
Last edited 3 months ago by asn

comment:3 Changed 3 months ago by mikeperry

Status: needs_revision → needs_review

I don't feel super strongly about the second commit. I threw it in as an afterthought so we could compare it against #25347. I guess I forgot to even check that it compiled :/

I think we can drop it, subject to a plan to move to two guards. I do not think it is at all safe to stay with one guard *and* allow guards to be DoS'd to the point where client activity ceases. That is a clear confirmation signal that can be used to build a guard discovery attack.

https://oniongit.eu/mikeperry/tor/commits/bug25705_v2 (just the first commit)

comment:4 Changed 3 months ago by asn

Posting a discussion with armadev from yesterday which might be relevant to this ticket. I don't have enough time to dig into this right now, so I'll just post the raw logs:

16:01 <+armadev> yeah, i am expecting that when i read 25705, i will want to remind people that right now we *do* want to call some of that code for circuits where we failed to pick the first hop,
16:01 <+armadev> because that is how we rate limit the amount of thrashing that tor does when it goes offline and stays there for a while
16:02 <+asn> hmm
16:02 <+asn> what you mean?
16:02 <+asn> i have not considered that
16:02 <+armadev> that is, if you're offline and you've marked all your tors down, you wake up every so often, and try some circuits, and they all fail because you don't think anything is up, and you sleep again for a bit
16:02 <+armadev> it is possible that prop#271 changed that behavior
16:02 -zwiebelbot:#tor-dev- Prop#271: Another algorithm for guard selection [CLOSED]
16:03 <+asn> we sleep for a bit?
16:03 <+asn> do we?
16:03 <+asn> i dont think we do? either with prop271 or before
16:03 <+asn> do we?
16:04 <+armadev> check out did_circs_fail_last_period
16:04 <+armadev> if we have MAX_CIRCUIT_FAILURES failures in this period,
16:04 <+armadev> and also we had MAX_CIRCUIT_FAILURES in the last period
16:04 <+armadev> then we abort all circuits
16:05 <+armadev> that's part of why it is important to not have too many failures that aren't really failures
16:06 <+armadev> because if you have too many, you'll trigger the "woah stop for a bit" feature
16:06 <+asn> hmmm
16:06 <+armadev> which is good if they're really failures
16:06 <+asn> i was not aware of this
16:06 <+asn> feature
16:07 <+armadev> now you are! :)
16:07 <+armadev> it is mostly for tors that are actually offline right now
16:07 <+armadev> so they thrash less
16:07 <+armadev> and in that respect it's probably related to the "use less cpu when not in active use" tickets
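For readers unfamiliar with the feature armadev describes, here is a rough, self-contained sketch of its shape. The names follow the log (MAX_CIRCUIT_FAILURES, did_circs_fail_last_period), but the threshold value and the surrounding logic are assumptions for illustration, not the actual code from circuituse.c:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_CIRCUIT_FAILURES 5 /* assumed value, for illustration */

    static int n_circuit_failures = 0;
    static int did_circs_fail_last_period = 0;

    /* Called at the end of each accounting period: if we hit the
     * failure cap both this period and the last one, stop launching
     * circuits for a while ("woah stop for a bit"). */
    static bool
    should_pause_circuit_building(void)
    {
      bool failed_this_period =
          n_circuit_failures >= MAX_CIRCUIT_FAILURES;
      bool pause = failed_this_period && did_circs_fail_last_period;

      did_circs_fail_last_period = failed_this_period;
      n_circuit_failures = 0; /* start the next period fresh */
      return pause;
    }

    int main(void) {
      /* Simulate two consecutive bad periods. */
      for (int period = 0; period < 2; period++) {
        n_circuit_failures = MAX_CIRCUIT_FAILURES;
        printf("period %d: pause=%d\n", period,
               should_pause_circuit_building());
      }
      return 0;
    }

This is why over-counting matters: path-selection failures counted here could trip the pause even though the network is fine.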

comment:5 Changed 3 months ago by mikeperry

How about a log_fn_ratelim(LOG_NOTICE/LOG_WARN) here instead of the info log, then? This is not something that should happen unless the user is doing something crazy in torrc, or there is another bug (such as not having enough mds and never getting more). In each case they/we want to know that it happened, and I also think that not giving up in these cases is a good move, even if it costs a bit of spinning.
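For reference, a sketch of what that might look like with Tor's rate-limited logging API. This assumes the ratelim_t / RATELIM_INIT / log_fn_ratelim interface from Tor's logging headers; the message text, interval, and function name are illustrative, not the merged patch:

    /* Illustrative only; assumes Tor's internal logging headers. */
    static void
    note_path_selection_failure(void)
    {
      /* Emit at most one notice per hour, however often we fail. */
      static ratelim_t ratelim = RATELIM_INIT(3600);
      log_fn_ratelim(&ratelim, LOG_NOTICE, LD_CIRC,
                     "Failed to find enough nodes to build a circuit "
                     "path. This can happen if we are low on "
                     "microdescriptors or if torrc options restrict "
                     "paths too much.");
    }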

comment:6 Changed 3 months ago by dgoulet

Milestone: Tor: 0.3.4.x-final

comment:7 Changed 3 months ago by dgoulet

Reviewer: asn

comment:8 Changed 3 months ago by asn

Status: needs_review → merge_ready

I don't think arma's concern above applies. I just tested Mike's branch (and master) against an offline network and confirmed that the MAX_CIRCUIT_FAILURES spinning protection still kicks in. That's because this ticket only stops counting failures when they are path selection failures; with an offline network, the failures are circuit-build failures rather than path-selection failures, so they are still counted correctly.

I took a second look at Mike's patch and it looks good to me. Marking this as merge_ready.

comment:9 Changed 3 months ago by nickm

Roger, are you persuaded? Asn's testing seems convincing to me, and the code looks okay, to the extent that I understand the subtleties.

Mike and Asn, how far do you think we might want to backport it? It's generally most convenient to base a branch we plan to backport on the earliest release to which we might want to merge it.

comment:10 Changed 3 months ago by mikeperry

Ok, I added the rate-limited log message and put this on maint-0.3.3 as mikeperry/bug25705_v3_033.

I think an 0.3.3 backport makes sense, because it would be nice to have this type of checking in place for the HSLayerXNodes options. I am less sure it needs a further backport since we nacked the #25347-related change.

In addition to what asn said, the other thing that keeps Roger's concern from applying is that this patch bails out before incrementing n_circuit_failures, so these failures won't trigger the "woah stop for a bit" behavior. I believe that is exactly what we want for these types of failures: they should not cause us to blame the guard or give up on the network, since neither is at fault.

comment:11 Changed 2 months ago by nickm

Milestone: Tor: 0.3.4.x-final → Tor: 0.3.3.x-final

Merged to 0.3.4; marking for backport to 0.3.3 assuming nothing breaks.

comment:12 Changed 6 weeks ago by teor

Keywords: 033-backport added

comment:13 Changed 7 days ago by nickm

Resolution: implemented
Status: merge_ready → closed

Seems not to have broken anything in 0.3.4; backporting to 0.3.3.
