Opened 4 weeks ago

Last modified 3 weeks ago

#32108 merge_ready defect

tor can overrun its accountingmax if it enters soft hibernation first

Reported by: arma Owned by: arma
Priority: High Milestone: Tor: 0.4.1.x-final
Component: Core Tor/Tor Version: Tor: 0.4.0.1-alpha
Severity: Normal Keywords: extra-review consider-backport-after-0424 network-health 042-backport 041-backport 040-backport BugSmashFund
Cc: Actual Points:
Parent ID: Points:
Reviewer: dgoulet, teor Sponsor:

Description

I'll put the punchline first: second_elapsed_callback(), which is where we check if it's time to go dormant for hibernation, is no longer called when we've entered soft hibernation.

I assume this is because of the new "periodic event flag" feature, where we try to avoid calling callbacks when we're in a state that won't need them. See the "online and active" note here:

  /* This is a legacy catch-all callback that runs once per second if
   * we are online and active. */
  CALLBACK(second_elapsed, NET_PARTICIPANT,
           FL(NEED_NET)|FL(RUN_ON_DISABLE)),

The impact is limited, since we stop accepting new connections and new circuits when we enter soft hibernation, but it can still be bad: existing connections and circuits can last for a long time and use a lot of bandwidth.

A secondary impact is that accounting_run_housekeeping() never gets called, which means that the state file never gets updated after we've entered soft hibernation, which means these bandwidth overspends are never recorded to disk.

I think the bug went in during commit 4bf79fa4f which is part of Tor 0.4.0.1-alpha.

The PERIODIC_EVENT_FLAG_NEED_NET flag (what FL(NEED_NET) expands into) checks net_is_disabled(), but there is another function right after net_is_disabled in netstatus.c called net_is_completely_disabled(), and the only difference is that it checks we_are_fully_hibernating() vs we_are_hibernating().

I confirmed the overall bug happens in practice: there's a relay operator in #tor who hit soft hibernation, and then saw his tor proceed to use more bytes than expected. I had him do a 'gdb attach' to his tor and break on 'second_elapsed_callback' and the function never got called.

It seems like the immediate fix, and best backport plan, would be to resume calling second_elapsed_callback even when net_is_disabled(). The longer term plan can be to audit our calls to net_is_disabled() and net_is_completely_disabled() and we_are_hibernating(), with an eye towards "should we be doing this behavior while soft hibernating", and see what other bugs we find.

Child Tickets

Change History (21)

comment:1 Changed 4 weeks ago by arma

Here is a patch that ought to work, for relay operators getting bitten by this bug today:

diff --git a/src/core/mainloop/mainloop.c b/src/core/mainloop/mainloop.c
index 2f8be9961d..6f29a6981a 100644
--- a/src/core/mainloop/mainloop.c
+++ b/src/core/mainloop/mainloop.c
@@ -1388,7 +1388,7 @@ STATIC periodic_event_item_t mainloop_periodic_events[] = {
   /* This is a legacy catch-all callback that runs once per second if
    * we are online and active. */
   CALLBACK(second_elapsed, NET_PARTICIPANT,
-           FL(NEED_NET)|FL(RUN_ON_DISABLE)),
+           FL(RUN_ON_DISABLE)),
 
   /* XXXX Do we have a reason to do this on a callback? Does it do any good at
    * all?  For now, if we're dormant, we can let our listeners decay. */

The eventual patch we merge into the code might be different. Or maybe this is a great choice for a fix on Tor stable, and we'll do something smarter for the 0.4.2.x or 0.4.3.x development releases.

comment:2 Changed 4 weeks ago by nickm

Keywords: 042-backport 041-backport 040-backport 035-backport added
Milestone: Tor: 0.4.3.x-final

comment:3 in reply to:  2 Changed 4 weeks ago by arma

Keywords: 035-backport removed

Replying to nickm:

(I don't think it needs 035-backport, since I think 0.3.5's behavior is from before the bug was introduced.)

comment:4 Changed 3 weeks ago by arma

Status: newneeds_review

See my bug32108-option1 branch for a fix here. It's based on maint-0.4.0 because that's where the bug started.

(Nick or dgoulet would be good reviewer options because they worked on e.g #26064.)

I call it 'option1' because I am still pondering another approach, where we instead make PERIODIC_EVENT_FLAG_NEED_NET check net_is_completely_disabled(). That approach looks much harder to audit though, and more likely to have weird side effects on the 0.4.0 and 0.4.1 stables.

comment:5 Changed 3 weeks ago by dgoulet

Keywords: BugSmashFund added
Reviewer: dgoulet

comment:6 Changed 3 weeks ago by ahf

Status: needs_reviewneeds_revision

I have said this before and I will repeat it here: in the network team we use Github to handle "Pull Requests" for internal people in Tor. It is okay to not like Github, but that is what we have chosen to use nevertheless. For external people we are very flexible. You are in the first category.

We enforce this for the following reasons:

Firstly, the Github PR system forces the developer (that is you) to watch out for the CI (which happens *before* your code lands and *before* you ask anybody to review your code. This is unlike our Jenkins CI which happens *after* your code have landed. This allows us to catch bugs earl(y|ier) and thus makes the quality of Tor a lot better, which is a common goal for all of us, hopefully! We even run upfront CI on both Windows and Linux, so we catch quite a lot of things by doing this and people work hard on making this experience better, more smooth, and faster for everyone.

Secondly, in the network team we all help with the "coding related overhead". We do this by distributing the "boring tasks" that is around software development between each other, which ensures that everyone is having a reasonably "good time". This only works if *everybody* does the overhead work for their own stuff themselves. This includes logging into Github, opening a PR to torproject/tor, waiting for the PR to have both appveyor and travis go "green". If they don't, you have another iteration of your code to do. Once your PR is green, you can change status to needs_review and a nice person from the network team will happily review your code.

So no matter how small or simple a patch it, it still needs to go through the Github system for internal people. If you don't do it, someone else from the network team will have to do it for you and that is not fair.

Please resubmit this as a PR and re-request review again! If you need help with doing this, we are very open to helping out. Thanks! :-)

comment:7 Changed 3 weeks ago by gaba

Owner: set to arma
Status: needs_revisionassigned

comment:8 Changed 3 weeks ago by gaba

Status: assignedneeds_revision

comment:9 Changed 3 weeks ago by teor

Keywords: extra-review consider-backport-after-0424 added
Reviewer: dgouletdgoulet, teor
Status: needs_revisionneeds_review

I think this is the pull request:

I think this is a good change, but I'd like dgoulet to also check it.

Are there any conflicts when it merges forward?

We have a script for creating test merge forward branches, if you need it:
https://trac.torproject.org/projects/tor/wiki/org/teams/NetworkTeam/MergeProcess#HowdoweCreateBackportBranches

Please also open a ticket for the longer-term work:

It seems like the immediate fix, and best backport plan, would be to resume calling second_elapsed_callback even when net_is_disabled(). The longer term plan can be to audit our calls to net_is_disabled() and net_is_completely_disabled() and we_are_hibernating(), with an eye towards "should we be doing this behavior while soft hibernating", and see what other bugs we find.

comment:10 Changed 3 weeks ago by dgoulet

Status: needs_reviewmerge_ready

I think this is correct change.

However, semantically, we are really in a pit with second_elapsed_callback() that does way too many things and some checks explicitly look at net_is_disabled() which I think is why this fix is OK.

But a long term proper fix I believe would be to move the accounting into its own callback. And the expiring of circuit/connections code as well.

Last edited 3 weeks ago by dgoulet (previous) (diff)

comment:11 Changed 3 weeks ago by arma

Yeah, I'm pretty sad about turning all those things back on -- it means that Tor on Android stops getting some of the cpu and battery gains for example. But it is a huge bug in stable relays, and we must fix it, and we must not break stuff when we fix it, so it's the best plan I've got.

comment:12 Changed 3 weeks ago by nickm

I haven't read the patch yet, but is it reasonable to make this once-per-second callback happen only when accounting is turned on?

comment:13 Changed 3 weeks ago by arma

Yes, I think that would be fine. Or rather, it needs to call the event when net is on, and also when accounting is on. I guess that is what you meant by 'only'.

It would also be reasonable to make a new FL(NEED_NET_ACTUALLY) that checks net_is_completely_disabled() rather than net_is_disabled(), and use it for that one situation.

I didn't want to complexify things though.

comment:14 Changed 3 weeks ago by nickm

Status: merge_readyneeds_revision

Let's try to do this in a way that doesn't cause a massive performance regression.

comment:15 Changed 3 weeks ago by nickm

Status: needs_revisionneeds_review

Actually, this might not have the performance regression we're talking about.

The core of the patch here is:

  CALLBACK(second_elapsed, NET_PARTICIPANT,
-           FL(NEED_NET)|FL(RUN_ON_DISABLE)),
+           FL(RUN_ON_DISABLE)),

So note that this periodic event is still labeled as NET_PARTICIPANT. IIRC, when a client enters dormant mode, it lstill loses NET_PARTICIPANT: NEED_NET is the only thing affected by hibernation.

So I think this might be okay?

comment:16 in reply to:  15 Changed 3 weeks ago by dgoulet

Replying to nickm:

Actually, this might not have the performance regression we're talking about.

The core of the patch here is:

  CALLBACK(second_elapsed, NET_PARTICIPANT,
-           FL(NEED_NET)|FL(RUN_ON_DISABLE)),
+           FL(RUN_ON_DISABLE)),

So note that this periodic event is still labeled as NET_PARTICIPANT. IIRC, when a client enters dormant mode, it lstill loses NET_PARTICIPANT: NEED_NET is the only thing affected by hibernation.

So I think this might be okay?

I believe you are right. We have a specific callbacks that check for network participation (check_network_participation_callback()) and it will set network participation to false if the client is dormant.

Relays will always be participating and thus should always call second_elapsed.

comment:17 Changed 3 weeks ago by nickm

Okay. As discussed on IRC, I'd like somebody to test these hypotheses out before we merge. (Those being hypotheses being: that dormant mode for clients still works, and that this actually fixes the accounting bug.)

comment:18 Changed 3 weeks ago by dgoulet

I've brought down my DormantClientTimeout as low as I could and I confirm that with this patch the second_elapsed gets disabled once client become dormant. In other words, client still works and goes dormant with this patch.

Then I did a test with a relay with AccountingMax 10 MB and very early on:

Oct 22 14:36:39.396 [notice] Bandwidth soft limit reached; commencing hibernation. No new connections will be accepted
[...]
Oct 22 14:36:39.397 [notice] Launching periodic event second_elapsed

And then when it hit full dormant mode, the event is still enabled so accounting housekeeping will be executed.

comment:19 Changed 3 weeks ago by dgoulet

Status: needs_reviewmerge_ready

comment:20 Changed 3 weeks ago by nickm

Milestone: Tor: 0.4.3.x-finalTor: 0.4.1.x-final
Priority: MediumHigh

Okay, now I'm convinced. Let's try it out in 0.4.2.3-alpha, then backport.

(Merged to 0.4.2 and forward, marking for backport.)

comment:21 Changed 3 weeks ago by teor

Normally, we wait one whole alpha release before backporting. (So we could backport this ticket after 0.4.2.4.)

Do we need to backport it urgently?
If we do, can we do extra testing?

Note: See TracTickets for help on using tickets.