Opened 5 months ago

Last modified 4 months ago

#25306 merge_ready defect

tor_assertion_failed_(): Bug: ../src/or/hs_service.c:1985: rotate_all_descriptors: Assertion service->desc_current failed; aborting.

Reported by: cypherpunks Owned by: dgoulet
Priority: Medium Milestone: Tor: 0.3.2.x-final
Component: Core Tor/Tor Version: Tor: 0.3.3.2-alpha
Severity: Normal Keywords: tor-hs, crash, 033-must, 032-backport, review-group-34, 033-triage-20180320, 033-included-20180320
Cc: torproject.org@…, asn Actual Points:
Parent ID: Points:
Reviewer: asn Sponsor:

Description

I got this error shortly after my computer clock jumped forward a couple of hours. Sadly the log is all I have.

Child Tickets

Attachments (1)

tor.log (2.4 KB) - added by cypherpunks 5 months ago.
tor syslog entries

Download all attachments as: .zip

Change History (29)

Changed 5 months ago by cypherpunks

Attachment: tor.log added

tor syslog entries

comment:1 Changed 5 months ago by cypherpunks

Milestone: Tor: 0.3.3.x-final
tor_assertion_failed_(): Bug: ../src/or/hs_service.c:1985: rotate_all_descriptors: Assertion service->desc_current failed; aborting. (on Tor 0.3.3.2-alpha )
Bug: Assertion service->desc_current failed in rotate_all_descriptors at ../src/or/hs_service.c:1985. Stack trace: (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(log_backtrace+0x44) [0x556042f9ade4] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(tor_assertion_failed_+0x8d) [0x556042fb638d] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(hs_service_run_scheduled_events+0x848) [0x556042f894f8] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(+0x50611) [0x556042e62611] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(+0x70ee0) [0x556042e82ee0] (on Tor 0.3.3.2-alpha )
Bug:     /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5(event_base_loop+0x6a0) [0x7f66827aa5a0] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(do_main_loop+0x2bc) [0x556042e6681c] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(tor_run_main+0x275) [0x556042e67f25] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(tor_main+0x3a) [0x556042e6136a] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(main+0x19) [0x556042e610d9] (on Tor 0.3.3.2-alpha )
Bug:     /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7f66810062b1] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(_start+0x2a) [0x556042e6112a] (on Tor 0.3.3.2-alpha )

comment:2 Changed 5 months ago by nickm

Cc: dgoulet asn added
Keywords: tor-hs crash 033-must 032-backport added

comment:3 Changed 5 months ago by teor

Keywords: 031-backport added

hs_service.c is also in maint-0.3.1

comment:4 in reply to:  3 Changed 5 months ago by dgoulet

Keywords: 031-backport removed

Replying to teor:

hs_service.c is also in maint-0.3.1

In 031, it is a place holder only containing few functions for HS cell unit tests. This fix won't apply to 031, only 032 and forward where the HSv3 service support was added.

Last edited 5 months ago by dgoulet (previous) (diff)

comment:5 Changed 5 months ago by asn

Did some digging here but it's hard to figure it out without any logs :( Not sure how we can end up with desc_current being NULL there after a clock jump. I can think of some ways to avoid the assert, but I'd like to find the root cause instead of blindly applying duct-tape.

Perhaps one way to approach this is to try to reproduce this ourselves by setting up an hsv3 and clock jumping. Alternatively, we can make a branch with some diagnostics in the case of this bugs, and wait for some useful logs to appear.

In any case, due to the clock jump scenario this does not seem like a common urgent bug to fix, but we should try to address it soon.

comment:6 Changed 5 months ago by dgoulet

Hmmmmmm there is something very weird in rotate_all_descriptors().

See the comment in this code snippet:

    /* Note for a service booting up: Both descriptors are NULL in that case
     * so this function might return true if we are in the timeframe for a
     * rotation leading to basically swapping two NULL pointers which is
     * harmless. However, the side effect is that triggering a rotation will
     * update the service state and avoid doing anymore rotations after the
     * two descriptors have been built. */
    if (!should_rotate_descriptors(service, now)) {
      continue;
    }

    tor_assert(service->desc_current);
    tor_assert(service->desc_next);

We do rotation _before_ building descriptors (assuming we are booting up). The comment mentions that it is possible to have NULL descriptors at first and that is OK because we just swap NULL pointers.

But then why those asserts!? Comes from commit cd07af60c9e but it doesn't tell me much why they were put there especially with such a comment.

I think those asserts() should *not* be there and we haven't hit those yet because when tor boots up, I bet we are able to build the descriptors before we get the latest consensus so when we end up in a situation where should_rotate_descriptors() returns true, we have the descriptors.

comment:7 Changed 5 months ago by dgoulet

Cc: dgoulet removed
Owner: set to dgoulet
Status: newaccepted

I might have an idea on how to trigger those asserts in a unit test. Taking ownership.

comment:8 Changed 5 months ago by dgoulet

So I was able to recreate this in the lab. Quick unit test that hits the asserts(). Unfortunately, there are unknowns on how desc_current can end up NULL after a suspend but still, here are my findings:

If the service next_rotation_time ends up smaller than the valid_after time of the consensus, which usually it can't because when we do a rotation, we set it to the next time which is always in the future (after valid_after).

However, when going in suspend and coming back to life, tor gets a new consensus and thus the valid_after time is bigger than the next_rotation_time. Which means that tor will allow the service to rotate the descriptors going straight into those asserts().

So how can desc_current be NULL or even desc_next? For a regular service, I have no idea... I think it could be likely with ADD_ONION but the fact that we set the rotation time using the consensus and do it _before_ we build descriptors, we should never hit the asserts() because that time will always be smaller than valid_after so we won't allowed to rotate and then just after we'll build the descriptors.

Puzzling...

comment:9 Changed 5 months ago by asn

Status: acceptedneeds_information

I think we should figure out the precise bug here before removing asserts or anything that could shadow this bug. I think some log lines would really help out here, so perhaps we can wait till someone (or us) hit this bug and posts some more info here?

I'll put it in needs_info for now. Feel free to adjust if you disagree.

comment:10 Changed 5 months ago by teor

Status: needs_informationnew

As a first step, let's do what nickm suggested, and make the asserts non-fatal, and add extra logging.

We might miss a few asserts that way, but at least we won't crash people's services.
(Remember, some people rely on their services to log in to their machines and so it's hard for them to retrieve crash logs.)

comment:11 Changed 5 months ago by dgoulet

Agree, our move here for 033 I think is make those asserts() not fatal, recover but print useful information when it happens such as that next rotation time for instance!

Last edited 5 months ago by dgoulet (previous) (diff)

comment:12 Changed 5 months ago by dgoulet

Status: newneeds_review

What do we think of: bug25306_032_01

(Will need a changes file)

comment:13 in reply to:  12 ; Changed 5 months ago by asn

Status: needs_reviewneeds_revision

Replying to dgoulet:

What do we think of: bug25306_032_01

(Will need a changes file)

Hmm, plausible! There is another assert tho in build_all_descriptors() that we would hit in this case. We should probs plug that too.

comment:14 in reply to:  13 Changed 5 months ago by dgoulet

Replying to asn:

Hmm, plausible! There is another assert tho in build_all_descriptors() that we would hit in this case. We should probs plug that too.

Good catch. Having current NULL and next non NULL would be very bizarre but lets catch it properly in case.

See fixup?: 9792961b37

comment:15 Changed 5 months ago by dgoulet

Status: needs_revisionneeds_review

comment:16 Changed 5 months ago by asn

Status: needs_reviewneeds_revision

Two last things:

  • Does the code that we added in this ticket allow the service to recover from this issue where it has no descriptors? That is, will it ever rebuild the needed descriptors and rotate them gracefully, or will it get permanently stuck in a broken state? My understanding from reading the code a bit is that the latter case will happen. Do we care about the latter if it happens?
  • In rotate_all_descriptors() let's add the BUG check inside should_rotate_descriptors(). I think that's conceptually the right place to do any check wrt rotating descriptors.

comment:17 in reply to:  16 Changed 5 months ago by dgoulet

Reviewer: asn
Status: needs_revisionneeds_review

Replying to asn:

Two last things:

  • Does the code that we added in this ticket allow the service to recover from this issue where it has no descriptors? That is, will it ever rebuild the needed descriptors and rotate them gracefully, or will it get permanently stuck in a broken state? My understanding from reading the code a bit is that the latter case will happen. Do we care about the latter if it happens?

I think it will get stuck if current is NULL and next is not. We won't be able to rotate nor build a current one if no rotation.

If both are NULL, it won't get stuck, new desc will be created. If only next is NULL, it won't get stuck as it will be built just after.

Tbh, having a current == NULL, which I don't think is possible without also having a next == NULL, we have a very bad code flow issue. I think at that point, we might want the service to stop working properly so the operator can notice the BUG() and tell us. I've personally never ever seen this so if it happens (like this ticket maybe), it should be rare or if not, we'll catch it soon enough.

  • In rotate_all_descriptors() let's add the BUG check inside should_rotate_descriptors(). I think that's conceptually the right place to do any check wrt rotating descriptors.

See fixup commit: 4cad1376c6

Notice that I changed a bit the log message to print the desc pointer as in we do want to know which desc are NULL.

comment:18 Changed 5 months ago by asn

Status: needs_reviewmerge_ready

LGTM.

(I don't really like that the new check is nested inside an old check, but fixing that would introduce non-trivial changes (but still pretty easy to do). Feel free to fix that if you want or not.)

comment:19 Changed 5 months ago by nickm

I'm okay with merging this. David, do you want to make a change for George's comment above, or should I squash and merge as-is?

comment:20 Changed 5 months ago by nickm

Keywords: review-group-34 added

comment:21 Changed 4 months ago by dgoulet

I kind of think we need to nest the check unfortunately in order to catch the bug. There is really only one way this can happen and it is if our valid_after time is bigger than the service rotation time.

If we were to not nest the checks, we would have something like this:

   if (ns->valid_after >= service->state.next_rotation_time ||
       BUG(service->desc_current == NULL || service->desc_next == NULL)) {
     goto no_rotation;
   }
   if (ns->valid_after >= service->state.next_rotation_time) {
     goto rotation;
   }

I'm not super convinced it makes the code simpler? or more trivial?

comment:22 Changed 4 months ago by nickm

Keywords: 033-triage-20180320 added

Marking all tickets reached by current round of 033 triage.

comment:23 Changed 4 months ago by nickm

Keywords: 033-included-20180320 added

Mark 033-must tickets as triaged-in for 0.3.3

comment:24 in reply to:  21 Changed 4 months ago by asn

Replying to dgoulet:

I kind of think we need to nest the check unfortunately in order to catch the bug. There is really only one way this can happen and it is if our valid_after time is bigger than the service rotation time.

If we were to not nest the checks, we would have something like this:

   if (ns->valid_after >= service->state.next_rotation_time ||
       BUG(service->desc_current == NULL || service->desc_next == NULL)) {
     goto no_rotation;
   }
   if (ns->valid_after >= service->state.next_rotation_time) {
     goto rotation;
   }

I'm not super convinced it makes the code simpler? or more trivial?

Agreed. We can proceed as is.

comment:25 Changed 4 months ago by nickm

Milestone: Tor: 0.3.3.x-finalTor: 0.3.2.x-final

Squashed as bug25306_032_01_squashed and merging to 0.3.3. If it works out there, let's backport to 0.3.2.

comment:26 Changed 4 months ago by nickm

Milestone: Tor: 0.3.2.x-finalTor: 0.3.3.x-final
Status: merge_readyneeds_revision

Oh wait. Not merging yet. Please write a changes file?

comment:27 in reply to:  26 Changed 4 months ago by dgoulet

Status: needs_revisionmerge_ready

Replying to nickm:

Oh wait. Not merging yet. Please write a changes file?

Pushed in fixup commit afe0902b1ac32404

comment:28 Changed 4 months ago by nickm

Milestone: Tor: 0.3.3.x-finalTor: 0.3.2.x-final

Okay, now the branch to backport is called bug25306_032_01_squashed_v2. It's merged to 0.3.3 and forward. Possible 0.3.2 backport

Note: See TracTickets for help on using tickets.