Opened 12 months ago
Closed 3 months ago
#25306 closed defect (fixed)
tor_assertion_failed_(): Bug: ../src/or/hs_service.c:1985: rotate_all_descriptors: Assertion service->desc_current failed; aborting.
Reported by: | cypherpunks | Owned by: | dgoulet
---|---|---|---
Priority: | Medium | Milestone: | Tor: 0.3.2.x-final
Component: | Core Tor/Tor | Version: | Tor: 0.3.3.2-alpha
Severity: | Normal | Keywords: | tor-hs, crash, 033-must, review-group-34, 033-triage-20180320, 033-included-20180320, 032-unreached-backport
Cc: | torproject.org@…, asn | Actual Points: |
Parent ID: | | Points: |
Reviewer: | asn | Sponsor: |
Description
I got this error shortly after my computer clock jumped forward a couple of hours. Sadly the log is all I have.
Child Tickets
Attachments (1)
Change History (31)
Changed 12 months ago by
comment:1 Changed 12 months ago by
Milestone: | → Tor: 0.3.3.x-final |
---|
```
tor_assertion_failed_(): Bug: ../src/or/hs_service.c:1985: rotate_all_descriptors: Assertion service->desc_current failed; aborting. (on Tor 0.3.3.2-alpha )
Bug: Assertion service->desc_current failed in rotate_all_descriptors at ../src/or/hs_service.c:1985. Stack trace: (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(log_backtrace+0x44) [0x556042f9ade4] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(tor_assertion_failed_+0x8d) [0x556042fb638d] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(hs_service_run_scheduled_events+0x848) [0x556042f894f8] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(+0x50611) [0x556042e62611] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(+0x70ee0) [0x556042e82ee0] (on Tor 0.3.3.2-alpha )
Bug:     /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5(event_base_loop+0x6a0) [0x7f66827aa5a0] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(do_main_loop+0x2bc) [0x556042e6681c] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(tor_run_main+0x275) [0x556042e67f25] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(tor_main+0x3a) [0x556042e6136a] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(main+0x19) [0x556042e610d9] (on Tor 0.3.3.2-alpha )
Bug:     /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7f66810062b1] (on Tor 0.3.3.2-alpha )
Bug:     /usr/bin/tor(_start+0x2a) [0x556042e6112a] (on Tor 0.3.3.2-alpha )
```
comment:2 Changed 12 months ago by
Cc: | dgoulet asn added |
---|---|
Keywords: | tor-hs crash 033-must 032-backport added |
comment:3 follow-up: 4 Changed 12 months ago by
Keywords: | 031-backport added |
---|
hs_service.c is also in maint-0.3.1
comment:4 Changed 12 months ago by
Keywords: | 031-backport removed |
---|
Replying to teor:
hs_service.c is also in maint-0.3.1
In 031, it is a placeholder only containing a few functions for HS cell unit tests. This fix won't apply to 031, only to 032 and forward, where the HSv3 service support was added.
comment:5 Changed 12 months ago by
Did some digging here, but it's hard to figure out without any logs :( Not sure how we can end up with desc_current being NULL there after a clock jump. I can think of some ways to avoid the assert, but I'd like to find the root cause instead of blindly applying duct tape.
Perhaps one way to approach this is to try to reproduce it ourselves by setting up an hsv3 service and clock jumping. Alternatively, we can make a branch with some diagnostics in the case of this bug, and wait for some useful logs to appear.
In any case, due to the clock jump scenario, this does not seem like a common or urgent bug, but we should try to address it soon.
comment:6 Changed 12 months ago by
Hmmmmmm, there is something very weird in rotate_all_descriptors().
See the comment in this code snippet:
```c
/* Note for a service booting up: Both descriptors are NULL in that case
 * so this function might return true if we are in the timeframe for a
 * rotation leading to basically swapping two NULL pointers which is
 * harmless. However, the side effect is that triggering a rotation will
 * update the service state and avoid doing anymore rotations after the
 * two descriptors have been built. */
if (!should_rotate_descriptors(service, now)) {
  continue;
}

tor_assert(service->desc_current);
tor_assert(service->desc_next);
```
We do rotation _before_ building descriptors (assuming we are booting up). The comment mentions that it is possible to have NULL descriptors at first and that is OK because we just swap NULL pointers.
But then why those asserts!? They come from commit cd07af60c9e, but that doesn't tell me much about why they were put there, especially next to such a comment.
I think those asserts() should *not* be there, and we haven't hit them yet because, when tor boots up, I bet we are able to build the descriptors before we get the latest consensus, so by the time should_rotate_descriptors() returns true, we have the descriptors.
comment:7 Changed 12 months ago by
Cc: | dgoulet removed |
---|---|
Owner: | set to dgoulet |
Status: | new → accepted |
I might have an idea on how to trigger those asserts in a unit test. Taking ownership.
comment:8 Changed 12 months ago by
So I was able to recreate this in the lab with a quick unit test that hits the asserts(). Unfortunately, there are still unknowns about how desc_current can end up NULL after a suspend, but here are my findings:
The asserts are reachable if the service's next_rotation_time ends up smaller than the valid_after time of the consensus. Usually it can't, because when we do a rotation we set it to the next rotation time, which is always in the future (after valid_after). However, when going into suspend and coming back to life, tor gets a new consensus, and thus the valid_after time is bigger than the next_rotation_time. This means tor will allow the service to rotate the descriptors, going straight into those asserts().
So how can desc_current, or even desc_next, be NULL? For a regular service, I have no idea... I think it could be likely with ADD_ONION, but since we set the rotation time using the consensus, and do it _before_ we build descriptors, we should never hit the asserts(): that rotation time will always be in the future relative to valid_after, so we won't be allowed to rotate, and just after we'll build the descriptors.
Puzzling...
comment:9 Changed 12 months ago by
Status: | accepted → needs_information |
---|
I think we should figure out the precise bug here before removing asserts or anything else that could shadow this bug. I think some log lines would really help out here, so perhaps we can wait till someone (or us) hits this bug and posts more info here?
I'll put it in needs_info for now. Feel free to adjust if you disagree.
comment:10 Changed 12 months ago by
Status: | needs_information → new |
---|
As a first step, let's do what nickm suggested, and make the asserts non-fatal, and add extra logging.
We might miss a few asserts that way, but at least we won't crash people's services.
(Remember, some people rely on their services to log in to their machines and so it's hard for them to retrieve crash logs.)
comment:11 Changed 12 months ago by
Agreed. Our move here for 033, I think, is to make those asserts() non-fatal: recover, but print useful information when it happens, such as that next rotation time!
comment:12 follow-up: 13 Changed 12 months ago by
Status: | new → needs_review |
---|
What do we think of: bug25306_032_01
(Will need a changes file)
comment:13 follow-up: 14 Changed 12 months ago by
Status: | needs_review → needs_revision |
---|
Replying to dgoulet:
What do we think of:
bug25306_032_01
(Will need a changes file)
Hmm, plausible! There is another assert tho in build_all_descriptors() that we would hit in this case. We should probs plug that too.
comment:14 Changed 12 months ago by
Replying to asn:
Hmm, plausible! There is another assert tho in build_all_descriptors() that we would hit in this case. We should probs plug that too.
Good catch. Having current NULL and next non-NULL would be very bizarre, but let's catch it properly just in case.
See fixup?: 9792961b37
comment:15 Changed 12 months ago by
Status: | needs_revision → needs_review |
---|
comment:16 follow-up: 17 Changed 12 months ago by
Status: | needs_review → needs_revision |
---|
Two last things:
- Does the code that we added in this ticket allow the service to recover from this issue where it has no descriptors? That is, will it ever rebuild the needed descriptors and rotate them gracefully, or will it get permanently stuck in a broken state? My understanding from reading the code a bit is that the latter case will happen. Do we care about the latter if it happens?
- In rotate_all_descriptors(), let's add the BUG check inside should_rotate_descriptors(). I think that's conceptually the right place to do any check wrt rotating descriptors.
comment:17 Changed 12 months ago by
Reviewer: | → asn |
---|---|
Status: | needs_revision → needs_review |
Replying to asn:
Two last things:
- Does the code that we added in this ticket allow the service to recover from this issue where it has no descriptors? That is, will it ever rebuild the needed descriptors and rotate them gracefully, or will it get permanently stuck in a broken state? My understanding from reading the code a bit is that the latter case will happen. Do we care about the latter if it happens?
I think it will get stuck if current is NULL and next is not. We won't be able to rotate, nor build a current one without a rotation.
If both are NULL, it won't get stuck, new desc will be created. If only next is NULL, it won't get stuck as it will be built just after.
Tbh, if we have current == NULL, which I don't think is possible without also having next == NULL, we have a very bad code flow issue. I think at that point, we might want the service to stop working properly so the operator can notice the BUG() and tell us. I've personally never seen this, so if it happens (like this ticket maybe), it should be rare; if not, we'll catch it soon enough.
- In rotate_all_descriptors(), let's add the BUG check inside should_rotate_descriptors(). I think that's conceptually the right place to do any check wrt rotating descriptors.
See fixup commit: 4cad1376c6
Notice that I changed the log message a bit to print the descriptor pointers, since we do want to know which descriptors are NULL.
comment:18 Changed 12 months ago by
Status: | needs_review → merge_ready |
---|
LGTM.
(I don't really like that the new check is nested inside an old check, but fixing that would introduce non-trivial changes (but still pretty easy to do). Feel free to fix that if you want or not.)
comment:19 Changed 12 months ago by
I'm okay with merging this. David, do you want to make a change for George's comment above, or should I squash and merge as-is?
comment:20 Changed 12 months ago by
Keywords: | review-group-34 added |
---|
comment:21 follow-up: 24 Changed 11 months ago by
I kind of think we unfortunately need to nest the check in order to catch the bug. There is really only one way this can happen, and it is if our valid_after time is bigger than the service rotation time.
If we were to not nest the checks, we would have something like this:
```c
if (ns->valid_after >= service->state.next_rotation_time ||
    BUG(service->desc_current == NULL || service->desc_next == NULL)) {
  goto no_rotation;
}
if (ns->valid_after >= service->state.next_rotation_time) {
  goto rotation;
}
```
I'm not super convinced it makes the code simpler? or more trivial?
comment:22 Changed 11 months ago by
Keywords: | 033-triage-20180320 added |
---|
Marking all tickets reached by current round of 033 triage.
comment:23 Changed 11 months ago by
Keywords: | 033-included-20180320 added |
---|
Mark 033-must tickets as triaged-in for 0.3.3
comment:24 Changed 11 months ago by
Replying to dgoulet:
I kind of think we need to nest the check unfortunately in order to catch the bug. There is really only one way this can happen and it is if our valid_after time is bigger than the service rotation time.
If we were to not nest the checks, we would have something like this:

```c
if (ns->valid_after >= service->state.next_rotation_time ||
    BUG(service->desc_current == NULL || service->desc_next == NULL)) {
  goto no_rotation;
}
if (ns->valid_after >= service->state.next_rotation_time) {
  goto rotation;
}
```

I'm not super convinced it makes the code simpler? or more trivial?
Agreed. We can proceed as is.
comment:25 Changed 11 months ago by
Milestone: | Tor: 0.3.3.x-final → Tor: 0.3.2.x-final |
---|
Squashed as bug25306_032_01_squashed and merging to 0.3.3. If it works out there, let's backport to 0.3.2.
comment:26 follow-up: 27 Changed 11 months ago by
Milestone: | Tor: 0.3.2.x-final → Tor: 0.3.3.x-final |
---|---|
Status: | merge_ready → needs_revision |
Oh wait. Not merging yet. Please write a changes file?
comment:27 Changed 11 months ago by
Status: | needs_revision → merge_ready |
---|
Replying to nickm:
Oh wait. Not merging yet. Please write a changes file?
Pushed in fixup commit afe0902b1ac32404
comment:28 Changed 11 months ago by
Milestone: | Tor: 0.3.3.x-final → Tor: 0.3.2.x-final |
---|
Okay, now the branch to backport is called bug25306_032_01_squashed_v2. It's merged to 0.3.3 and forward. Possible 0.3.2 backport.
comment:29 Changed 3 months ago by
Keywords: | 032-unreached-backport added; 032-backport removed |
---|
0.3.2 is end of life, so 032-backport is now 032-unreached-backport.
comment:30 Changed 3 months ago by
Resolution: | → fixed |
---|---|
Status: | merge_ready → closed |
These tickets do not need to be backported to 0.2.9, and 0.3.2 will not get any more backports. Closing.
tor syslog entries