#24610 closed defect (duplicate)

assertion !(status == HS_CLIENT_FETCH_HAVE_DESC) failed

Reported by: cypherpunks
Owned by:
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version: Tor: 0.3.2.1-alpha
Severity: Normal
Keywords: prop224, tor-hs, 032-backport, 034-triage-20180328, 034-removed-20180328
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Mmm dd hh:01:30.000 [notice] Your system clock just jumped 2000 seconds forward; assuming established circuits no longer work.
Mmm dd hh:02:00.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Mmm dd hh:02:00.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Mmm dd hh:03:07.000 [warn] tor_bug_occurred_(): Bug: src/or/hs_client.c:267: retry_all_socks_conn_waiting_for_desc: Non-fatal assertion !(status == HS_CLIENT_FETCH_HAVE_DESC) failed. (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug: Non-fatal assertion !(status == HS_CLIENT_FETCH_HAVE_DESC) failed in retry_all_socks_conn_waiting_for_desc at src/or/hs_client.c:267. Stack trace: (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug:     /usr/bin/tor(log_backtrace+0x52) [0x3d56c44a32] (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug:     /usr/bin/tor(tor_bug_occurred_+0xc5) [0x3d56c61835] (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug:     /usr/bin/tor(hs_client_dir_info_changed+0x116) [0x3d56c23fc6] (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug:     /usr/bin/tor(networkstatus_set_current_consensus+0x3c2) [0x3d56b15f02] (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug:     /usr/bin/tor(connection_dir_reached_eof+0x14db) [0x3d56bf23fb] (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug:     /usr/bin/tor(+0x10ff39) [0x3d56bc8f39] (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug:     /usr/bin/tor(+0x5075d) [0x3d56b0975d] (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug:     /usr/lib64/libevent-2.1.so.6(+0x229ed) [0x3bd918df9ed] (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug:     /usr/lib64/libevent-2.1.so.6(event_base_loop+0x4f7) [0x3bd918e07f7] (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug:     /usr/bin/tor(do_main_loop+0x29d) [0x3d56b0a76d] (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug:     /usr/bin/tor(tor_main+0xe2d) [0x3d56b0d62d] (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug:     /usr/bin/tor(main+0x28) [0x3d56b05ad8] (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug:     /lib64/libc.so.6(__libc_start_main+0xf6) [0x3bd8ff35f66] (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:07.000 [warn] Bug:     /usr/bin/tor(_start+0x2a) [0x3d56b05b2a] (on Tor 0.3.2.6-alpha 87012d076ef58bb9)
Mmm dd hh:03:40.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:port. Giving up. (waiting for rendezvous desc)

I don't know whether suspend+resume is necessary to trigger the bug.

Child Tickets

Change History (13)

comment:1 Changed 22 months ago by teor

Keywords: prop224 tor-hs added
Milestone: Tor: 0.3.2.x-final
Version: Tor: 0.3.2.1-alpha

comment:2 Changed 22 months ago by dgoulet

Comment from the code:

    status = hs_client_refetch_hsdesc(&edge_conn->hs_ident->identity_pk);
    if (BUG(status == HS_CLIENT_FETCH_HAVE_DESC)) {
      /* This case is unique because it can NOT happen in theory. Once we get
       * a new descriptor, the HS client subsystem is notified immediately and
       * the connections waiting for it are handled which means the state will
       * change from renddesc wait state. Log this and continue to next
       * connection. */
      continue;
    }

So I think the logic here is wrong. Consider the following case:

Imagine your client has fetched the descriptor for a service and at some point none of its intro points are usable anymore (in other words, the intro points rotated), so the client has to fetch a new descriptor. Then the suspend happens while we are waiting for that descriptor fetch.

We come back from suspend, the clock jumps 2000 seconds (about 33 minutes), and rend_cache_failure_clean_callback() is called (a housekeeping function that runs every 30 seconds). It cleans up the intro point failure cache (entries have a lifetime of 2 minutes), which means the descriptor we still have in the cache is suddenly considered "usable" again.
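
To make the timing concrete, here is a tiny illustrative sketch (not Tor's real cleanup code; the helper name and constant are made up) of why a 2000-second jump empties a cache whose entries only live for 2 minutes:

    /* Illustrative only: hypothetical helper, not the actual Tor code. */
    #define FAILURE_CACHE_ENTRY_LIFETIME (2 * 60) /* seconds */

    static int
    failure_cache_entry_is_expired(time_t created_at, time_t now)
    {
      /* After resume, `now` has jumped ~2000 seconds forward, so every
       * entry created before the suspend looks expired and is removed,
       * which makes the cached descriptor look "usable" again. */
      return (now - created_at) > FAILURE_CACHE_ENTRY_LIFETIME;
    }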

While we are still waiting for the descriptor, a new consensus comes in, so we have new directory information, and we end up in retry_all_socks_conn_waiting_for_desc(), which retries all pending SOCKS connections waiting for a descriptor that may have been held back in the first place because we were missing directory information.

Then the BUG() is hit, because we try a refetch for a service for which we already have a descriptor, and that descriptor became usable again after the suspend.

So the BUG() should not be there: this situation has to be considered possible.

As a bonus, this should actually be a special case we handle by trying to attach the connection to a circuit, instead of leaving it in limbo in RENDDESC_WAIT state until it times out after 120 seconds. In short: if we have a usable descriptor, put the SOCKS connection in circuit-wait state and try to attach it.
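
Here is a hedged sketch of that idea (not an actual patch; the loop variables base_conn/entry_conn and the attach helper are assumed to match the existing code in retry_all_socks_conn_waiting_for_desc()):

    /* Sketch: if we already have a usable descriptor, don't BUG() -- move
     * the SOCKS connection to circuit-wait state and let it be attached,
     * instead of leaving it in RENDDESC_WAIT until the 120s timeout. */
    if (status == HS_CLIENT_FETCH_HAVE_DESC) {
      base_conn->state = AP_CONN_STATE_CIRCUIT_WAIT;
      connection_ap_mark_as_pending_circuit(entry_conn);
      continue;
    }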

comment:3 Changed 22 months ago by asn

Both commits look reasonable to me. I think we could do a better job of explaining the edge case in the first commit message, because I got quite confused reading it (it also seems to be a different issue from the one described in comment:2).

Here is the edge-case I consider:

1) Network trouble and we end up with unreachable intro points on an HS desc.
2) We start fetching new HS desc.
3) rend_cache_failure_clean_callback() triggers unluckily before receiving new HS desc, and makes the descriptor usable again.
4) We receive some leftover mds/consensus we asked for a few moments ago. This triggers hs_client_dir_info_changed() and we try to refetch the desc, but it seems the desc is now usable! Ugh, the bug triggers.

The first commit should fix this scenario because now can_client_refetch_desc() will return HS_CLIENT_FETCH_PENDING instead of HS_CLIENT_FETCH_HAVE_DESC. I wonder if there are any other such weird edge-cases this patch won't fix :S
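
Roughly, the reordering I have in mind looks like this (a sketch only; directory_request_is_pending() is an assumed helper name and the signature is simplified compared to the real function):

    /* Sketch: a pending fetch must win over the descriptor cache, otherwise
     * a descriptor that became usable again (step 3 above) is reported as
     * HS_CLIENT_FETCH_HAVE_DESC and trips the BUG(). */
    static hs_client_fetch_status_t
    can_client_refetch_desc(const ed25519_public_key_t *identity_pk)
    {
      if (directory_request_is_pending(identity_pk)) {
        return HS_CLIENT_FETCH_PENDING;
      }
      if (hs_cache_lookup_as_client(identity_pk)) {
        return HS_CLIENT_FETCH_HAVE_DESC;
      }
      /* Other preconditions (live consensus, known HSDirs, ...) elided. */
      return HS_CLIENT_FETCH_LAUNCHED;
    }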

Do we have a way to test this fix btw?

comment:4 Changed 22 months ago by dgoulet

Indeed, we shouldn't consider comment:2 accurate anymore. After working on this, a simpler edge case to trigger the bug became apparent, so what you describe is really what should be considered "the problem" of this ticket.

> The first commit should fix this scenario because now can_client_refetch_desc() will return HS_CLIENT_FETCH_PENDING instead of HS_CLIENT_FETCH_HAVE_DESC. I wonder if there are any other such weird edge-cases this patch won't fix :S

As you noticed, I left the BUG() there and even added a non-fatal assert, so if this is triggered again we can investigate further. At least, in both cases, the tor client is fine and recovers.

> Do we have a way to test this fix btw?

I would love to come up with a test case. It is a bit difficult because we would need to artificially inject the race in a unit test. I think a useful unit test overall would be to make sure we can *NOT* launch two directory requests in parallel for the same service; can_client_refetch_desc() is really our safeguard there.

The other thing we could probably unit test easily: with a pending directory request, mark a descriptor as usable and then call hs_client_dir_info_changed(), checking that we don't act on that usable descriptor.
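
A rough tinytest-style sketch of that second test (all helper_* functions are hypothetical stand-ins for test set-up we would have to write, not existing utilities):

    static void
    test_dir_info_changed_keeps_pending_request(void *arg)
    {
      ed25519_public_key_t service_pk;
      (void) arg;

      /* Hypothetical helpers: fill in a service key, fake an in-flight
       * descriptor fetch for it, and make its cached descriptor usable. */
      helper_fake_pending_hsdesc_request(&service_pk);
      helper_mark_cached_desc_usable(&service_pk);

      /* New directory information must not make us act on the cached
       * descriptor while the fetch is pending: no BUG(), no extra fetch. */
      hs_client_dir_info_changed();
      tt_int_op(helper_count_hsdesc_fetches(&service_pk), OP_EQ, 1);

     done:
      ;
    }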

Both seem very doable. Whether we can do it in the timeframe of the 032 release candidate, I don't know, but it is definitely needed in the short term.

comment:5 in reply to:  4 Changed 22 months ago by asn

Replying to dgoulet:

> Indeed, we shouldn't consider comment:2 accurate anymore. After working on this, a simpler edge case to trigger the bug became apparent, so what you describe is really what should be considered "the problem" of this ticket.

>> The first commit should fix this scenario because now can_client_refetch_desc() will return HS_CLIENT_FETCH_PENDING instead of HS_CLIENT_FETCH_HAVE_DESC. I wonder if there are any other such weird edge-cases this patch won't fix :S

> As you noticed, I left the BUG() there and even added a non-fatal assert, so if this is triggered again we can investigate further. At least, in both cases, the tor client is fine and recovers.

Sounds reasonable. Let's see if we can find a way to test this well.

If it's not robustly testable, we should consider whether we want to merge this in the last -rc or just leave it for 033. This seems like a pretty wild edge case, so delaying it might not be terrible.

comment:6 Changed 22 months ago by nickm

Keywords: 032-backport added
Milestone: Tor: 0.3.2.x-final → Tor: 0.3.3.x-final

comment:7 Changed 21 months ago by dgoulet

Milestone: Tor: 0.3.3.x-final → Tor: 0.3.4.x-final

Moving a bunch of tickets from 033 to 034.

comment:8 Changed 21 months ago by teor

I just saw this bug in a chutney network with:

chutney/tools/test-network.sh --flavour hs-v23-exit-min --connections 2 --rounds 7 \
    --bytes 1024 --start-time 50

So you might be able to test your patch using chutney.

comment:9 Changed 20 months ago by teor

I just saw this bug in another chutney network, hs-v23-ipv6-md, when my machine went to sleep during make test-network-all.

comment:10 Changed 19 months ago by nickm

Keywords: 034-triage-20180328 added

comment:11 Changed 19 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:12 Changed 18 months ago by nickm

Milestone: Tor: 0.3.4.x-final → Tor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

comment:13 Changed 13 months ago by asn

Resolution: duplicate
Status: new → closed

Closing as a duplicate of #27410.

Let me know if that's not right.
