Opened 18 months ago

Last modified 7 weeks ago

#25882 assigned defect

clients not detecting stale onion service introduction points

Reported by: cypherpunks
Owned by:
Priority: High
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: tor-hs, reachability, network-team-roadmap-september
Cc: dmr
Actual Points:
Parent ID: #30200
Points:
Reviewer:
Sponsor: Sponsor27-must

Description

Tor version 0.3.2.10 running on Debian stretch;

The Tor client runs continuously over many days. Periodically (daily in this case), an HTTP client reaches out to the same v2 onion service via Tor. Normally, this works correctly.

However, sometimes (no more than about 2/5 of the days), the HTTP client will fail to connect properly to the onion service, and its TCP connection will simply time out. Restarting Tor almost always solves this problem, although automating a reactive Tor restart is undesirable, particularly when doing so closes all of the circuits of all other client applications using the same Tor.

Looking closely, we observe that in such cases the Tor client is not able to complete a rendezvous with the onion service. Specifically, the Tor client reaches out to the same intro point many times in rapid succession, without ever managing to establish a connection. I surmise that a workaround might entail asking Tor to clear its cached onion service descriptor for this particular v2 service. However, the code in hs_client.c that handles client connections to onion services is intended to handle introduction point failures, so we should really determine why the many attempts to connect to the same introduction point are not recorded as failures that would ultimately lead to fetching the service descriptor again.

My logs show that during a typical onion service request, my Tor attempted to connect to the same introduction point, 176.36.20.10, a total of 47 times in the 112-second interval between 06:52:06 and 06:53:58 UTC.

Apr 21 06:53:55.000 [debug] relay_send_command_from_edge_(): delivering 33 cell forward.
Apr 21 06:53:55.000 [debug] relay_send_command_from_edge_(): Sending a RELAY_EARLY cell; 4 remaining.
Apr 21 06:53:55.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Apr 21 06:53:55.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Apr 21 06:53:55.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Apr 21 06:53:55.000 [debug] append_cell_to_circuit_queue(): Made a circuit active.
Apr 21 06:53:55.000 [debug] scheduler_channel_has_waiting_cells(): Channel 64 at 0x1c12450 went from waiting_for_cells to pending
Apr 21 06:53:55.000 [info] circuit_get_best(): There is an intro circuit being created right now, but it has already taken quite a while. Starting one in parallel.
Apr 21 06:53:55.000 [info] circuit_get_open_circ_or_launch(): Chose $04D962C6155BFC3705DC38699D4E6B3CE1524AE7~$04D962C6155BFC3705 at 176.36.20.10 as intro point for '[scrubbed]'.
Apr 21 06:53:55.000 [debug] circuit_find_to_cannibalize(): Hunting for a circ to cannibalize: purpose 6, uptime 0, capacity 1, internal 1
Apr 21 06:53:55.000 [info] circuit_launch_by_extend_info(): Cannibalizing circ 3202873042 (id: 457) for purpose 6 (Hidden service client: Connecting to intro point)
Apr 21 06:53:55.000 [debug] circuit_change_purpose(): changing purpose of origin circ 457 from "General-purpose client" (5) to "Hidden service client: Connecting to intro point" (6)
Apr 21 06:53:55.000 [debug] circuit_send_intermediate_onion_skin(): starting to send subsequent skin.
Apr 21 06:53:55.000 [info] circuit_send_intermediate_onion_skin(): Sending extend relay cell.
Apr 21 06:53:55.000 [debug] relay_send_command_from_edge_(): delivering 6 cell forward.
Apr 21 06:53:55.000 [debug] relay_send_command_from_edge_(): Sending a RELAY_EARLY cell; 4 remaining.
Apr 21 06:53:55.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Apr 21 06:53:55.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Apr 21 06:53:55.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Apr 21 06:53:55.000 [debug] append_cell_to_circuit_queue(): Made a circuit active.
Apr 21 06:53:55.000 [info] rep_hist_note_used_internal(): New port prediction added. Will continue predictive circ building for 2522 more seconds.
Apr 21 06:53:55.000 [info] connection_ap_handshake_attach_circuit(): Intro 3202873042 (id: 457) and rend circuit 2500882674 (id: 483) circuits are not both ready. Stalling conn. (109 sec old)
Apr 21 06:53:55.000 [info] circuit_predict_and_launch_new(): Have 12 clean circs (0 uptime-internal, 12 internal), need another hidden service circ.

Note that the same introduction point is chosen each time, and Tor never decides that the introduction point is invalid. I suggest that this may be why Tor never requests a new onion service descriptor. (It is not clear to me whether the "interesting" AS in which the introduction point resides (39608, in the United Arab Emirates) matters in this case.) I suspect the problem may be a matter of abstraction layers, e.g. Tor invalidates introduction points at a higher layer but retries at a lower layer, such that the retries are opaque to the invalidation process.

This problem should be considered important, as it may leave long-lived Tor clients unable to connect to long-lived onion services, not to mention causing false positives (Type I errors) in any systematic assessment of whether a given onion service is down.

Child Tickets

Ticket | Status | Owner | Summary | Component
#21621 | new | | Intro points can get stuck in CIRCUIT_PURPOSE_S_ESTABLISH_INTRO | Core Tor/Tor
#27841 | closed | neel | Surprise race: Intro point closes circuit after NACK, at the same time as client tries to extend circuit to new intro point | Core Tor/Tor
#27842 | new | | Consider end-to-end introduction ACKs | Core Tor/Tor
#31549 | merge_ready | nickm | Authorities should stop listing relays running pre-0.2.9, or running 0.3.0 through 0.3.4 | Core Tor/Tor

Attachments (8)

debug.log.part.gz (363.7 KB) - added by cypherpunks 18 months ago.
partial debug log
debug.A.log.gz (759.7 KB) - added by cypherpunks 18 months ago.
debug.B.log.gz (417.5 KB) - added by cypherpunks 18 months ago.
tor-25882.pdf (18.7 KB) - added by cypherpunks 18 months ago.
tor-25882.png (92.9 KB) - added by cypherpunks 18 months ago.
tmp_4396-tor-258823254419993784754180.pdf (1 bytes) - added by cypherpunks 18 months ago.
debug.onionservice.log.xz (1.0 MB) - added by cypherpunks2 15 months ago.
debug.onionservice.2.log.xz (1.1 MB) - added by cypherpunks2 15 months ago.

Change History (51)

comment:1 Changed 18 months ago by teor

Keywords: tor-hs added
Milestone: Tor: unspecified
Owner: teor deleted
Parent ID: #22455
Priority: High → Medium

The owner is the person who will fix the bug. That's not me.

This could be related to the rapid intro point retries in #21600.
It should be fixed as part of the retry / discovery review in #22455.

comment:2 Changed 18 months ago by asn

Keywords: tor-client added
Milestone: Tor: unspecified → Tor: 0.3.4.x-final

comment:3 Changed 18 months ago by cypherpunks

Keywords: tor-client removed
Milestone: Tor: 0.3.4.x-final → Tor: unspecified

Per #21600, the problem is that we are not failing fast. We are retrying fast, but learning nothing from successive attempts.

comment:4 Changed 18 months ago by cypherpunks

Milestone: Tor: unspecified → Tor: 0.3.4.x-final

Apologies, I reverted the milestone by accident (race condition).

Last edited 18 months ago by cypherpunks (previous) (diff)

comment:5 Changed 18 months ago by cypherpunks

Another interesting characteristic of this phenomenon is that its incidence is correlated across clients: Tor clients residing on different machines, in different parts of the Internet, run into the same behaviour of reaching out to introduction points many times in futility when trying to reach the same service. I suggest that this means there is something specific to the service that periodically breaks. Since restarting Tor almost always solves the problem, we might surmise that something cached is going stale.

This corroborates the inference that the introduction points are at fault. They are contained in the onion service descriptor, which is cached by the client.

comment:6 Changed 18 months ago by arma

Cc: dgoulet added

I think #21600 is not as related as you might think, since I think that ticket is for the onion service side.

Theory 1: the intro point went down or was otherwise unreachable; the Tor client made a circuit to try to reach it, but the circuit timed out, so we launch a new parallel circuit every second, rather than just once as intended. See intro_going_on_but_too_old in circuituse.c.

Also, while you're looking at that clause:

    /* Log an info message if we're going to launch a new intro circ in
     * parallel */
    if (purpose == CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT &&
        !must_be_open && origin_circ->hs_circ_has_timed_out &&
        !circ->marked_for_close) {
        intro_going_on_but_too_old = 1;
        continue;
    }

It isn't until later, in circuit_is_acceptable, which we don't call because of the "continue", that we check that the origin_circ->hs_circ_has_timed_out was for *our* intended intro point, right? What's up with that? I'm cc'ing dgoulet since he touched that code recently. I guess it isn't *so* bad, since that bug only controls whether we log a (potentially confusing) info-level log?

comment:7 Changed 18 months ago by arma

Original poster: can you upload an info (or debug) level log of this bug happening for you? The above snippet isn't much to go on. For example, it doesn't show your 47 retries or your 112 seconds.

Changed 18 months ago by cypherpunks

Attachment: debug.log.part.gz added

partial debug log

comment:8 Changed 18 months ago by cypherpunks

(Please let me know if the log that I provided is insufficient.)

comment:9 Changed 18 months ago by cypherpunks

Could it be that the circuit between the onion service and the intro point was broken at some point, and subsequently the client is able to extend to the intro point but nonetheless cannot connect?

Changed 18 months ago by cypherpunks

Attachment: debug.A.log.gz added

Changed 18 months ago by cypherpunks

Attachment: debug.B.log.gz added

comment:10 Changed 18 months ago by cypherpunks

I was also able to observe this again. Today, two separate Tor clients, located on two separate networks, tried to access the same hidden service. It turns out that they both ran into the looping behaviour when they each tried to access the same apparent introduction point, 185.22.172.237.

I am attaching 30-minute segments from both log files here, as debug.A.log.gz and debug.B.log.gz.

comment:11 Changed 18 months ago by cypherpunks

Digging a bit deeper into debug.A.log, we find that Tor tries a number of other intro points but ultimately keeps choosing 185.22.172.237. After 06:30:17, it is the only one remaining that has not been marked as failed:

$ grep "\(as intro point\)\|\(Failed to reach\)" debug.A.log | head -40
Apr 27 06:30:01.000 [info] circuit_get_open_circ_or_launch(): Chose $DC161F3B2FF30446A14B69A1E6493FD6300A45D2~$DC161F3B2FF30446A1 at 185.50.191.250 as intro point for '[scrubbed]'.
Apr 27 06:30:01.000 [info] circuit_get_open_circ_or_launch(): Chose $35E8B344F661F4F2E68B17648F35798B44672D7E~$35E8B344F661F4F2E6 at 146.0.32.144 as intro point for '[scrubbed]'.
Apr 27 06:30:01.000 [info] circuit_get_open_circ_or_launch(): Chose $813BD9A5FF580C9BDDD9767E85D4CF6FC868B20D~$813BD9A5FF580C9BDD at 82.118.17.235 as intro point for '[scrubbed]'.
Apr 27 06:30:04.000 [info] circuit_get_open_circ_or_launch(): Chose $8FA6BBB26DDD237A22DD11866D196357F29BC19E~$8FA6BBB26DDD237A22 at 134.119.217.250 as intro point for '[scrubbed]'.
Apr 27 06:30:05.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 1 times.
Apr 27 06:30:06.000 [info] circuit_get_open_circ_or_launch(): Chose $73A9297A6741DF33F4D1B7BB935B30B6A1D25B52~$73A9297A6741DF33F4 at 46.101.100.94 as intro point for '[scrubbed]'.
Apr 27 06:30:06.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 1 times.
Apr 27 06:30:07.000 [info] circuit_get_open_circ_or_launch(): Chose $E1D0BA78D83BC0A5E7F1A791973022AE15C3548E~$E1D0BA78D83BC0A5E7 at 185.100.86.249 as intro point for '[scrubbed]'.
Apr 27 06:30:07.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 2 times.
Apr 27 06:30:07.000 [info] circuit_get_open_circ_or_launch(): Chose $500FD47E7503B0C51EAD9E250A0119EBEE56EC5C~$500FD47E7503B0C51E at 173.212.248.165 as intro point for '[scrubbed]'.
Apr 27 06:30:09.000 [info] circuit_get_open_circ_or_launch(): Chose $FAA935B0BC3B4348CCBD3EBA95681AB51730977D~$FAA935B0BC3B4348CC at 89.163.225.7 as intro point for '[scrubbed]'.
Apr 27 06:30:10.000 [info] circuit_get_open_circ_or_launch(): Chose $FDAC8BA3ABFCC107D1B1EAC953F195BEEBA7FF54~$FDAC8BA3ABFCC107D1 at 62.210.84.34 as intro point for '[scrubbed]'.
Apr 27 06:30:11.000 [info] circuit_get_open_circ_or_launch(): Chose $3E515AB55BCFDD1E9ED1515B60223A08EE7710F0~$3E515AB55BCFDD1E9E at 144.76.61.48 as intro point for '[scrubbed]'.
Apr 27 06:30:11.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 3 times.
Apr 27 06:30:12.000 [info] circuit_get_open_circ_or_launch(): Chose $3E515AB55BCFDD1E9ED1515B60223A08EE7710F0~$3E515AB55BCFDD1E9E at 144.76.61.48 as intro point for '[scrubbed]'.
Apr 27 06:30:12.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 4 times.
Apr 27 06:30:12.000 [info] circuit_get_open_circ_or_launch(): Chose $B493C692A62D8D24E93DD50113653A4D0152B7C5~$B493C692A62D8D24E9 at 176.24.237.138 as intro point for '[scrubbed]'.
Apr 27 06:30:13.000 [info] circuit_get_open_circ_or_launch(): Chose $3E515AB55BCFDD1E9ED1515B60223A08EE7710F0~$3E515AB55BCFDD1E9E at 144.76.61.48 as intro point for '[scrubbed]'.
Apr 27 06:30:13.000 [info] circuit_get_open_circ_or_launch(): Chose $0F85E6DB0E4A3B6FDDF7C9452C9B5D089B11979D~$0F85E6DB0E4A3B6FDD at 31.31.73.222 as intro point for '[scrubbed]'.
Apr 27 06:30:13.000 [info] circuit_get_open_circ_or_launch(): Chose $E0EAA536856FBD3C3AB46C2BAA398E7CDFDAFF5D~$E0EAA536856FBD3C3A at 77.87.50.6 as intro point for '[scrubbed]'.
Apr 27 06:30:13.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 5 times. Removing from descriptor.
Apr 27 06:30:13.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 1 times.
Apr 27 06:30:15.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:15.000 [info] circuit_get_open_circ_or_launch(): Chose $99B6BE15B92A8746D14308F6C7FD84C28034335C~$99B6BE15B92A8746D1 at 188.230.183.231 as intro point for '[scrubbed]'.
Apr 27 06:30:17.000 [info] circuit_get_open_circ_or_launch(): Chose $D64366987CB39F61AD21DBCF8142FA0577B92811~$D64366987CB39F61AD at 37.221.162.226 as intro point for '[scrubbed]'.
Apr 27 06:30:20.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:22.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:25.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:27.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:29.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:31.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:34.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:36.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:38.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:40.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:42.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:45.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:48.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:51.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:53.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.

Notice that once we reach the intro point 185.22.172.237, Tor gets stuck. (The log message about choosing 185.22.172.237 occurs 75 more times between 06:30:53 and 06:33:59.) Note that we never notice that it is unreachable, as we do for some of the other intro points:

$ grep "Failed" debug.A.log | more
Apr 27 06:30:05.000 [info] circuit_about_to_free(): Failed intro circ [scrubbed] to [scrubbed] (building circuit to intro point). Marking intro point as possibly unreachable.
Apr 27 06:30:05.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 1 times.
Apr 27 06:30:06.000 [info] circuit_about_to_free(): Failed intro circ [scrubbed] to [scrubbed] (building circuit to intro point). Marking intro point as possibly unreachable.
Apr 27 06:30:06.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 1 times.
Apr 27 06:30:07.000 [info] circuit_about_to_free(): Failed intro circ [scrubbed] to [scrubbed] (building circuit to intro point). Marking intro point as possibly unreachable.
Apr 27 06:30:07.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 2 times.
Apr 27 06:30:08.000 [info] circuit_about_to_free(): Failed intro circ [scrubbed] to [scrubbed] (building circuit to intro point). Marking intro point as possibly unreachable.
Apr 27 06:30:09.000 [info] circuit_about_to_free(): Failed intro circ [scrubbed] to [scrubbed] (building circuit to intro point). Marking intro point as possibly unreachable.
Apr 27 06:30:10.000 [info] circuit_about_to_free(): Failed intro circ [scrubbed] to [scrubbed] (building circuit to intro point). Marking intro point as possibly unreachable.
Apr 27 06:30:11.000 [info] circuit_about_to_free(): Failed intro circ [scrubbed] to [scrubbed] (building circuit to intro point). Marking intro point as possibly unreachable.
Apr 27 06:30:11.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 3 times.
Apr 27 06:30:12.000 [info] circuit_about_to_free(): Failed intro circ [scrubbed] to [scrubbed] (building circuit to intro point). Marking intro point as possibly unreachable.
Apr 27 06:30:12.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 4 times.
Apr 27 06:30:13.000 [info] circuit_about_to_free(): Failed intro circ [scrubbed] to [scrubbed] (building circuit to intro point). Marking intro point as possibly unreachable.
Apr 27 06:30:13.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 5 times. Removing from descriptor.
Apr 27 06:30:13.000 [info] circuit_about_to_free(): Failed intro circ [scrubbed] to [scrubbed] (building circuit to intro point). Marking intro point as possibly unreachable.
Apr 27 06:30:13.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 1 times.
Apr 27 06:30:15.000 [info] circuit_about_to_free(): Failed intro circ [scrubbed] to [scrubbed] (building circuit to intro point). Marking intro point as possibly unreachable.

One of the problems with these logs is that my Tor is trying to connect to several different onion services around the same time. As you can see here, the descriptors for three of the hidden services in question are actually re-fetched, as they ought to be, although after 06:30:18 there is "1 options left" for the service that is stuck with 185.22.172.237. (Note: it might be nice to have some kind of unique identifier, perhaps the stream ID, for each of these hidden services, so that we can correlate them with each other in the log, rather than just seeing the same "[scrubbed]" for each.)

$ grep -i "rend_client_report_intro_point_failure" debug.A.log | more
Apr 27 06:30:04.000 [info] rend_client_report_intro_point_failure(): 5 options left for [scrubbed].
Apr 27 06:30:04.000 [info] rend_client_report_intro_point_failure(): 9 options left for [scrubbed].
Apr 27 06:30:05.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 1 times.
Apr 27 06:30:05.000 [info] rend_client_report_intro_point_failure(): 5 options left for [scrubbed].
Apr 27 06:30:05.000 [info] rend_client_report_intro_point_failure(): 8 options left for [scrubbed].
Apr 27 06:30:05.000 [info] rend_client_report_intro_point_failure(): 4 options left for [scrubbed].
Apr 27 06:30:06.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 1 times.
Apr 27 06:30:06.000 [info] rend_client_report_intro_point_failure(): 6 options left for [scrubbed].
Apr 27 06:30:06.000 [info] rend_client_report_intro_point_failure(): 3 options left for [scrubbed].
Apr 27 06:30:06.000 [info] rend_client_report_intro_point_failure(): 5 options left for [scrubbed].
Apr 27 06:30:07.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 2 times.
Apr 27 06:30:07.000 [info] rend_client_report_intro_point_failure(): 3 options left for [scrubbed].
Apr 27 06:30:07.000 [info] rend_client_report_intro_point_failure(): 7 options left for [scrubbed].
Apr 27 06:30:07.000 [info] rend_client_report_intro_point_failure(): 4 options left for [scrubbed].
Apr 27 06:30:08.000 [info] rend_client_report_intro_point_failure(): 6 options left for [scrubbed].
Apr 27 06:30:08.000 [info] rend_client_report_intro_point_failure(): 6 options left for [scrubbed].
Apr 27 06:30:08.000 [info] rend_client_report_intro_point_failure(): 2 options left for [scrubbed].
Apr 27 06:30:08.000 [info] rend_client_report_intro_point_failure(): 3 options left for [scrubbed].
Apr 27 06:30:08.000 [info] rend_client_report_intro_point_failure(): 5 options left for [scrubbed].
Apr 27 06:30:09.000 [info] rend_client_report_intro_point_failure(): 3 options left for [scrubbed].
Apr 27 06:30:09.000 [info] rend_client_report_intro_point_failure(): 1 options left for [scrubbed].
Apr 27 06:30:09.000 [info] rend_client_report_intro_point_failure(): 4 options left for [scrubbed].
Apr 27 06:30:09.000 [info] rend_client_report_intro_point_failure(): 2 options left for [scrubbed].
Apr 27 06:30:10.000 [info] rend_client_report_intro_point_failure(): 4 options left for [scrubbed].
Apr 27 06:30:10.000 [info] rend_client_report_intro_point_failure(): 1 options left for [scrubbed].
Apr 27 06:30:10.000 [info] rend_client_report_intro_point_failure(): 3 options left for [scrubbed].
Apr 27 06:30:11.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 3 times.
Apr 27 06:30:11.000 [info] rend_client_report_intro_point_failure(): 1 options left for [scrubbed].
Apr 27 06:30:11.000 [info] rend_client_report_intro_point_failure(): No more intro points remain for [scrubbed]. Re-fetching descriptor.
Apr 27 06:30:12.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 4 times.
Apr 27 06:30:12.000 [info] rend_client_report_intro_point_failure(): 1 options left for [scrubbed].
Apr 27 06:30:13.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 5 times. Removing from descriptor.
Apr 27 06:30:13.000 [info] rend_client_report_intro_point_failure(): No more intro points remain for [scrubbed]. Re-fetching descriptor.
Apr 27 06:30:13.000 [info] rend_client_report_intro_point_failure(): Failed to reach this intro point 1 times.
Apr 27 06:30:13.000 [info] rend_client_report_intro_point_failure(): 6 options left for [scrubbed].
Apr 27 06:30:15.000 [info] rend_client_report_intro_point_failure(): No more intro points remain for [scrubbed]. Re-fetching descriptor.
Apr 27 06:30:15.000 [info] rend_client_report_intro_point_failure(): 2 options left for [scrubbed].
Apr 27 06:30:18.000 [info] rend_client_report_intro_point_failure(): 1 options left for [scrubbed].

Note also that in circuit_about_to_free in circuitlist.c, we never seem to record any timeouts for the bad intro point, despite the fact that we are deciding that it is too old (see the ticket description and comment:6 above):

    if (orig_reason != END_CIRC_REASON_IP_NOW_REDUNDANT &&
        ocirc->rend_data) {
      /* treat this like getting a nack from it */
      log_info(LD_REND, "Failed intro circ %s to %s (awaiting ack). %s",
          safe_str_client(rend_data_get_address(ocirc->rend_data)),
          safe_str_client(build_state_get_exit_nickname(ocirc->build_state)),
          timed_out ? "Recording timeout." : "Removing from descriptor.");
      rend_client_report_intro_point_failure(ocirc->build_state->chosen_exit,
                                             ocirc->rend_data,
                                             timed_out ?
                                             INTRO_POINT_FAILURE_TIMEOUT :
                                             INTRO_POINT_FAILURE_GENERIC);
    }

The circuits are indeed reaped. For example, circuit 878 is freed about one second after its purpose is reassigned:

Apr 27 06:30:20.000 [info] circuit_get_open_circ_or_launch(): Chose $0074ECA82BD58B8BB1909C9C4F237FD9779B23FC~$0074ECA82BD58B8BB1 at 185.22.172.237 as intro point for '[scrubbed]'.
Apr 27 06:30:20.000 [debug] circuit_find_to_cannibalize(): Hunting for a circ to cannibalize: purpose 6, uptime 0, capacity 1, internal 1
Apr 27 06:30:20.000 [info] circuit_launch_by_extend_info(): Cannibalizing circ 3959015402 (id: 878) for purpose 6 (Hidden service client: Connecting to intro point)
Apr 27 06:30:20.000 [debug] circuit_change_purpose(): changing purpose of origin circ 878 from "General-purpose client" (5) to "Hidden service client: Connecting to intro point" (6)

[...]

Apr 27 06:30:21.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x125e268 for circ_id 3959015402, channel ID 174 (0x1afb648)
Apr 27 06:30:21.000 [debug] circuitmux_append_destroy_cell(): Cmux at 0x14fd5a0 queued a destroy for circ 3959015402, cmux counter is now 1, global counter is now 1
Apr 27 06:30:21.000 [debug] circuitmux_append_destroy_cell(): Primed a buffer.
Apr 27 06:30:21.000 [debug] channel_write_cell_generic_(): Writing p 0x14838f0 to channel 0x1afb648 with global ID 174
Apr 27 06:30:21.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x125e268 for circ_id 3959015402, channel ID 174 (0x1afb648)
Apr 27 06:30:21.000 [debug] channel_tls_get_overhead_estimate_method(): Estimated overhead ratio for TLS chan 174 is 1.040139
Apr 27 06:30:21.000 [debug] channel_update_xmit_queue_size(): Increasing queue size for channel 174 by 534 from 0 to 534
Apr 27 06:30:21.000 [debug] channel_update_xmit_queue_size(): Increasing global queue size by 534 for channel 174, new size is 534
Apr 27 06:30:21.000 [debug] circuitmux_notify_xmit_destroy(): Cmux at 0x14fd5a0 sent a destroy, cmux counter is now 0, global counter is now 0
Apr 27 06:30:21.000 [debug] channel_send_destroy(): Sending destroy (circID 3959015402) on channel 0x1afb648 (global ID 174)
Apr 27 06:30:21.000 [info] circuit_free(): Circuit 0 (id: 878) has been freed.

But why is the timeout not recorded? Looking at circuit_about_to_free in circuitlist.c, have we not set ocirc->rend_data? Or have we failed to set the purpose to CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT? (I would have guessed that we do this in send_introduce1 in circuituse.c.) More investigation would be useful; I would surmise that adding debugging statements to circuit_about_to_free would be the next step.
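
For concreteness, here is a minimal sketch of the kind of debugging statement meant here. It uses only fields and helpers that already appear elsewhere in circuitlist.c (global_identifier, rend_data, circuit_purpose_to_string); the exact placement within circuit_about_to_free is an open question:

    /* Sketch: near the top of circuit_about_to_free() in circuitlist.c, log
     * the purpose and whether rend_data is set, to see why the intro point
     * failure branch below is (or is not) reached. */
    if (CIRCUIT_IS_ORIGIN(circ)) {
      origin_circuit_t *ocirc = TO_ORIGIN_CIRCUIT(circ);
      log_info(LD_REND, "Freeing origin circuit %u: purpose=%s (%d), "
               "rend_data is %s",
               (unsigned)ocirc->global_identifier,
               circuit_purpose_to_string(circ->purpose), circ->purpose,
               ocirc->rend_data ? "set" : "NULL");
    }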

comment:12 Changed 18 months ago by teor

(note: it might be nice to have some kind of unique identifier, perhaps the stream ID, for each of these hidden services, so that we can correlate them with each other in the log, rather than just see the same "[scrubbed]" for each):

There are no stream IDs until the rendezvous circuit opens.
The intro circuit IDs will probably change.
We would probably want something like the SOCKS request ID, if it exists. But even that could change.

Changed 18 months ago by cypherpunks

Attachment: tor-25882.pdf added

Changed 18 months ago by cypherpunks

Attachment: tor-25882.png added

comment:13 Changed 18 months ago by cypherpunks

The snippet of code in comment:11 is not called unless the circuit is in CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT. It would be interesting to confirm the purpose of the circuit just before it is deleted. More generally, it would be interesting to confirm how the circuit purposes change for the circuits that are in place during the period covered by the logs after we reach a bad intro point. In the case of debug.A.log, we can list all of the state transitions after 06:30:20:

$ tail -91500 debug.A.log | grep "changing purpose" | sed -e "s/.*[0-9] from/from/g" | sort | uniq -c 

     89 from "General-purpose client" (5) to "Hidden service client: Connecting to intro point" (6)
     89 from "General-purpose client" (5) to "Hidden service client: Establishing rendezvous point" (9)
      3 from "General-purpose client" (5) to "Measuring circuit timeout" (13)
     89 from "Hidden service client: Connecting to intro point" (6) to "Hidden service client: Waiting for ack from intro point" (7)
     89 from "Hidden service client: Establishing rendezvous point" (9) to "Hidden service client: Pending rendezvous point" (10)
     89 from "Hidden service client: Pending rendezvous point" (10) to "Hidden service client: Pending rendezvous point (ack received)" (11)
     89 from "Hidden service client: Waiting for ack from intro point" (7) to "Hidden service client: Received ack from intro point" (8)

We can represent the purpose changes in a state transition diagram (see the attached tor-25882.png and tor-25882.pdf), where the numbers in the circles represent the purposes and the values on the edges represent the number of state transitions observed.


For reference, here is a table containing the states:

State Tag in or.h Description
5 CIRCUIT_PURPOSE_C_GENERAL "General-purpose client"
6 CIRCUIT_PURPOSE_C_INTRODUCING "Hidden service client: Connecting to intro point"
7 CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT "Hidden service client: Waiting for ack from intro point"
8 CIRCUIT_PURPOSE_C_INTRODUCE_ACKED "Hidden service client: Received ack from intro point"
9 CIRCUIT_PURPOSE_C_ESTABLISH_REND "Hidden service client: Establishing rendezvous point"
10 CIRCUIT_PURPOSE_C_REND_READY "Hidden service client: Pending rendezvous point"
11 CIRCUIT_PURPOSE_C_REND_READY_INTRO_ACKED "Hidden service client: Pending rendezvous point (ack received)"
12 CIRCUIT_PURPOSE_C_REND_JOINED "Hidden service client: Active rendezvous point"
13 CIRCUIT_PURPOSE_C_MEASURE_TIMEOUT "Measuring circuit timeout"

We can conclude from the state transitions that circuits are not in fact "freed" while they are in state 7 (CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT). The reaped circuits are instead in either state 8 (CIRCUIT_PURPOSE_C_INTRODUCE_ACKED, in the case of circuits to the intro point) or state 11 (CIRCUIT_PURPOSE_C_REND_READY_INTRO_ACKED, in the case of circuits to the rendezvous point). Each intro point circuit has a "sister" circuit to the rendezvous point. Reaping an intro circuit from state 8 is fine, since that circuit exists only to request the rendezvous. However, rendezvous circuits in state 11 should transition to state 12 (CIRCUIT_PURPOSE_C_REND_JOINED), and their failure to do so indicates that somehow the intro point has failed to correctly instruct the onion service to rendezvous. It is therefore not sufficient to require circuits to be in state 7 when reporting a timeout; we should also check for state 11.

If a circuit is reaped whilst in state 11, then it is also in a timeout state worthy of reporting. Presumably, it will still be possible to report the failure of an intro point by reporting the failure of a rendezvous circuit dependent upon that intro point to complete the connection.

I therefore suggest that we consider the following change:

--- circuitlist.c.orig  2018-04-30 23:02:07.322368461 +0000
+++ circuitlist.c       2018-04-30 23:03:01.721876917 +0000
@@ -2004,7 +2004,8 @@
      orig_reason);
   }
 
-  if (circ->purpose == CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT) {
+  if (circ->purpose == CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT ||
+      circ->purpose == CIRCUIT_PURPOSE_C_REND_READY_INTRO_ACKED) {
     origin_circuit_t *ocirc = TO_ORIGIN_CIRCUIT(circ);
     int timed_out = (reason == END_CIRC_REASON_TIMEOUT);
     tor_assert(circ->state == CIRCUIT_STATE_OPEN);

What do you think?

Last edited 18 months ago by cypherpunks (previous) (diff)

Changed 18 months ago by cypherpunks

Attachment: tmp_4396-tor-258823254419993784754180.pdf added

comment:14 Changed 18 months ago by cypherpunks

I ran Tor for a few days with the patch suggested in comment:13.

Indeed, we did see circuits reaped in state 11, so messages like this appeared in the log after each one:

Failed intro circ [scrubbed] to [scrubbed] (awaiting ack). Removing from descriptor.

These messages were followed by messages like this, where "2" might be another number:

rend_client_report_intro_point_failure(): 2 options left for [scrubbed].

Unfortunately the number of "options" left never decreased. I suspect that the reason lies in this code in circuitlist.c:

      /* treat this like getting a nack from it */
      log_info(LD_REND, "Failed intro circ %s to %s (awaiting ack). %s",
          safe_str_client(rend_data_get_address(ocirc->rend_data)),
          safe_str_client(build_state_get_exit_nickname(ocirc->build_state)),
          timed_out ? "Recording timeout." : "Removing from descriptor.");
      rend_client_report_intro_point_failure(ocirc->build_state->chosen_exit,
                                             ocirc->rend_data,
                                             timed_out ?
                                             INTRO_POINT_FAILURE_TIMEOUT :
                                             INTRO_POINT_FAILURE_GENERIC);

When rend_client_report_intro_point_failure is called, the first argument is the chosen exit of this circuit, which in the case of rendezvous circuits is the rendezvous point, not the intro point. However, it is the intro point that is in the descriptor, and (based on the above) it is the intro point that is indeed problematic.

Yes, it is the failure of the REND circuit that in this case indicates that the intro point is bad.

If it is indeed the case that Tor sets up a REND circuit and an INTRODUCE circuit in one-to-one correspondence, as suggested by the fact that 89 circuits each were in REND and INTRODUCE mode in the log fragment analysed in comment:13, then is it possible to determine, from the internal data structure of a given REND circuit, to which intro point it corresponds? If so, then we could pass that intro point as the first argument to rend_client_report_intro_point_failure instead.
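
One way to make that association explicit (a sketch, not a tested patch): the client builds the intro and rend circuits as a pair, so the rend circuit could simply remember the intro point when the introduction is sent. The intro_point field below is hypothetical; origin_circuit_t has no such member today:

    /* Hypothetical addition to origin_circuit_t (or.h): the intro point on
     * which this client-side rendezvous circuit depends. */
    extend_info_t *intro_point;

    /* Presumably set in rend_client_send_introduction(introcirc, rendcirc),
     * once the INTRODUCE1 cell has been sent: copy the chosen intro point
     * onto the matching rend circuit. */
    rendcirc->intro_point =
      extend_info_dup(introcirc->build_state->chosen_exit);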

comment:15 Changed 18 months ago by asn

Thanks for all the information here. Really useful, and there are indeed nasty bugs hiding here!

IMO, there is a service-side bug that is causing this issue. Namely, for some reason, the HS has an introduction circuit open which it ignores: it does not reply to rendezvous requests coming from it. At the same time, it seems like the HS has closed all its other intro circs and rotated its descriptor. This might be related to #16702 and #17037, or not.

There is also a client-side issue, which is that the client keeps retrying an introduction point that clearly does not work. The tor client should be smarter in this case: it should recognize the failure and re-fetch the descriptor. But we should be careful not to cause needless HS desc fetches just because of broken rendezvous points or introduction points.

IMO, we should first address the service-side issue, which seems to be the root issue here, and then address the client-side issue.

To debug the service-side issue, we should try to reproduce it by setting up an HS, making it rotate descriptors and expire intro points more quickly than usual (see remove_invalid_intro_points()), and then trying to connect to it to see whether we can trigger this behavior. FWIW, there is a whole mechanism in rendservice.c for expired intro points, whereby we keep the intro circs open for a while, which I imagine is the root cause of this issue (see expiring_nodes).
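
For what it's worth, a rough sketch of the kind of local testing tweak this suggests, assuming the v2 intro point lifetime is governed by the INTRO_POINT_LIFETIME_{MIN,MAX}_SECONDS macros and their usual 18-24 hour values (names and values are assumptions; whatever actually controls rotation in the tree being tested is what should be shortened):

--- or.h.orig
+++ or.h
@@
-#define INTRO_POINT_LIFETIME_MIN_SECONDS (18*60*60)
-#define INTRO_POINT_LIFETIME_MAX_SECONDS (24*60*60)
+/* Testing only: rotate intro points every few minutes so the stale-descriptor
+ * race is much easier to reproduce. */
+#define INTRO_POINT_LIFETIME_MIN_SECONDS (2*60)
+#define INTRO_POINT_LIFETIME_MAX_SECONDS (5*60)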

Last edited 18 months ago by asn (previous) (diff)

comment:16 Changed 17 months ago by nickm

Owner: set to dgoulet

comment:17 Changed 17 months ago by dgoulet

Cc: dgoulet removed

Looking at debug.A.log, what I see is happening is:

  1. Client opens a rendezvous circuit.
  2. Client opens an intro circuit.
  3. Client establishes both and waits for the intro ACK.
  4. Client gets the intro ACK, closes that circuit.
  5. Client waits 2 seconds, considers the RP circuit expired, and closes it.
  6. Repeat (1).

Rinse and repeat, non-stop. So the same intro point being picked repeatedly is normal, because it is the only one that is working and doesn't time out or fail. Note, there might be other bugs, but this is my initial observation.

Overall, the rendezvous circuit ends up in the CIRCUIT_PURPOSE_C_REND_READY_INTRO_ACKED state, but we then don't apply a very long cutoff when deciding whether to expire the circuit. Considering a service under heavy load, should we expand that cutoff considerably, so we leave a bit more time for the service to reach us?

Downside: the service might just be ignoring the client, because it might have launched too many rendezvous circuits or is out of memory. We don't have an ACK that is sent from the service to the intro point to inform the client that "yes, we are about to launch the RP circuit, so please be patient".

I think there are a couple of things we might want to consider.

  a. Extend the timeout cutoff, client side, of an established rendezvous circuit that is waiting for the service (CIRCUIT_PURPOSE_C_REND_READY_INTRO_ACKED), so we give the service a bit of room to communicate. A 2-second timeout (as in this case) is imo very short, especially considering a popular service...
  b. Some sort of upper threshold on how many connection attempts we make to a service before we stop trying. For instance, we could consider that after 3 of these back-and-forths, we stop trying and close the SOCKS connection with an "Unable to connect" kind of error.

Right now, it will do the above dance for 120 seconds (the SOCKS port default timeout), so it puts pressure on the network where we could have stopped after 3 attempts.

I personally think we should maybe aim for a combination of (a) and (b).
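
For concreteness, a minimal sketch of what idea (a) might look like, in the spirit of the cutoff selection done in circuit_expire_building() in circuituse.c (the variable names and the 30-second figure below are illustrative assumptions, not the real code):

    /* Sketch: when picking the expiry cutoff for a client circuit, give a
     * rendezvous circuit that has already seen its intro ACK
     * (CIRCUIT_PURPOSE_C_REND_READY_INTRO_ACKED) a much longer grace period
     * than the ~2 seconds observed in the logs, so a loaded service has time
     * to build its side of the rendezvous. */
    if (victim->purpose == CIRCUIT_PURPOSE_C_REND_READY_INTRO_ACKED) {
      cutoff = rend_ready_intro_acked_cutoff;  /* hypothetical, e.g. ~30 s */
    }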

comment:18 in reply to:  17 ; Changed 17 months ago by asn

Replying to dgoulet:

I think, there are couple things we might want to consider.

  a. Extend the timeout cutoff, client side, of an established rendezvous circuit that is waiting for the service (CIRCUIT_PURPOSE_C_REND_READY_INTRO_ACKED), so we give the service a bit of room to communicate. A 2-second timeout (as in this case) is imo very short, especially considering a popular service...
  b. Some sort of upper threshold on how many connection attempts we make to a service before we stop trying. For instance, we could consider that after 3 of these back-and-forths, we stop trying and close the SOCKS connection with an "Unable to connect" kind of error.

Right now, it will do the above dance for 120 seconds (the SOCKS port default timeout), so it puts pressure on the network where we could have stopped after 3 attempts.

I personally think we should maybe aim for a combination of (a) and (b).

The approach seems reasonable, but I'm not sure whether it addresses the root issue, which is that the service has an intro circuit open to an intro point yet does not respond to requests arriving through it (the client connects fine if it fetches a new HS desc). Do you think that's because of the small 2-second timeout, or could it be another bug?

comment:19 in reply to:  18 Changed 17 months ago by dgoulet

Replying to asn:

The approach seems reasonable, but I'm not sure whether it addresses the root issue, which is that the service has an intro circuit open to an intro point yet does not respond to requests arriving through it (the client connects fine if it fetches a new HS desc). Do you think that's because of the small 2-second timeout, or could it be another bug?

Hmmm... I'm not seeing that in the logs. The intro point is fine and does relay the INTRO cell to the service; then the client gets the ACK from the intro point and waits on the RP.

The issue seems to be that the service does NOT go to the RP, and that ultimately makes the RP circuit on the *client* side time out, *but* quite fast (2 sec).

comment:20 Changed 17 months ago by cypherpunks

I would suggest that the issue is that after the intro point sends the ACK, something goes wrong such that the HS does not establish a circuit to the RP. There are two possible "server-side" reasons for this:

  1. that the intro point fails to tell the HS to connect to the RP, because either (a) the circuit between the HS and the intro point is not working or (b) the intro point is misbehaving due to a bug or malice.
  2. that the HS fails to extend the circuit to the RP, because either (a) the HS cannot extend a circuit to the RP because it is somehow resource-bound or limited, (b) the HS cannot extend a circuit to the RP because it cannot establish new TCP connections to its designated guards, or (c) the HS is misbehaving due to a bug or malice.

I would suggest that (2) is unlikely, since fetching a new HS descriptor usually solves the problem for the client, although it is not inconceivable.

Approaches (a) and (b) described by dgoulet might offer some relief at the margins, particularly if the misbehaviour is amplified by CPU load or network load, but I would like to suggest that asn is right that they do not address either of the two possible underlying "server-side" problems.

Approaches (a) and (b) also do not fully address the "client-side" problem, which is generally resolved by trying new intro points. Ideally, part of the solution from the client's perspective would involve fetching the HS descriptor again, even if only once per stream.

Proposal A. I would like to propose that a client facing this behaviour should require some kind of verification from the intro point that it has received confirmation from the HS that it has received the instruction to extend to the RP (preferably signed by the HS) or even that it has successfully extended to the RP (preferably signed by the RP and the HS). The client would not need to wait for this confirmation to establish a circuit to the RP. If the client does not receive that confirmation within some designated period of time, then it would close its circuit to the intro point, close its circuit to the RP, and remove the intro point from its list to try. I believe that this would be enough to trigger the client to fetch the descriptor if needed.

Proposal B. A "quick and dirty" client-side solution might have the client just maintain a pointer to the intro point in the data structure for its circuit to the RP, and remove that intro point from its list whenever the circuit to the RP is killed in state 11. This would not be elegant and might have some false positives, although I think it would do the job.
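
Building on the hypothetical intro_point field sketched under comment:14, the reaping side of Proposal B might look roughly like the following, inside the failure-reporting clause of circuit_about_to_free (again a sketch under those assumptions, not a tested patch):

    /* Sketch: if a rend circuit dies while still waiting for the service
     * (state 11), blame the remembered intro point rather than the circuit's
     * own chosen exit (which is the rendezvous point). */
    if (circ->purpose == CIRCUIT_PURPOSE_C_REND_READY_INTRO_ACKED &&
        ocirc->rend_data && ocirc->intro_point) {
      log_info(LD_REND, "Rend circuit to %s died waiting for the service; "
               "treating its intro point as failed.",
               safe_str_client(rend_data_get_address(ocirc->rend_data)));
      rend_client_report_intro_point_failure(ocirc->intro_point,
                                             ocirc->rend_data,
                                             INTRO_POINT_FAILURE_GENERIC);
    }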

Of course we should solve the "server-side" problem as well. I think that if we start with Proposal A, then we will also have a way to systematically debug the "server-side" issue. Proposal B might also be appropriate if an upcoming version of the HS circuit setup protocol is significantly different.

Last edited 17 months ago by cypherpunks (previous) (diff)

comment:21 Changed 17 months ago by nickm

Keywords: 034-deferred-20180602 added
Milestone: Tor: 0.3.4.x-final → Tor: 0.3.5.x-final

Deferring non-must tickets to 0.3.5

comment:22 Changed 16 months ago by dgoulet

Milestone: Tor: 0.3.5.x-final → Tor: unspecified

comment:23 Changed 16 months ago by dgoulet

Keywords: 035-removed added

Changed 15 months ago by cypherpunks2

Attachment: debug.onionservice.log.xz added

comment:24 Changed 15 months ago by cypherpunks2

The file debug.onionservice.log.xz contains a (compressed via XZ) log from a Tor running two onion services, one v2 service and one v3 service. Both of these services started at roughly 08:30 local time on 2018-07-10, although the log is from 2018-07-22 only.

Since that time, a separate Tor client running on another network has tested circuits to both onion services twice per hour, once at :00 and once at :30. (As of now, that testing is still ongoing.) During each test, the client would attempt to connect to the service up to four times in succession and then log a failure if it could never connect.

The first failure was observed when the client could not reach the v3 service on the seven occasions between 20:00 and 23:00, inclusive, on 2018-07-22.

The log shows how the onion service handled the failure. Hopefully, this provides insight into the problem. (Note that the client succeeded in reaching the onion service at 23:30.)

comment:25 Changed 15 months ago by cypherpunks2

Note that in the server log file offered in the previous comment, a critical difference between the attempts that succeed and the attempts that fail is that in the (v3) attempts that succeed, we see log lines like this:

... [info] hs_cell_parse_introduce2(): Received a decodable INTRODUCE2 cell ...

In the attempts that fail, we do not see such lines. I believe that this suggests that the onion service is not hearing from the intro point (or that it is not understanding what it is hearing).

Why is this happening?

comment:26 Changed 15 months ago by asn

Thanks for looking into this! That's very useful info.

I'll try to take a look soon, but with August coming up availability will be lower.

Changed 15 months ago by cypherpunks2

Attachment: debug.onionservice.2.log.xz added

comment:27 Changed 15 months ago by cypherpunks2

The file debug.onionservice.2.log.xz contains a pair of new failures, one for the v2 service and one for the v3 service, observed at about 21:00 local time.

Notice the absence of the receipt of an INTRODUCE2 cell in both cases.

It seems that our bug affects both v2 and v3 services, and the symptom is that the onion service does not receive an INTRODUCE2 cell, i.e. it does not hear from the introduction point.

So, two thoughts: what is the mechanism by which the set of introduction points known to a client is kept synchronised with the set of "live" introduction points maintained by an onion service? Note that a descriptor held by a client may become outdated, a descriptor held by the database may become outdated, and circuits maintained by the onion service may stop working...

Also, what is to stop a malicious (or buggy) introduction point from sending an ACK to a client but never reaching out to the onion service?

Last edited 15 months ago by cypherpunks2 (previous) (diff)

comment:28 Changed 14 months ago by asn

Priority: Medium → High

This is a reachability issue for HSes. Marking as High priority.

comment:29 Changed 14 months ago by asn

Keywords: reachability added

comment:30 Changed 14 months ago by dmr

Cc: dmr added

comment:31 in reply to:  27 ; Changed 14 months ago by asn

Replying to cypherpunks2:

So, two thoughts: what is the mechanism by which the set of introduction points known to a client is kept synchronised with the set of "live" introduction points maintained by an onion service? Note that a descriptor held by a client may become outdated, a descriptor held by the database may become outdated, and circuits maintained by the onion service may stop working...

Hm. A descriptor held by a client can become outdated indeed. A client is supposed to re-fetch a descriptor if all intro points fail (see send_introduce1()). From what I remember in this bug, we sometimes don't mark an intro point as failing, so we don't refetch the desc?

Also, what is to stop a malicious (or buggy) introduction point from sending an ACK to a client but never reaching out to the onion service?

Nothing. This is indeed a protocol issue. A potential protocol fix here would be to require an end-to-end ACK all the way to the client from the service.

Doing that will rule out the above malicious case (which does not seem very useful from an adversary PoV), and also help us debug bugs like this. However, it will not help with issues like the outgoing rendezvous circuit timing out or failing on the service-side (probably what's happening here), and it will also slightly delay all rendezvous circuit setups because the ACK has to be end-to-end.

comment:32 in reply to:  31 Changed 14 months ago by cypherpunks2

Replying to asn:

Hm. A descriptor held by a client can become outdated indeed. A client is supposed to re-fetch a descriptor if all intro points fail (see send_introduce1()). From what I remember in this bug, we sometimes don't mark an intro point as failing, so we don't refetch the desc?

Yes, the problem is that clients do not mark an intro point as having failed when the only evidence that it has failed is that the corresponding rendezvous point is not working. And we know that from the perspective of a client, the failure of a rendezvous point is symptomatic of the case in which an onion service does not hear from the corresponding intro point in the first place. For a client to mark an intro point as having failed in this circumstance, it would need to maintain some kind of association between a rendezvous point and its corresponding intro point, for example as described in Proposal B, such that the client will correctly mark the intro point as having failed and re-fetch the onion service descriptor as required.

Nothing. This is indeed a protocol issue. A potential protocol fix here would be to require an end-to-end ACK all the way to the client from the service.

Doing that will rule out the above malicious case (which does not seem very useful from an adversary PoV), and also help us debug bugs like this. However, it will not help with issues like the outgoing rendezvous circuit timing out or failing on the service-side (probably what's happening here), and it will also slightly delay all rendezvous circuit setups because the ACK has to be end-to-end.

A small correction to your last sentence: the problem is not that the rendezvous circuit is failing on the service-side, but that the intro circuit is failing.

It is a useful attack if the adversary wants to degrade the performance of onion services network-wide by making them inaccessible from clients that would otherwise need to re-fetch the descriptor for onion services. If a client extends to a buggy or malicious intro point of the type described here, then it will not re-fetch the descriptor, or at least not until its time-based expiration, since from its perspective at least one intro point is still working. Considering that the need to re-fetch the descriptor prior to its time-based expiration is a necessary and expected part of normal operation for a client, the ability for an intro point to behave this way constitutes:

  • a significant denial-of-service risk to onion services, perhaps particularly those that expect multiple visits from the same client within the lifetime of its service descriptor, and
  • unnecessary network load and possible degradation of anonymity to clients who establish many circuits in rapid succession as they continue to attempt to reach the onion service without re-fetching its descriptor first.

So this is not benign.

Also, it is not true that the only possible fix is to have the client wait for the end-to-end ACK from the service. The intro point could deliver two (2) ACK messages, the first being the one that is currently given and the second being the end-to-end ACK that you describe. After receiving the first ACK, the client could reach out to the rendezvous point optimistically, and then if it does not hear the second ACK (i.e. the one from the service via the intro point) within a particular period of time, then it would mark the intro point as unreachable, toward re-fetching the descriptor for the service. That is the essence of Proposal A and ultimately my recommendation for a long-term solution.
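
A rough client-side sketch of the timeout handling Proposal A implies (everything here is hypothetical: the second, service-originated ACK does not exist in the current protocol, and the constant and callback below are invented for illustration):

    /* Hypothetical: after the first (intro point) ACK, the client proceeds
     * optimistically but arms a timer; if the end-to-end ACK from the service
     * never arrives, treat the intro point as failed and tear down both
     * circuits, so the usual "no intro points left -> re-fetch the
     * descriptor" path can kick in. */
    #define HS_E2E_ACK_TIMEOUT_SECS 30  /* illustrative value */

    static void
    hs_e2e_ack_timeout_cb(origin_circuit_t *introcirc, origin_circuit_t *rendcirc)
    {
      rend_client_report_intro_point_failure(introcirc->build_state->chosen_exit,
                                             introcirc->rend_data,
                                             INTRO_POINT_FAILURE_TIMEOUT);
      circuit_mark_for_close(TO_CIRCUIT(introcirc), END_CIRC_REASON_TIMEOUT);
      circuit_mark_for_close(TO_CIRCUIT(rendcirc), END_CIRC_REASON_TIMEOUT);
    }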

Last edited 14 months ago by cypherpunks2 (previous) (diff)

comment:33 Changed 7 months ago by pili

Sponsor: Sponsor27

comment:34 Changed 7 months ago by sysrqb

For the record, I ran into this bug yesterday/today with one of the Debian dist onion services. I have some debug logs from the client side showing it picked the same intro point for every request. I haven't looked at the logs much besides this, though. I'll save the logs in case they are useful later. After I sent the process a NEWNYM signal (thanks asn), the descriptor was re-fetched and a connection was successfully established.

$ tor --version
Tor version 0.3.5.8.
$ cat /etc/debian_version 
9.8
$ dpkg -l tor | grep -e Name -e tor
||/ Name           Version                 Architecture Description
ii  tor            0.3.5.8-1~d90.stretch+1 amd64        anonymizing overlay network for TCP

comment:35 Changed 7 months ago by asn

Sponsor: Sponsor27 → Sponsor27-must

comment:36 Changed 6 months ago by asn

Points: 16

Assigning 15 points to this one. We need proper time to debug this and think of good solutions. This is a major reachability issue right now.

comment:37 Changed 6 months ago by pili

Parent ID: #22455 → #29995

comment:38 Changed 6 months ago by asn

also related to #22455

comment:39 Changed 6 months ago by asn

Parent ID: #29995 → #30200

comment:40 Changed 4 months ago by gaba

Owner: dgoulet deleted

dgoulet will assign himself to the ones he is working on right now.

comment:41 Changed 3 months ago by gaba

Keywords: network-team-roadmap-september added; 034-deferred-20180602 035-removed removed

comment:42 Changed 3 months ago by dgoulet

Points: 16

Points are now in the parent ticket.

comment:43 Changed 7 weeks ago by dgoulet

#31541 is probably a big start to fixing this.

I'm still investigating the circuit timeout issues so there might be more here.
