Opened 4 years ago

Last modified 17 months ago

#16844 new defect

Slow clients can't bootstrap because they expire their consensus fetch but then receive all the bytes from it anyway, making them expire their next fetch, putting them in a terrible loop

Reported by: arma
Owned by:
Priority: High
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: tor-client, low-bandwidth, sponsor4, bootstrap, 032-unreached
Cc: yawning, catalyst
Actual Points:
Parent ID:
Points: 2
Reviewer:
Sponsor: Sponsor4

Description

If you start your Tor client with no cached directory info on a slow (high-latency) link, you get:

$ tail -f tordebug-log|grep connected
Aug 09 16:17:12.299 [info] circuit_handle_first_hop(): Next router is $7EA6EAD6FD83083C538F44038BBFA077587DD755~7EA6EAD6FD83083C538 at 194.109.206.212: Not connected. Connecting.
Aug 09 16:17:12.826 [info] connection_edge_process_relay_cell_not_open(): 'connected' received for circid 2747423797 streamid 16685 after 0 seconds.
Aug 09 16:21:57.298 [info] circuit_handle_first_hop(): Next router is $9695DFC35FFEB861329B9F1AB04C46397020CE31~9695DFC35FFEB861329 at 128.31.0.39: Not connected. Connecting.
Aug 09 16:21:59.099 [info] connection_edge_process_relay_cell_not_open(): 'connected' received for circid 4248917890 streamid 42612 after 1 seconds.
Aug 09 16:22:09.711 [info] circuit_handle_first_hop(): Next router is $332CD489177F202570A7021328A17A91BF823889~332CD489177F202570A at 192.150.94.49: Not connected. Connecting.
Aug 09 16:22:09.711 [info] circuit_handle_first_hop(): Next router is $90E9E44FD74B98F87F7573F917AE4AF651B86B4C~90E9E44FD74B98F87F7 at 5.102.146.106: Not connected. Connecting.
Aug 09 16:22:09.712 [info] circuit_handle_first_hop(): Next router is $547C1CDB516798EC66A01F04A5884DCE1A151919~547C1CDB516798EC66A at 87.72.85.217: Not connected. Connecting.
Aug 09 16:22:12.499 [info] connection_edge_process_relay_cell_not_open(): 'connected' received for circid 2850575558 streamid 43745 after 0 seconds.
Aug 09 16:23:33.901 [info] connection_edge_process_relay_cell(): 'connected' received on circid 2850575558 for streamid 43746, no conn attached anymore. Ignoring.
Aug 09 16:24:11.599 [info] connection_edge_process_relay_cell_not_open(): 'connected' received for circid 4148503990 streamid 17036 after 0 seconds.
Aug 09 16:25:34.306 [info] connection_edge_process_relay_cell(): 'connected' received on circid 4148503990 for streamid 17037, no conn attached anymore. Ignoring.
Aug 09 16:26:29.559 [info] connection_edge_process_relay_cell_not_open(): 'connected' received for circid 2948078868 streamid 42748 after 0 seconds.
[...]

Oh hey, what's this "no conn attached anymore" issue?

$ grep 43746 tordebug-log
Aug 09 16:22:12.302 [info] connection_ap_handshake_send_begin(): Sending relay cell 1 on circ 2850575558 to begin stream 43746.
Aug 09 16:22:22.299 [debug] circuit_detach_stream(): Removing stream 43746 from circ 2850575558
Aug 09 16:23:33.901 [debug] connection_edge_process_relay_cell(): Now seen 1433 relay cells here (command 4, stream 43746).
Aug 09 16:23:33.901 [info] connection_edge_process_relay_cell(): 'connected' received on circid 2850575558 for streamid 43746, no conn attached anymore. Ignoring.
Aug 09 16:23:35.799 [debug] connection_edge_process_relay_cell(): Now seen 1434 relay cells here (command 2, stream 43746).
Aug 09 16:23:35.799 [info] connection_edge_process_relay_cell(): data cell dropped, unknown stream (streamid 43746).
[...]

We're hitting the 10-second stream timeout in connection_ap_expire_beginning(), and detaching the stream from the circuit, and presumably trying it again elsewhere.
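
For orientation, the shape of that client-side check is roughly the following. This is a minimal illustrative C sketch with made-up type, field, and constant names; it is not the actual code behind connection_ap_expire_beginning():

#include <time.h>

/* Hypothetical, simplified stream bookkeeping for illustration only. */
typedef enum { SKETCH_STREAM_CONNECT_WAIT, SKETCH_STREAM_OPEN } sketch_stream_state_t;

typedef struct sketch_stream_t {
  sketch_stream_state_t state;
  time_t begin_sent_at;          /* when the BEGIN/BEGIN_DIR cell was sent */
} sketch_stream_t;

#define SKETCH_BEGIN_TIMEOUT 10  /* seconds: the timeout discussed here */

/* Returns 1 if the client should give up on this stream, i.e. detach it from
 * its circuit and retry it elsewhere.  Because of optimistic data, the reply
 * to the abandoned request may already be in flight, arrive anyway, and then
 * be dropped as belonging to an unknown stream. */
static int
sketch_stream_has_expired(const sketch_stream_t *stream, time_t now)
{
  return stream->state == SKETCH_STREAM_CONNECT_WAIT &&
         now - stream->begin_sent_at >= SKETCH_BEGIN_TIMEOUT;
}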

But that last line above, "data cell dropped", is especially bad news for us -- we get the whole answer, and ignore it all, since we sent an 'end' cell changing our mind but the answer was already coming at us.

This situation comes up because of the optimistic data feature -- we get the answer to our request bundled with the 'connected' cell, which is a feature except in the case where we canceled (and then forgot about) the stream.

For people trying to bootstrap their Tor on a low-bandwidth high-latency network connection, I bet this landmine will be especially frustrating, since you will be clogging your network connection with directory information that you will discard, which in turn will delay the receipt of the other directory information.

You can reproduce the bug on your own, even if you have a great network connection, by starting your Tor with "bandwidthrate 2000 bandwidthburst 5000".
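
Concretely, a throwaway torrc along these lines should do it. The two bandwidth lines are exactly the options quoted above (values are in bytes per second when no unit is given); the Log and DataDirectory lines are optional extras for producing an info-level log like the excerpts in this ticket, not part of the reproduction itself:

# Throttle the client hard so consensus fetches hit the 10-second stream timeout.
BandwidthRate 2000
BandwidthBurst 5000
# Optional: write an info-level log to grep, as in the excerpts above.
Log info file ./tordebug-log
DataDirectory ./slow-client-datadir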

Child Tickets

#23116 (new, Core Tor/Tor): tor stops responding to Ctrl-C and circuits while in infinite descriptor download loop

Change History (33)

comment:1 Changed 4 years ago by arma

Branch arma/bug16844-logs is the patch that I used for tracking down the issue (and generating the above log lines).

Assuming it passes review, I think we should merge it. (But merging it will not resolve this ticket.)

comment:2 Changed 3 years ago by teor

Milestone: Tor: 0.2.???

comment:3 Changed 3 years ago by arma

Severity: Normal

This is a super important ticket. It is likely a big reason why people using Tor on low-bandwidth connections (for example, in the places where especially at-risk users live) give up on bootstrapping: it takes unbearably long or fails entirely.

So it is sad that it has been left to rot in milestone "???". :)

comment:4 Changed 3 years ago by yawning

Cc: yawning added

comment:5 Changed 3 years ago by teor

Milestone: Tor: 0.2.??? → Tor: 0.2.8.x-final

I wonder if #4483 will make this worse.

Possibly, although we limit the maximum number of connections to avoid flooding either the client connection, or the directory authorities.

It's a balancing act between increasing bootstrap reliability by trying multiple servers, and at the same time keeping the total amount of data transferred low.

comment:6 Changed 3 years ago by nickm

Priority: Medium → High

comment:7 Changed 3 years ago by teor

We could allow only 2 simultaneous directory connections by setting TestingClientBootstrapConsensusMaxInProgressTries to 2. But that would reduce bootstrap reliability on connections where some relays are blackholed.

We can also tune the connection schedules introduced in #4483 to make sure we don't try too many connections at once.

But I think the most effective change would be to increase the 10 second timeout in connection_ap_expire_beginning() to 20 or 30 seconds, and see how that goes.
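
If someone wants to experiment with the first of these knobs, a hedged torrc sketch follows; note that this is a Testing* option, so depending on the Tor version it may only be changeable together with TestingTorNetwork:

# Sketch: cap simultaneous bootstrap consensus fetch attempts at 2.
TestingClientBootstrapConsensusMaxInProgressTries 2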

comment:8 Changed 3 years ago by nickm

Milestone: Tor: 0.2.8.x-final → Tor: 0.2.9.x-final

Throw most 0.2.8 "NEW" tickets into 0.2.9. I expect that many of them will subsequently get triaged out.

comment:9 Changed 3 years ago by arma

How about if we disable the stream timeouts for begindir requests?

Originally the stream timeouts were there to handle broken exit relays that e.g. don't have enough sockets, have broken dns resolvers, are censored from reaching the destination, etc. The goal was to dump the whole circuit, and try a new one (probably with a new exit). But for begindir requests, there should be a lot less variance between relays.
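
A hedged sketch of what such an exemption could look like on the client side; the names here are hypothetical, and this is not a patch against connection_ap_expire_beginning(), just the decision in isolation:

/* Illustrative only.  is_begindir stands in for "this stream is a BEGIN_DIR
 * directory request"; returns 1 if the voluntary client-side timeout should
 * still apply to the stream. */
static int
sketch_should_apply_stream_timeout(int is_begindir)
{
  if (is_begindir)
    return 0;  /* directory fetches: let the staggered fetch schedule retry */
  return 1;    /* exit streams: keep the existing timeout-and-retry behaviour */
}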

comment:10 in reply to:  1 ; Changed 3 years ago by arma

Status: new → needs_review

Replying to arma:

Branch arma/bug16844-logs is the patch that I used for tracking down the issue (and generating the above log lines).

Assuming it passes review, I think we should merge it. (But merging it will not resolve this ticket.)

Putting ticket in needs-review so people notice that I have a branch that makes debugging easier. (It's six months old, so hopefully we didn't change that much between then and now. Let me know if it's broken and I'll help clean it up.)

comment:11 in reply to:  9 ; Changed 3 years ago by yawning

Replying to arma:

How about if we disable the stream timeouts for begindir requests?

Originally the stream timeouts were there to handle broken exit relays that e.g. don't have enough sockets, have broken dns resolvers, are censored from reaching the destination, etc. The goal was to dump the whole circuit, and try a new one (probably with a new exit). But for begindir requests, there should be a lot less variance between relays.

Sockets/file descriptors on the other end are a finite resource and I don't feel particularly warm/fuzzy about connections that linger around for a long time (think Slowloris), especially given that we are shifting to making every Guard able to serve directory information, and there are probably Guards that are running on resource constrained hardware.

Relaxing the timeout may be ok, removing it entirely feels like a bad idea.

comment:12 Changed 3 years ago by nickm

Points: small/medium

comment:13 in reply to:  11 Changed 3 years ago by arma

Replying to yawning:

Replying to arma:

How about if we disable the stream timeouts for begindir requests?

Sockets/file descriptors on the other end are a finite resource and I don't feel particularly warm/fuzzy about connections that linger around for a long time (think Slowloris), especially given that we are shifting to making every Guard able to serve directory information, and there are probably Guards that are running on resource constrained hardware.

Relaxing the timeout may be ok, removing it entirely feels like a bad idea.

I'm talking about the voluntary timeout on the *client* side here. I think the server side needs its own defenses against people who choose not to time out, but that's different.

comment:14 Changed 3 years ago by teor

This timeout was useful before we implemented #4483, because we would make one connection, wait for it to timeout, try a second connection, then wait for it to timeout, and then we'd wait a while before trying again.

Now that we connect to multiple servers simultaneously, we should be able to relax the timeout to 30 seconds or something.
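
For intuition about why overlapping attempts change the math: with the old serial behaviour, each extra slow server added a full timeout to the wait before the next attempt could even start, whereas staggered launches overlap the waiting. A toy C sketch with made-up launch offsets (not the actual #4483 schedules):

#include <stdio.h>

int
main(void)
{
  const int timeout_s = 10;                  /* old per-attempt timeout */
  const int staggered_offsets[] = {0, 1, 5}; /* hypothetical launch offsets */
  const int attempts = 3;

  for (int i = 0; i < attempts; i++) {
    int serial_start = i * timeout_s;        /* must wait out each timeout in turn */
    printf("attempt %d: serial start t+%2d s, staggered start t+%2d s\n",
           i + 1, serial_start, staggered_offsets[i]);
  }
  return 0;
}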

Last edited 3 years ago by arma

comment:15 in reply to:  14 ; Changed 3 years ago by arma

Replying to teor:

Now that we connect to multiple servers simultaneously, we should be able to relax the timeout to 30 seconds or something.

I think we should not retry begindir requests at all. We now have a mechanism for trying a series of directory servers, in a staggered way. Let's not layer two confusing mechanisms like that on top of each other. The original one that we had is broken. Let's stop using it.

comment:16 in reply to:  15 Changed 3 years ago by teor

Replying to arma:

Replying to teor:

Now that we connect to multiple servers simultaneously, we should be able to relax the timeout to 30 seconds or something.

I think we should not retry begindir requests at all. We now have a mechanism for trying a series of directory servers, in a staggered way. Let's not layer two confusing mechanisms like that on top of each other. The original one that we had is broken. Let's stop using it.

Oh, I think I have half-killed this timeout during bootstrap already. The bootstrap code doesn't retry after the 10 second timeout. It just tries another server whenever the schedule comes up. But what I do need to do is remove the close-on-timeout behaviour.

I left the non-bootstrap code using the old mechanism which times out after 10 seconds and retries. That's probably ok when a client has a live consensus and can afford to try for a few hours to get one. But I'd be happy to change that in 0.2.9 as well.

comment:17 in reply to:  10 Changed 3 years ago by arma

Replying to arma:

Replying to arma:

Branch arma/bug16844-logs is the patch that I used for tracking down the issue (and generating the above log lines).

Assuming it passes review, I think we should merge it. (But merging it will not resolve this ticket.)

Putting ticket in needs-review so people notice that I have a branch that makes debugging easier. (It's six months old, so hopefully we didn't change that much between then and now. Let me know if it's broken and I'll help clean it up.)

Actually, it looks like this branch got merged, by Nick, back in August?

If somebody can confirm, we should put this ticket back in state 'new'. Woo!

comment:18 Changed 3 years ago by nickm

Keywords: TorCoreTeam201604 added

Every postponed needs_review ticket should get a review in April

comment:19 Changed 3 years ago by nickm

Status: needs_review → new

[501]$ git describe --contains arma/bug16844-logs 
tor-0.2.7.3-rc~96^2
Last edited 21 months ago by arma

comment:20 Changed 3 years ago by nickm

Keywords: TorCoreTeam201604 removed

comment:21 Changed 3 years ago by arma

Keywords: low-bandwidth added

comment:22 Changed 3 years ago by nickm

Points: small/medium → 2

small/medium => 2.

comment:23 Changed 3 years ago by isabela

Keywords: isaremoved added
Milestone: Tor: 0.2.9.x-final → Tor: 0.2.???

comment:24 Changed 2 years ago by teor

Milestone: Tor: 0.2.??? → Tor: 0.3.???

Milestone renamed

comment:25 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.??? → Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:26 Changed 2 years ago by nickm

Keywords: sponsor4 added

Bulk-adding "sponsor4" keyword to items that would appear to reduce low-bw clients' directory bandwidth usage. But we shouldn't build these without measurement/proposals: see #21205 and #21209.

comment:27 Changed 2 years ago by nickm

Sponsor: Sponsor4

Setting "sponsor4" sponsor on all tickets that have the sponsor4 keyword, but have no sponsor set.

comment:28 Changed 21 months ago by nickm

Remove Sponsor4 keyword, now that Sponsor4 is the value of the Sponsor field.

comment:29 Changed 21 months ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:30 Changed 21 months ago by nickm

Keywords: isaremoved removed

comment:31 Changed 20 months ago by arma

Milestone: Tor: unspecified → Tor: 0.3.2.x-final
Summary: Duplicate directory fetches when on high-latency connection → Slow clients can't bootstrap because they expire their consensus fetch but then receive all the bytes from it anyway, making them expire their next fetch, putting them in a terrible loop

Nick asked me to give this one a more dire title.

comment:32 Changed 20 months ago by catalyst

Cc: catalyst added
Keywords: bootstrap added

comment:33 Changed 17 months ago by nickm

Keywords: 032-unreached added
Milestone: Tor: 0.3.2.x-final → Tor: unspecified

Mark a large number of tickets that I do not think we will do for 0.3.2.
