Opened 7 years ago

Closed 5 years ago

#6799 closed defect (fixed)

Don't expire unused relay-to-relay TLS conns so quickly

Reported by: arma
Owned by:
Priority: High
Milestone: Tor: 0.2.5.x-final
Component: Core Tor/Tor
Version:
Severity:
Keywords: tor-relay anonymity-attack 025-triaged 024-backport andrea-review-0255
Cc: rpw
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

In git commit 67b38d506 we changed conn timeouts so relays would close idle conns (that is, conns without any circs on them) after 3 minutes rather than 15 minutes.

We made the change because of the "clients holding their dir-fetching conns open for a long time, filling up descriptor lists and memory and knocking over relays" fun:
http://archives.seul.org/tor/relays/Apr-2010/msg00073.html

But it would appear that we made the problem that Torscan exploits worse. We don't need to be so aggressive about closing connections to/from other relays (besides, those connections weren't the problem before).

There's a downside here, which is that we end up using more file descriptors on relays. But if most links are used already, we don't use many more. And if most links aren't used already, the Torscan problems are worse.

Child Tickets

Ticket    Status    Owner    Summary                                                                          Component
#12023    closed             Should we update timestamp_last_added_nonpadding when we send a destroy cell?    Core Tor/Tor

Change History (29)

comment:1 Changed 7 years ago by arma

Milestone: Tor: unspecified
Status: new → needs_information

Ralf, can you opine on whether raising the "close idle conn" timeout would help, and what values would help how much?

comment:2 Changed 7 years ago by arma

Our friendly irc person (who has alas disappeared -- hopefully he will have a nice vacation and return refreshed) suggested that a simple hack would be to not close it so aggressively if it's a canonical conn.

comment:3 Changed 7 years ago by nickm

That would seem to me like a fine approach to take.

comment:4 Changed 7 years ago by nickm

Keywords: tor-relay added

comment:5 Changed 7 years ago by nickm

Component: Tor Relay → Tor

comment:6 Changed 6 years ago by arma

Keywords: anonymity-attack added

To protect against netflow logs, it would also be wise to randomize the timeout, so the time we close the tls conn doesn't tell us exactly when we closed the last circuit.

comment:7 Changed 6 years ago by nickm

Milestone: Tor: unspecified → Tor: 0.2.5.x-final
Priority: normal → major

If this would really help anonymity, it seems easy enough to do in 0.2.5. How about a 15 minutes + uniform_random(10 minutes) timeout for canonical connections?
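A minimal sketch of that suggestion, using Tor's crypto_rand_int() helper (which returns a uniform value in [0, max)); the macro names here are illustrative, not taken from any branch:

  /* Illustrative sketch: a randomized idle timeout of 15 minutes plus a
   * uniform extra 0-10 minutes for canonical connections.  The macro
   * names are hypothetical. */
  #define IDLE_TIMEOUT_BASE_SEC (15*60)  /* 15 minutes */
  #define IDLE_TIMEOUT_RAND_SEC (10*60)  /* up to 10 extra minutes */

  or_conn->idle_timeout = IDLE_TIMEOUT_BASE_SEC +
                          crypto_rand_int(IDLE_TIMEOUT_RAND_SEC);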

comment:8 Changed 6 years ago by nickm

Keywords: 025-triaged added
Status: needs_information → new

comment:9 Changed 6 years ago by nickm

Keywords: 024-backport added
Status: new → needs_review

Simple fix candidate in branch "ticket6799_024".

comment:10 in reply to:  9 Changed 6 years ago by cypherpunks

> Simple fix candidate in branch "ticket6799_024".

With this branch, or_conn->idle_timeout stays zero for non-canonical OR connections.

  if (bool_eq(is_canonical, or_conn->is_canonical))
    return;

It skips computing or_conn->idle_timeout for non-canonical OR connections.
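For reference, the shape of the fix is to keep computing the timeout whenever none has been set yet, even if the canonical flag is unchanged. A sketch only, using the connection_or_set_canonical() helper the branch touches and Tor's crypto_rand_int(); the constant names are illustrative and the exact branch code may differ:

  void
  connection_or_set_canonical(or_connection_t *or_conn, int is_canonical)
  {
    const unsigned int timeout_base = is_canonical ?
      IDLE_OR_CONN_TIMEOUT_CANONICAL : IDLE_OR_CONN_TIMEOUT_NONCANONICAL;

    if (bool_eq(is_canonical, or_conn->is_canonical) &&
        or_conn->idle_timeout != 0) {
      /* Flag unchanged and a timeout is already set: nothing to do. */
      return;
    }

    or_conn->is_canonical = !!is_canonical;
    /* Randomize, so the close time doesn't reveal exact circuit timing. */
    or_conn->idle_timeout = timeout_base + crypto_rand_int(timeout_base/2);
  }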

comment:11 Changed 6 years ago by nickm

Thanks, good catch. Should be fixed now.

comment:12 Changed 5 years ago by andrea

Keywords: andrea-review-0255 added

comment:13 Changed 5 years ago by nickm

still looks plausible to me, but it's complex enough that it could use another review IMO.

comment:14 Changed 5 years ago by arma

Looks plausible to me too. I've never fully grokked our is_canonical stuff (for example, how often in practice do we end up with two conns, each of which has one side thinking it's canonical?), but it looks like we're setting it as intended here.

To be clear, if one side runs the new code and one side runs the old code, then the old-code side will still close the connections at the earlier schedule?

And even if both sides are upgraded, then it isn't actually uniformly distributed between 15 and 22.5 minutes, since it will close first for whichever side chose the lower number? (This is fine, I just want to make sure I'm understanding it.)

I'd feel more comfortable here if somebody bumped up the severity of the

     log_info(LD_OR,"Expiring non-used OR connection to fd %d (%s:%d) "

line and then ran a network for a while in chutney, to confirm that things act roughly as we expect them to. But failing that, hey, what could go wrong.
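A minimal sketch of that testing tweak, assuming the line lives in run_connection_housekeeping() in main.c: swap the severity so the expirations appear at the default log level. The argument list here is inferred from the output quoted in comment 16, and the local variable names are hypothetical:

  /* Temporary debugging tweak (sketch): log_info -> log_notice, so the
   * expirations show up without turning on info-level logging. */
  log_notice(LD_OR, "Expiring non-used OR connection to fd %d (%s:%d) "
             "[idle %d, timeout %d, canonical=%d].",
             (int)conn->s, conn->address, conn->port,
             idle_secs, timeout_secs, (int)or_conn->is_canonical);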

comment:15 Changed 5 years ago by nickm

> To be clear, if one side runs the new code and one side runs the old code, then the old-code side will still close the connections at the earlier schedule?

Yes.

> And even if both sides are upgraded, then it isn't actually uniformly distributed between 15 and 22.5 minutes, since it will close first for whichever side chose the lower number? (This is fine, I just want to make sure I'm understanding it.)

Hm. Yes.
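As a quick sanity check on that, assuming both sides draw independently and uniformly from [15, 22.5] minutes, the connection's lifetime is the minimum of the two draws:

  Pr[min(X,Y) > t] = ((22.5 - t) / 7.5)^2,   E[min(X,Y)] = 15 + 7.5/3 = 17.5 minutes

So the effective lifetime skews toward the low end of the range rather than being uniform.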

> I'd feel more comfortable here if somebody bumped up the severity of [...] and then ran a network for a while in chutney

The problem there is that we don't have a chutney user simulator for this stuff, so the network mostly isn't used except if you're manually curl-ing over it. So I can test that, but it will only be a bit useful. Still, it could let us know if we have any truly egregious problems I guess.

comment:16 Changed 5 years ago by nickm

So, randomization and idle timeouts seem to be correctly handled. And non-canonical connections are getting closed by their idle timeouts:

net/nodes/004r/notice.log:May 15 17:06:24.000 [notice] Expiring non-used OR connection 0x7f103c1e6060 to fd 21 (127.0.0.1:49256) [idle 239, timeout 239, canonical=0].
net/nodes/004r/notice.log:May 15 17:28:17.000 [notice] Expiring non-used OR connection 0x7f103c1bb860 to fd 21 (127.0.0.1:49655) [idle 252, timeout 252, canonical=0].
net/nodes/004r/notice.log:May 15 17:40:21.000 [notice] Expiring non-used OR connection 0x7f103c259270 to fd 21 (127.0.0.1:49904) [idle 244, timeout 244, canonical=0].
net/nodes/006r/notice.log:May 15 17:14:04.000 [notice] Expiring non-used OR connection 0x7f018ab040c0 to fd 24 (127.0.0.1:51470) [idle 205, timeout 205, canonical=0].

And likewise for canonical connections on relays:

net/nodes/002a/notice.log:May 15 17:18:36.000 [notice] Expiring non-used OR connection 0x7fe7b325a590 to fd 24 (127.0.0.1:5003) [idle 936, timeout 936, canonical=1].
net/nodes/003r/notice.log:May 15 16:31:03.000 [notice] Expiring non-used OR connection 0x7f5bc23cceb0 to fd 30 (127.0.0.1:5002) [idle 900, timeout 900, canonical=1].
net/nodes/003r/notice.log:May 15 17:47:16.000 [notice] Expiring non-used OR connection 0x7f5bc23d95d0 to fd 24 (127.0.0.1:5002) [idle 1249, timeout 1249, canonical=1].
net/nodes/007r/notice.log:May 15 17:12:31.000 [notice] Expiring non-used OR connection 0x7fda748dc6a0 to fd 22 (127.0.0.1:5001) [idle 1100, timeout 1100, canonical=1].

But for canonical connections on clients, the idle timeouts seem to be somewhat irrelevant; something else in run_connection_housekeeping is keeping them alive for a long time past their idle timeouts:

net/nodes/009c/notice.log:May 15 15:31:19.000 [notice] Expiring non-used OR connection 0x7f9ec576e3e0 to fd 16 (127.0.0.1:5004) [idle 4556, timeout 955, canonical=1].
net/nodes/009c/notice.log:May 15 16:47:47.000 [notice] Expiring non-used OR connection 0x7f9ec5793180 to fd 5 (127.0.0.1:5002) [idle 3662, timeout 1273, canonical=1].
net/nodes/009c/notice.log:May 15 16:50:53.000 [notice] Expiring non-used OR connection 0x7f9ec5772bb0 to fd 13 (127.0.0.1:5006) [idle 7409, timeout 907, canonical=1].
net/nodes/009c/notice.log:May 15 16:53:59.000 [notice] Expiring non-used OR connection 0x7f9ec578b8b0 to fd 14 (127.0.0.1:5004) [idle 3677, timeout 1274, canonical=1].

Will investigate further, unless someone happens to know just what the issue is.

comment:17 Changed 5 years ago by nickm

Here's what happens right before one of those unbalanced client-side canonical channel expirations:

May 15 22:54:07.000 [debug] circuit_expire_old_circuits_clientside(): Closing circuit that has been unused for 3626990 msec.
May 15 22:54:07.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f0846e25ff0 for circ_id 2927002286, channel ID 19 (0x7f0846e65b20)
May 15 22:54:07.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f0846e25ff0 for circ_id 2927002286, channel ID 19 (0x7f0846e65b20)
May 15 22:54:07.000 [debug] circuitmux_append_destroy_cell(): Cmux at 0x7f0846e4b2b0 queued a destroy for circ 2927002286, cmux counter is now 1, global counter is now 1
May 15 22:54:07.000 [debug] circuitmux_append_destroy_cell(): Primed a buffer.
May 15 22:54:07.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f0846e65d38 to channel 0x7f0846e65b20 with global ID 19
May 15 22:54:07.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f0846e25ff0 for circ_id 2927002286, channel ID 19 (0x7f0846e65b20)
May 15 22:54:07.000 [debug] circuitmux_notify_xmit_destroy(): Cmux at 0x7f0846e4b2b0 sent a destroy, cmux counter is now 0, global counter is now 0
May 15 22:54:07.000 [debug] channel_send_destroy(): Sending destroy (circID 2927002286) on channel 0x7f0846e65b20 (global ID 19)
May 15 22:54:07.000 [notice] Expiring non-used OR connection 0x7f0846e28060 to fd 5 (127.0.0.1:5004) [idle 3627, timeout 1274, canonical=1].

Is it possible that when we fixed #7912, we didn't make the client's destroy cell update timestamp_last_added_nonpadding? I'm not sure this is actually a bad thing. Opened #12023 for that issue, though.

comment:18 Changed 5 years ago by nickm

There's an additional piece of trouble here: When a client kills the last circuit on a connection that has been idle for a while, the relay will still probably want to close the connection immediately, since the relay hasn't sent anything for quite a long time. Maybe that's another ticket?

comment:19 Changed 5 years ago by nickm

Since we only check timestamp_last_added_nonpadding to see if a circuit is idle when there are no circuits on the connection, maybe it would be smarter to see how long a connection has had no circuits on it, and close it a random time after it has had no circuits.
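A minimal sketch of that idea, assuming a timestamp_last_had_circuits field on the channel (which is what the later ticket6799_024_v2 branch adds); the helper names are from Tor's connection_or.c, but the exact code may differ:

  /* Sketch: in run_connection_housekeeping(), close an OR connection a
   * randomized interval after it last carried any circuits, instead of
   * keying off the last non-padding cell.  Assumes chan is this
   * connection's channel_t, and that the channel updates
   * timestamp_last_had_circuits when its circuit count drops to zero. */
  if (!connection_or_get_num_circuits(or_conn) &&
      now >= chan->timestamp_last_had_circuits + or_conn->idle_timeout) {
    log_info(LD_OR, "Expiring non-used OR connection to fd %d (%s:%d).",
             (int)conn->s, conn->address, conn->port);
    connection_or_close_normally(or_conn, 0);
  }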

comment:20 Changed 5 years ago by nickm

I have a branch with a fix for the above issues as "ticket6799_024_v2". It has one additional commit that the earlier branch didn't. It seems pretty straightforward to me, but review is of course needed. Testing it now.

comment:21 Changed 5 years ago by nickm

One wrinkle here is that, with ticket6799_024_v2, client->relay connections with no circuits will always be closed by the relay, not the client, since the relay will see a non-canonical connection and the client will see a canonical connection. I don't think this hurts anything, since neither party is closing "immediately after a DESTROY" as they did before (See #12023).

comment:22 Changed 5 years ago by cypherpunks

  if (or_conn->chan) {

Why is this check needed? If it's NULL, the code will crash later anyway:

  if (channel_is_bad_for_new_circs(TLS_CHAN_TO_BASE(or_conn->chan)) &&

And this assignment:

  timestamp_last_had_circuits = chan->timestamp_last_had_circuits;

happens twice. Why?

  time_t timestamp_last_had_circuits = now;

Why is this initialized to now rather than zero? Does it need an initializer at all?

comment:23 Changed 5 years ago by nickm

better now? (I had previously thought chan might be null because TLS_CHAN_TO_BASE tolerates NULL. But channel_is_bad_for_new_circs doesn't tolerate NULL, so chan must be set.)

Testing now with latest branch.

comment:24 in reply to:  19 Changed 5 years ago by arma

Replying to nickm:

> maybe it would be smarter to see how long a connection has had no circuits on it, and close it a random time after it has had no circuits.

I think that's what we thought we did. So, I agree. :)

comment:25 Changed 5 years ago by andrea

Code review for the ticket6799_024_v2 branch:

051d599b4adba70312e23148f5e208075b673bae:

  • I think this is the case, but just to double-check: is the only visible behavior which depends on or_conn->idle_timeout closing the connection? This defense depends on there not being any way for the attacker to learn the randomized timeout.
  • All the code here looks correct to me modulo the bug fixed in add7c7c50c2ba7357d1bf22132f8b9985060f4b0

add7c7c50c2ba7357d1bf22132f8b9985060f4b0:

  • This looks fine to me.

b9919b7bae75f831d31ae5d3d11bb0b721bb9aab:

  • I think this patch is correct, but is this another case where things might break if the clock jumps, and where we should use CLOCK_MONOTONIC if available (see the sketch at the end of this review)? That may be a bug in the old code too.
  • We run run_connection_housekeeping() once a second, so this has the same granularity as the randomized timeouts and doesn't reduce the effective entropy, but that's a bit by coincidence. Perhaps a comment to that effect somewhere?

fbc964b41b0bff4e55e90a1245dc65744abbebc2:

  • Isn't have_any_circuits = 0; redundant, since it's initialized to zero?
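
For reference, a minimal sketch of the jump-proof alternative mentioned above, assuming POSIX clock_gettime() with CLOCK_MONOTONIC is available (Tor's own monotonic-time wrappers arrived later, as part of the #10168 work):

  #include <time.h>

  /* Sketch: read a clock that keeps advancing steadily even if the
   * system time is stepped, so idle intervals can't be distorted. */
  static time_t
  monotonic_seconds(void)
  {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec;
  }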

comment:26 in reply to:  25 Changed 5 years ago by nickm

Replying to andrea:

> Code review for the ticket6799_024_v2 branch:

> 051d599b4adba70312e23148f5e208075b673bae:

>   • I think this is the case, but just to double-check: is the only visible behavior which depends on or_conn->idle_timeout closing the connection? This defense depends on there not being any way for the attacker to learn the randomized timeout.

I believe that's the case; the only things that look at it (according to git grep idle_timeout in ticket6799_024_v2) are the timeout check in main.c which closes the connection, and the initialization check in connection_or which checks whether it's already set before setting it again.

> b9919b7bae75f831d31ae5d3d11bb0b721bb9aab:
>
>   • I think this patch is correct, but is this another case where things might break if the clock jumps and we should use CLOCK_MONOTONIC if available? That may be a bug in the old code too.

That's right. We have a ticket in 0.2.6 to do CLOCK_MONOTONIC migration in #10168. In 0.2.4, we don't have tor_gettimeofday_cached_monotonic, so this patch doesn't use it.

I don't think that our failure to use monotonic time here actually makes stuff worse than it was without this patch?

>   • We run run_connection_housekeeping() once a second, so this has the same granularity as the randomized timeouts and doesn't reduce the effective entropy, but that's a bit by coincidence. Perhaps a comment to that effect somewhere?

I'm not sure what that comment should say, which probably means that we need one. I'd be happy to apply any comment patch you have there if you write one?

> fbc964b41b0bff4e55e90a1245dc65744abbebc2:
>
>   • Isn't have_any_circuits = 0; redundant, since it's initialized to zero?

yeah. I'll remove the initializer.

I just added a new fixup patch for that in ticket6799_024_v2, and squashed it all in ticket6799_024_v2_squashed. I'm going to review it all again, then merge.

comment:27 Changed 5 years ago by nickm

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.4.x-final

Looked okay. I tweaked the changelog a bit and merged it to master with 3a2e25969fbb7bf. (There were conflicts, so taking a quick look over that merge commit might not be a bad idea.)

Putting into 0.2.4 milestone for possible backport.

comment:28 Changed 5 years ago by nickm

I'm leery of an 0.2.4 backport here; even though a lot of the net is still on 0.2.4, this is a 325-line patch.

comment:29 Changed 5 years ago by nickm

Milestone: Tor: 0.2.4.x-final → Tor: 0.2.5.x-final
Resolution: fixed
Status: needs_review → closed

No 0.2.4 backport here.
