Opened 7 years ago

Closed 5 years ago

#6799 closed defect (fixed)

Don't expire unused relay-to-relay TLS conns so quickly

Reported by: arma
Owned by:
Priority: High
Milestone: Tor: 0.2.5.x-final
Component: Core Tor/Tor
Version:
Severity:
Keywords: tor-relay anonymity-attack 025-triaged 024-backport andrea-review-0255
Cc: rpw
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

In git commit 67b38d506 we changed conn timeouts so relays would close idle conns (that is, conns without any circs on them) after 3 minutes rather than 15 minutes.

We made the change because of the "clients holding their dir-fetching conns open for a long time, filling up descriptor lists and memory and knocking over relays" fun:
http://archives.seul.org/tor/relays/Apr-2010/msg00073.html

But it would appear that we made the problem that Torscan exploits worse. We don't need to be so aggressive about closing connections to/from other relays (besides, those connections weren't the problem before).

There's a downside here, which is that we end up using more file descriptors on relays. But if most links are used already, we don't use many more. And if most links aren't used already, the Torscan problems are worse.

Child Tickets

Ticket    Status    Owner    Summary                                                                          Component
#12023    closed             Should we update timestamp_last_added_nonpadding when we send a destroy cell?    Core Tor/Tor

Change History (29)

comment:1 Changed 7 years ago by arma

Milestone: Tor: unspecified
Status: new → needs_information

Ralf, can you opine on whether raising the "close idle conn" timeout would help, and what values would help how much?

comment:2 Changed 7 years ago by arma

Our friendly irc person (who has alas disappeared -- hopefully he will have a nice vacation and return refreshed) suggested that a simple hack would be to not close it so aggressively if it's a canonical conn.

comment:3 Changed 7 years ago by nickm

That would seem to me like a fine approach to take.

comment:4 Changed 7 years ago by nickm

Keywords: tor-relay added

comment:5 Changed 7 years ago by nickm

Component: Tor Relay → Tor

comment:6 Changed 6 years ago by arma

Keywords: anonymity-attack added

To protect against netflow logs, it would also be wise to randomize the timeout, so the time we close the tls conn doesn't tell us exactly when we closed the last circuit.

comment:7 Changed 6 years ago by nickm

Milestone: Tor: unspecified → Tor: 0.2.5.x-final
Priority: normal → major

If this would really help anonymity, it seems easy enough to do in 0.2.5. How about a 15 minutes + uniform_random(10 minutes) timeout for canonical connections?
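A minimal sketch of that suggestion, using Tor's crypto_rand_int() helper (which returns a uniform value in [0, max)); the macro names here are illustrative, not taken from any branch:

  /* Illustrative sketch: a randomized idle timeout of 15 minutes plus a
   * uniform extra 0-10 minutes for canonical connections.  The macro
   * names are hypothetical. */
  #define IDLE_TIMEOUT_BASE_SEC (15*60)  /* 15 minutes */
  #define IDLE_TIMEOUT_RAND_SEC (10*60)  /* up to 10 extra minutes */

  or_conn->idle_timeout = IDLE_TIMEOUT_BASE_SEC +
                          crypto_rand_int(IDLE_TIMEOUT_RAND_SEC);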

comment:8 Changed 6 years ago by nickm

Keywords: 025-triaged added
Status: needs_information → new

comment:9 Changed 6 years ago by nickm

Keywords: 024-backport added
Status: new → needs_review

Simple fix candidate in branch "ticket6799_024".

comment:10 in reply to:  9 Changed 6 years ago by cypherpunks

> Simple fix candidate in branch "ticket6799_024".

With this branch, or_conn->idle_timeout stays zero for non-canonical OR connections.

  if (bool_eq(is_canonical, or_conn->is_canonical))
    return;

It skips computing or_conn->idle_timeout for non-canonical OR connections.
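For reference, the shape of the fix is to keep computing the timeout whenever none has been set yet, even if the canonical flag is unchanged. A sketch only, using the connection_or_set_canonical() helper the branch touches and Tor's crypto_rand_int(); the constant names are illustrative and the exact branch code may differ:

  void
  connection_or_set_canonical(or_connection_t *or_conn, int is_canonical)
  {
    const unsigned int timeout_base = is_canonical ?
      IDLE_OR_CONN_TIMEOUT_CANONICAL : IDLE_OR_CONN_TIMEOUT_NONCANONICAL;

    if (bool_eq(is_canonical, or_conn->is_canonical) &&
        or_conn->idle_timeout != 0) {
      /* Flag unchanged and a timeout is already set: nothing to do. */
      return;
    }

    or_conn->is_canonical = !!is_canonical;
    /* Randomize, so the close time doesn't reveal exact circuit timing. */
    or_conn->idle_timeout = timeout_base + crypto_rand_int(timeout_base/2);
  }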

comment:11 Changed 6 years ago by nickm

Thanks, good catch. Should be fixed now.

comment:12 Changed 5 years ago by andrea

Keywords: andrea-review-0255 added

comment:13 Changed 5 years ago by nickm

still looks plausible to me, but it's complex enough that it could use another review IMO.

comment:14 Changed 5 years ago by arma

Looks plausible to me too. I've never fully grokked our is_canonical stuff (for example, how often in practice do we end up with two conns, each of which has one side thinking it's canonical?), but it looks like we're setting it as intended here.

To be clear, if one side runs the new code and one side runs the old code, then the old-code side will still close the connections at the earlier schedule?

And even if both sides are upgraded, then it isn't actually uniformly distributed between 15 and 22.5 minutes, since it will close first for whichever side chose the lower number? (This is fine, I just want to make sure I'm understanding it.)

I'd feel more comfortable here if somebody bumped up the severity of the

     log_info(LD_OR,"Expiring non-used OR connection to fd %d (%s:%d) "

line and then ran a network for a while in chutney, to confirm that things act roughly as we expect them to. But failing that, hey, what could go wrong.
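A minimal sketch of that testing tweak, assuming the line lives in run_connection_housekeeping() in main.c: swap the severity so the expirations appear at the default log level. The argument list here is inferred from the output quoted in comment 16, and the local variable names are hypothetical:

  /* Temporary debugging tweak (sketch): log_info -> log_notice, so the
   * expirations show up without turning on info-level logging. */
  log_notice(LD_OR, "Expiring non-used OR connection to fd %d (%s:%d) "
             "[idle %d, timeout %d, canonical=%d].",
             (int)conn->s, conn->address, conn->port,
             idle_secs, timeout_secs, (int)or_conn->is_canonical);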

comment:15 Changed 5 years ago by nickm

> To be clear, if one side runs the new code and one side runs the old code, then the old-code side will still close the connections at the earlier schedule?

Yes.

> And even if both sides are upgraded, then it isn't actually uniformly distributed between 15 and 22.5 minutes, since it will close first for whichever side chose the lower number? (This is fine, I just want to make sure I'm understanding it.)

Hm. Yes.
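As a quick sanity check on that, assuming both sides draw independently and uniformly from [15, 22.5] minutes, the connection's lifetime is the minimum of the two draws:

  Pr[min(X,Y) > t] = ((22.5 - t) / 7.5)^2,   E[min(X,Y)] = 15 + 7.5/3 = 17.5 minutes

So the effective lifetime skews toward the low end of the range rather than being uniform.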

> I'd feel more comfortable here if somebody bumped up the severity of [...] and then ran a network for a while in chutney

The problem there is that we don't have a chutney user simulator for this stuff, so the network mostly isn't used except if you're manually curl-ing over it. So I can test that, but it will only be a bit useful. Still, it could let us know if we have any truly egregious problems I guess.

comment:16 Changed 5 years ago by nickm

So, randomization and idle timeouts seem to be correctly handled. And non-canonical connections are getting closed by their idle timeouts:

net/nodes/004r/notice.log:May 15 17:06:24.000 [notice] Expiring non-used OR connection 0x7f103c1e6060 to fd 21 (127.0.0.1:49256) [idle 239, timeout 239, canonical=0].
net/nodes/004r/notice.log:May 15 17:28:17.000 [notice] Expiring non-used OR connection 0x7f103c1bb860 to fd 21 (127.0.0.1:49655) [idle 252, timeout 252, canonical=0].
net/nodes/004r/notice.log:May 15 17:40:21.000 [notice] Expiring non-used OR connection 0x7f103c259270 to fd 21 (127.0.0.1:49904) [idle 244, timeout 244, canonical=0].
net/nodes/006r/notice.log:May 15 17:14:04.000 [notice] Expiring non-used OR connection 0x7f018ab040c0 to fd 24 (127.0.0.1:51470) [idle 205, timeout 205, canonical=0].

And likewise for canonical connections on relays:

net/nodes/002a/notice.log:May 15 17:18:36.000 [notice] Expiring non-used OR connection 0x7fe7b325a590 to fd 24 (127.0.0.1:5003) [idle 936, timeout 936, canonical=1].
net/nodes/003r/notice.log:May 15 16:31:03.000 [notice] Expiring non-used OR connection 0x7f5bc23cceb0 to fd 30 (127.0.0.1:5002) [idle 900, timeout 900, canonical=1].
net/nodes/003r/notice.log:May 15 17:47:16.000 [notice] Expiring non-used OR connection 0x7f5bc23d95d0 to fd 24 (127.0.0.1:5002) [idle 1249, timeout 1249, canonical=1].
net/nodes/007r/notice.log:May 15 17:12:31.000 [notice] Expiring non-used OR connection 0x7fda748dc6a0 to fd 22 (127.0.0.1:5001) [idle 1100, timeout 1100, canonical=1].

But for canonical connections on clients, the idle timeouts seem to be somewhat irrelevant; something else in run_connection_housekeeping is keeping them alive for a long time past their idle timeouts:

net/nodes/009c/notice.log:May 15 15:31:19.000 [notice] Expiring non-used OR connection 0x7f9ec576e3e0 to fd 16 (127.0.0.1:5004) [idle 4556, timeout 955, canonical=1].
net/nodes/009c/notice.log:May 15 16:47:47.000 [notice] Expiring non-used OR connection 0x7f9ec5793180 to fd 5 (127.0.0.1:5002) [idle 3662, timeout 1273, canonical=1].
net/nodes/009c/notice.log:May 15 16:50:53.000 [notice] Expiring non-used OR connection 0x7f9ec5772bb0 to fd 13 (127.0.0.1:5006) [idle 7409, timeout 907, canonical=1].
net/nodes/009c/notice.log:May 15 16:53:59.000 [notice] Expiring non-used OR connection 0x7f9ec578b8b0 to fd 14 (127.0.0.1:5004) [idle 3677, timeout 1274, canonical=1].

Will investigate further, unless someone happens to know just what the issue is.

comment:17 Changed 5 years ago by nickm

Here's what happens right before one of those unbalanced client-side canonical channel expirations:

May 15 22:54:07.000 [debug] circuit_expire_old_circuits_clientside(): Closing circuit that has been unused for 3626990 msec.
May 15 22:54:07.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f0846e25ff0 for circ_id 2927002286, channel ID 19 (0x7f0846e65b20)
May 15 22:54:07.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f0846e25ff0 for circ_id 2927002286, channel ID 19 (0x7f0846e65b20)
May 15 22:54:07.000 [debug] circuitmux_append_destroy_cell(): Cmux at 0x7f0846e4b2b0 queued a destroy for circ 2927002286, cmux counter is now 1, global counter is now 1
May 15 22:54:07.000 [debug] circuitmux_append_destroy_cell(): Primed a buffer.
May 15 22:54:07.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f0846e65d38 to channel 0x7f0846e65b20 with global ID 19
May 15 22:54:07.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f0846e25ff0 for circ_id 2927002286, channel ID 19 (0x7f0846e65b20)
May 15 22:54:07.000 [debug] circuitmux_notify_xmit_destroy(): Cmux at 0x7f0846e4b2b0 sent a destroy, cmux counter is now 0, global counter is now 0
May 15 22:54:07.000 [debug] channel_send_destroy(): Sending destroy (circID 2927002286) on channel 0x7f0846e65b20 (global ID 19)
May 15 22:54:07.000 [notice] Expiring non-used OR connection 0x7f0846e28060 to fd 5 (127.0.0.1:5004) [idle 3627, timeout 1274, canonical=1].

Is it possible that when we fixed #7912, we didn't make the client's destroy cell update timestamp_last_added_nonpadding? I'm not sure this is actually a bad thing. Opened #12023 for that issue, though.

comment:18 Changed 5 years ago by nickm

There's an additional piece of trouble here: When a client kills the last circuit on a connection that has been idle for a while, the relay will still probably want to close the connection immediately, since the relay hasn't sent anything for quite a long time. Maybe that's another ticket?

comment:19 Changed 5 years ago by nickm

Since we only check timestamp_last_added_nonpadding to see if a circuit is idle when there are no circuits on the connection, maybe it would be smarter to see how long a connection has had no circuits on it, and close it a random time after it has had no circuits.
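A minimal sketch of that idea, assuming a timestamp_last_had_circuits field on the channel (which is what the later ticket6799_024_v2 branch adds); the helper names are from Tor's connection_or.c, but the exact code may differ:

  /* Sketch: in run_connection_housekeeping(), close an OR connection a
   * randomized interval after it last carried any circuits, instead of
   * keying off the last non-padding cell.  Assumes chan is this
   * connection's channel_t, and that the channel updates
   * timestamp_last_had_circuits when its circuit count drops to zero. */
  if (!connection_or_get_num_circuits(or_conn) &&
      now >= chan->timestamp_last_had_circuits + or_conn->idle_timeout) {
    log_info(LD_OR, "Expiring non-used OR connection to fd %d (%s:%d).",
             (int)conn->s, conn->address, conn->port);
    connection_or_close_normally(or_conn, 0);
  }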

comment:20 Changed 5 years ago by nickm

I have a branch with a fix for the above issues as "ticket6799_024_v2". It has one additional commit that the earlier branch didn't. It seems pretty straightforward to me, but review is of course needed. Testing it now.

comment:21 Changed 5 years ago by nickm

One wrinkle here is that, with ticket6799_024_v2, client->relay connections with no circuits will always be closed by the relay, not the client, since the relay will see a non-canonical connection and the client will see a canonical connection. I don't think this hurts anything, since neither party is closing "immediately after a DESTROY" as they did before (See #12023).

comment:22 Changed 5 years ago by cypherpunks

  if (or_conn->chan) {

Why is this check needed? If it's NULL, the code will crash later anyway:

  if (channel_is_bad_for_new_circs(TLS_CHAN_TO_BASE(or_conn->chan)) &&

And this assignment:

  timestamp_last_had_circuits = chan->timestamp_last_had_circuits;

happens twice. Why?

  time_t timestamp_last_had_circuits = now;

Why is this initialized to now rather than zero? Does it need an initializer at all?

comment:23 Changed 5 years ago by nickm

better now? (I had previously thought chan might be null because TLS_CHAN_TO_BASE tolerates NULL. But channel_is_bad_for_new_circs doesn't tolerate NULL, so chan must be set.)

Testing now with latest branch.

comment:24 in reply to:  19 Changed 5 years ago by arma

Replying to nickm:

> maybe it would be smarter to see how long a connection has had no circuits on it, and close it a random time after it has had no circuits.

I think that's what we thought we did. So, I agree. :)

comment:25 Changed 5 years ago by andrea

Code review for the ticket6799_024_v2 branch:

051d599b4adba70312e23148f5e208075b673bae:

  • I think this is the case, but just to double-check: is the only visible behavior which depends on or_conn->idle_timeout closing the connection? This defense depends on there not being any way for the attacker to learn the randomized timeout.
  • All the code here looks correct to me modulo the bug fixed in add7c7c50c2ba7357d1bf22132f8b9985060f4b0

add7c7c50c2ba7357d1bf22132f8b9985060f4b0:

  • This looks fine to me.

b9919b7bae75f831d31ae5d3d11bb0b721bb9aab:

  • I think this patch is correct, but is this another case where things might break if the clock jumps, and where we should use CLOCK_MONOTONIC if available (see the sketch at the end of this review)? That may be a bug in the old code too.
  • We run run_connection_housekeeping() once a second, so this has the same granularity as the randomized timeouts and doesn't reduce the effective entropy, but that's a bit by coincidence. Perhaps a comment to that effect somewhere?

fbc964b41b0bff4e55e90a1245dc65744abbebc2:

  • Isn't have_any_circuits = 0; redundant, since it's initialized to zero?
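
For reference, a minimal sketch of the jump-proof alternative mentioned above, assuming POSIX clock_gettime() with CLOCK_MONOTONIC is available (Tor's own monotonic-time wrappers arrived later, as part of the #10168 work):

  #include <time.h>

  /* Sketch: read a clock that keeps advancing steadily even if the
   * system time is stepped, so idle intervals can't be distorted. */
  static time_t
  monotonic_seconds(void)
  {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec;
  }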

comment:26 in reply to:  25 Changed 5 years ago by nickm

Replying to andrea:

> Code review for the ticket6799_024_v2 branch:

> 051d599b4adba70312e23148f5e208075b673bae:

>   • I think this is the case, but just to double-check: is the only visible behavior which depends on or_conn->idle_timeout closing the connection? This defense depends on there not being any way for the attacker to learn the randomized timeout.

I believe that's the case; the only things that look at it (according to git grep idle_timeout in ticket6799_024_v2) are the timeout check in main.c which closes the connection, and the initialization check in connection_or which checks whether it's already set before setting it again.

> b9919b7bae75f831d31ae5d3d11bb0b721bb9aab:
>
>   • I think this patch is correct, but is this another case where things might break if the clock jumps and we should use CLOCK_MONOTONIC if available? That may be a bug in the old code too.

That's right. We have a ticket in 0.2.6 to do CLOCK_MONOTONIC migration in #10168. In 0.2.4, we don't have tor_gettimeofday_cached_monotonic, so this patch doesn't use it.

I don't think that our failure to use monotonic time here actually makes stuff worse than it was without this patch?

>   • We run run_connection_housekeeping() once a second, so this has the same granularity as the randomized timeouts and doesn't reduce the effective entropy, but that's a bit by coincidence. Perhaps a comment to that effect somewhere?

I'm not sure what that comment should say, which probably means that we need one. I'd be happy to apply any comment patch you have there if you write one?

> fbc964b41b0bff4e55e90a1245dc65744abbebc2:
>
>   • Isn't have_any_circuits = 0; redundant, since it's initialized to zero?

yeah. I'll remove the initializer.

I just added a new fixup patch for that in ticket6799_024_v2, and squashed it all in ticket6799_024_v2_squashed. I'm going to review it all again, then merge.

comment:27 Changed 5 years ago by nickm

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.4.x-final

Looked okay. I tweaked the changelog a bit and merged it to master with 3a2e25969fbb7bf. (There were conflicts, so taking a quick look over that merge commit might not be a bad idea.)

Putting into 0.2.4 milestone for possible backport.

comment:28 Changed 5 years ago by nickm

I'm leery of an 0.2.4 backport here; even though a lot of the net is still on 0.2.4, this is a 325-line patch.

comment:29 Changed 5 years ago by nickm

Milestone: Tor: 0.2.4.x-final → Tor: 0.2.5.x-final
Resolution: fixed
Status: needs_review → closed

No 0.2.4 backport here.
