In git commit 67b38d506 we changed conn timeouts so relays would close idle conns (that is, conns without any circs on them) after 3 minutes rather than 15 minutes.
But it would appear that this change made the problem that Torscan exploits worse. We don't need to be so aggressive about closing connections to/from other relays (besides, they weren't the problem before).
There's a downside here, which is that we end up using more file descriptors on relays. But if most links are used already, we don't use many more. And if most links aren't used already, the Torscan problems are worse.
Our friendly irc person (who has alas disappeared -- hopefully he will have a nice vacation and return refreshed) suggested that a simple hack would be not to close a connection so aggressively if it's a canonical conn.
To protect against netflow logs, it would also be wise to randomize the timeout, so the time we close the TLS conn doesn't reveal exactly when we closed the last circuit.
If this would really help anonymity, it seems easy enough to do in 0.2.5. How about a 15-minute + uniform_random(10 minutes) timeout for canonical connections?
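For concreteness, a minimal sketch of that proposal. The rand_uniform() helper is a stand-in (real code would use a cryptographically secure helper such as Tor's crypto_rand_int()); only the 15-minute base plus uniform 0..10-minute jitter comes from the sentence above.

  #include <stdlib.h>

  /* Stand-in for a cryptographically secure "uniform integer in [0, max)"
   * helper; illustrative only. */
  static int
  rand_uniform(unsigned int max)
  {
    return (int)(rand() % (max ? max : 1));
  }

  /* Pick the idle timeout once, when the canonical connection is set up:
   * 15 minutes plus a uniformly random 0..10 minutes, so the close time
   * leaks less about when the last circuit went away. */
  static int
  choose_canonical_idle_timeout(void)
  {
    return 15 * 60 + rand_uniform(10 * 60);   /* seconds */
  }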
Trac:
Milestone: Tor: unspecified to Tor: 0.2.5.x-final
Priority: normal to major
Looks plausible to me too. I've never fully grokked our is_canonical stuff (for example, how often in practice do we end up with two conns, each of which has one side thinking it's canonical?), but it looks like we're setting it as intended here.
To be clear, if one side runs the new code and one side runs the old code, then the old-code side will still close the connections at the earlier schedule?
And even if both sides are upgraded, then it isn't actually uniformly distributed between 15 and 22.5 minutes, since it will close first for whichever side chose the lower number? (This is fine, I just want to make sure I'm understanding it.)
I'd feel more comfortable here if somebody bumped up the severity of the
log_info(LD_OR,"Expiring non-used OR connection to fd %d (%s:%d) "
line and then ran a network for a while in chutney, to confirm that things act roughly as we expect them to. But failing that, hey, what could go wrong.
To be clear, if one side runs the new code and one side runs the old code, then the old-code side will still close the connections at the earlier schedule?
Yes.
And even if both sides are upgraded, then it isn't actually uniformly distributed between 15 and 22.5 minutes, since it will close first for whichever side chose the lower number? (This is fine, I just want to make sure I'm understanding it.)
Hm. Yes.
I'd feel more comfortable here if somebody bumped up the severity of [...] and then ran a network for a while in chutney
The problem there is that we don't have a chutney user simulator for this stuff, so the network mostly isn't used except if you're manually curl-ing over it. So I can test that, but it will only be a bit useful. Still, it could let us know if we have any truly egregious problems I guess.
So, randomization and idle timeouts seem to be correctly handled. Non-canonical connections are getting killed by their idle timeouts:
net/nodes/004r/notice.log:May 15 17:06:24.000 [notice] Expiring non-used OR connection 0x7f103c1e6060 to fd 21 (127.0.0.1:49256) [idle 239, timeout 239, canonical=0].
net/nodes/004r/notice.log:May 15 17:28:17.000 [notice] Expiring non-used OR connection 0x7f103c1bb860 to fd 21 (127.0.0.1:49655) [idle 252, timeout 252, canonical=0].
net/nodes/004r/notice.log:May 15 17:40:21.000 [notice] Expiring non-used OR connection 0x7f103c259270 to fd 21 (127.0.0.1:49904) [idle 244, timeout 244, canonical=0].
net/nodes/006r/notice.log:May 15 17:14:04.000 [notice] Expiring non-used OR connection 0x7f018ab040c0 to fd 24 (127.0.0.1:51470) [idle 205, timeout 205, canonical=0].
And likewise for canonical connections on relays:
net/nodes/002a/notice.log:May 15 17:18:36.000 [notice] Expiring non-used OR connection 0x7fe7b325a590 to fd 24 (127.0.0.1:5003) [idle 936, timeout 936, canonical=1].
net/nodes/003r/notice.log:May 15 16:31:03.000 [notice] Expiring non-used OR connection 0x7f5bc23cceb0 to fd 30 (127.0.0.1:5002) [idle 900, timeout 900, canonical=1].
net/nodes/003r/notice.log:May 15 17:47:16.000 [notice] Expiring non-used OR connection 0x7f5bc23d95d0 to fd 24 (127.0.0.1:5002) [idle 1249, timeout 1249, canonical=1].
net/nodes/007r/notice.log:May 15 17:12:31.000 [notice] Expiring non-used OR connection 0x7fda748dc6a0 to fd 22 (127.0.0.1:5001) [idle 1100, timeout 1100, canonical=1].
But for canonical connections on clients, the idle timeouts seem to be somewhat irrelevant; something else in run_connection_housekeeping is keeping them alive for a long time past their idle timeouts:
net/nodes/009c/notice.log:May 15 15:31:19.000 [notice] Expiring non-used OR connection 0x7f9ec576e3e0 to fd 16 (127.0.0.1:5004) [idle 4556, timeout 955, canonical=1].
net/nodes/009c/notice.log:May 15 16:47:47.000 [notice] Expiring non-used OR connection 0x7f9ec5793180 to fd 5 (127.0.0.1:5002) [idle 3662, timeout 1273, canonical=1].
net/nodes/009c/notice.log:May 15 16:50:53.000 [notice] Expiring non-used OR connection 0x7f9ec5772bb0 to fd 13 (127.0.0.1:5006) [idle 7409, timeout 907, canonical=1].
net/nodes/009c/notice.log:May 15 16:53:59.000 [notice] Expiring non-used OR connection 0x7f9ec578b8b0 to fd 14 (127.0.0.1:5004) [idle 3677, timeout 1274, canonical=1].
Will investigate further, unless somebody happens to know just what the issue is.
Here's what happens right before one of those unbalanced client-side canonical channel expirations:
May 15 22:54:07.000 [debug] circuit_expire_old_circuits_clientside(): Closing circuit that has been unused for 3626990 msec.
May 15 22:54:07.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f0846e25ff0 for circ_id 2927002286, channel ID 19 (0x7f0846e65b20)
May 15 22:54:07.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f0846e25ff0 for circ_id 2927002286, channel ID 19 (0x7f0846e65b20)
May 15 22:54:07.000 [debug] circuitmux_append_destroy_cell(): Cmux at 0x7f0846e4b2b0 queued a destroy for circ 2927002286, cmux counter is now 1, global counter is now 1
May 15 22:54:07.000 [debug] circuitmux_append_destroy_cell(): Primed a buffer.
May 15 22:54:07.000 [debug] channel_write_packed_cell(): Writing packed_cell_t 0x7f0846e65d38 to channel 0x7f0846e65b20 with global ID 19
May 15 22:54:07.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f0846e25ff0 for circ_id 2927002286, channel ID 19 (0x7f0846e65b20)
May 15 22:54:07.000 [debug] circuitmux_notify_xmit_destroy(): Cmux at 0x7f0846e4b2b0 sent a destroy, cmux counter is now 0, global counter is now 0
May 15 22:54:07.000 [debug] channel_send_destroy(): Sending destroy (circID 2927002286) on channel 0x7f0846e65b20 (global ID 19)
May 15 22:54:07.000 [notice] Expiring non-used OR connection 0x7f0846e28060 to fd 5 (127.0.0.1:5004) [idle 3627, timeout 1274, canonical=1].
Is it possible that when we fixed #7912 (moved), we didn't make the client's destroy cell update timestamp_last_added_nonpadding? I'm not sure this is actually a bad thing. Opened #12023 (moved) for that issue, though.
There's an additional piece of trouble here: When a client kills the last circuit on a connection that has been idle for a while, the relay will still probably want to close the connection immediately, since the relay hasn't sent anything for quite a long time. Maybe that's another ticket?
Since we only check timestamp_last_added_nonpadding to see whether a connection is idle when there are no circuits on it, maybe it would be smarter to see how long a connection has had no circuits on it, and close it a random time after the last circuit went away.
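A sketch of that alternative, using an illustrative stand-in for the relevant connection fields (the field names below are assumptions, not what the branch actually uses):

  #include <time.h>

  /* Illustrative stand-in for the relevant bits of the connection state. */
  typedef struct {
    int n_circuits;                      /* circuits currently on this conn */
    time_t timestamp_last_had_circuits;  /* when n_circuits last dropped to 0 */
    int idle_timeout;                    /* randomized value chosen at setup */
  } idle_conn_state_t;

  /* Expire only once the connection has been circuit-free for its whole
   * randomized idle_timeout, instead of keying off the last non-padding
   * cell written. */
  static int
  conn_should_expire(const idle_conn_state_t *c, time_t now)
  {
    if (c->n_circuits > 0)
      return 0;                          /* still in use; never expire */
    return now >= c->timestamp_last_had_circuits + c->idle_timeout;
  }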
I have a branch with a fix for the above issues as "ticket6799_024_v2". It has one additional commit that the others didn't have. It seems pretty straightforward to me, but review is of course needed. Testing it now.
One wrinkle here is that, with ticket6799_024_v2, client->relay connections with no circuits will always be closed by the relay, not the client, since the relay will see a non-canonical connection and the client will see a canonical connection. I don't think this hurts anything, since neither party is closing "immediately after a DESTROY" as they did before (See #12023 (moved)).
better now? (I had previously thought chan might be null because TLS_CHAN_TO_BASE tolerates NULL. But channel_is_bad_for_new_circs doesn't tolerate NULL, so chan must be set.)
I think this is the case, but just to double-check: is the only visible
behavior which depends on or_conn->idle_timeout closing the connection?
This defense depends on there not being any way for the attacker to learn
the randomized timeout.
All the code here looks correct to me modulo the bug fixed in
add7c7c50c2ba7357d1bf22132f8b9985060f4b0
add7c7c50c2ba7357d1bf22132f8b9985060f4b0:
This looks fine to me.
b9919b7bae75f831d31ae5d3d11bb0b721bb9aab:
I think this patch is correct, but is this another case where things might
break if the clock jumps and we should use CLOCK_MONOTONIC if available?
That may be a bug in the old code too.
We run run_connection_housekeeping() once a second, so this has the same
granularity as the randomized timeouts and doesn't reduce the effective
entropy, but that's a bit by coincidence. Perhaps a comment to that
effect somewhere?
fbc964b41b0bff4e55e90a1245dc65744abbebc2:
Isn't have_any_circuits = 0; redundant, since it's initialized to zero?
I think this is the case, but just to double-check: is the only visible
behavior which depends on or_conn->idle_timeout closing the connection?
This defense depends on there not being any way for the attacker to learn
the randomized timeout.
I believe that's the case; the only things that look at it (according to git grep idle_timeout in ticket6799_024_v2) are the timeout check in main.c which closes the connection, and the initialization check in connection_or which checks whether it's already set before setting it again.
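For illustration, the initialization check described there is a set-once pattern, sketched here with the same illustrative struct and helper as in the earlier sketches (real names in the branch may differ):

  /* Only pick the randomized timeout the first time, so the value stays
   * stable for the life of the connection; the housekeeping check in
   * main.c then just compares against it, as sketched above. */
  static void
  conn_init_idle_timeout(idle_conn_state_t *c, int is_canonical)
  {
    if (c->idle_timeout)
      return;                            /* already chosen; don't reroll */
    c->idle_timeout = is_canonical ? choose_canonical_idle_timeout()
                                   : 3 * 60;  /* non-canonical value illustrative */
  }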
b9919b7bae75f831d31ae5d3d11bb0b721bb9aab:
I think this patch is correct, but is this another case where things might
break if the clock jumps and we should use CLOCK_MONOTONIC if available?
That may be a bug in the old code too.
That's right. We have a ticket in 0.2.6 to do CLOCK_MONOTONIC migration in #10168 (moved). In 0.2.4, we don't have tor_gettimeofday_cached_monotonic, so this patch doesn't use it.
I don't think that our failure to use monotonic time here actually makes stuff worse than it was without this patch?
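For reference, the 0.2.6-era idea mentioned above is to measure idle intervals on a monotonic clock so wall-clock jumps can't prematurely expire (or immortalize) a connection. A minimal standalone sketch, not what the 0.2.4 branch does:

  #include <time.h>

  /* Seconds elapsed since 'then', measured on CLOCK_MONOTONIC so that
   * NTP steps or manual clock changes don't distort the interval. */
  static long
  monotonic_seconds_since(const struct timespec *then)
  {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (long)(now.tv_sec - then->tv_sec);
  }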
We run run_connection_housekeeping() once a second, so this has the same
granularity as the randomized timeouts and doesn't reduce the effective
entropy, but that's a bit by coincidence. Perhaps a comment to that
effect somewhere?
I'm not sure what that comment should say, which probably means that we need one. I'd be happy to apply any comment patch you have there if you write one?
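For what it's worth, a comment along these lines might capture the point (wording is only a suggestion, not from the tree):

  /* run_connection_housekeeping() runs once per second, which happens to
   * match the one-second granularity of the randomized idle_timeout
   * values, so checking here neither coarsens the timeout nor reduces the
   * effective entropy of the randomization.  Revisit this if either the
   * call frequency or the timeout granularity changes. */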
fbc964b41b0bff4e55e90a1245dc65744abbebc2:
Isn't have_any_circuits = 0; redundant, since it's initialized to zero?
Yeah. I'll remove the initializer.
I just added a new fixup patch for that in ticket6799_024_v2, and squashed it all in ticket6799_024_v2_squashed. I'm going to review it all again, then merge.
Looked okay. I tweaked the changelog a bit and merged it to master with 3a2e25969fbb7bf. (There were conflicts, so taking a quick look over that merge commit might not be a bad idea.)
Putting into 0.2.4 milestone for possible backport.
Trac: Milestone: Tor: 0.2.5.x-final to Tor: 0.2.4.x-final