Try to use only one canonical connection

changed milestone to %Tor: 0.3.1.x-final

added TorCoreTeam-postponed-201604 component::core tor/tor milestone::Tor: 0.3.1.x-final nickm-deferred-20160905 nickm-deferred-20161005 owner::mikeperry parent::16861 points::see-parent priority::high resolution::fixed review-group-9 reviewer::nickm severity::normal sponsor2 status::closed type::defect labels

I'm also tempted to patch channel_tls_matches_target_method() so that it allows extend cells to be sent on an orconn if they match either the descriptor address or the actual originating address of the orconn. This would also help converge on a single orconn for relays whose outbound traffic comes from a different IP than their inbound traffic.

However, it will also mean that it becomes possible to steal a relay's keys and start making TLS connections to all other relays from anywhere on the Internet, and wait for those connections to become old enough to be chosen for extends. This issue may outweigh the corner case. It probably does, in fact. Happy to hear thoughts, though. Maybe there are other things that would prevent this attack?

FYI: This commit implements the ideas in the description: https://gitweb.torproject.org/mikeperry/tor.git/commit/?h=netflow_padding-v3-squashed&id=86f950da4675a4c247236dc352bbeb3408f040eb

I chose not to mess with channel_tls_matches_target_method() because of the issues in comment:1.

I've tested it a bit in chutney. It seems to be behaving sanely. If this looks ok to do, I will bang on it harder and do some more thorough chutney testing.

Trac:
Status: new to needs_review

Roger suggested that we have some kind of check in the relays themselves for multiple connections, so we can see how bad this still is in practice in case this doesn't solve it all the way. I think such a check may be expensive, but we could do it on SIGUSR1, perhaps? Or once per day?
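For illustration only, here is a minimal sketch of what such a duplicate check could count. The `toy_conn_t` type and `count_peers_with_duplicates()` are hypothetical stand-ins, not Tor's real structures or functions, and the quadratic scan is chosen for clarity since the check would run rarely (on a signal or once per day).

```c
#include <stddef.h>
#include <string.h>

#define ID_LEN 20 /* length of a relay identity digest in this toy model */

/* Hypothetical, simplified stand-in for an open OR connection. */
typedef struct {
  unsigned char identity[ID_LEN]; /* identity digest of the peer relay */
} toy_conn_t;

/* Return the number of distinct peers that have more than one open
 * connection. Each peer is counted at most once, at its first occurrence. */
size_t
count_peers_with_duplicates(const toy_conn_t *conns, size_t n)
{
  size_t dup_peers = 0;
  for (size_t i = 0; i < n; i++) {
    int seen_before = 0;
    /* Skip identities that already appeared earlier in the array. */
    for (size_t j = 0; j < i; j++) {
      if (memcmp(conns[i].identity, conns[j].identity, ID_LEN) == 0) {
        seen_before = 1;
        break;
      }
    }
    if (seen_before)
      continue;
    /* Look for at least one later connection to the same peer. */
    for (size_t j = i + 1; j < n; j++) {
      if (memcmp(conns[i].identity, conns[j].identity, ID_LEN) == 0) {
        dup_peers++;
        break;
      }
    }
  }
  return dup_peers;
}
```

A relay could log this count periodically to see how often it holds multiple connections to the same peer.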

Trac:
Status: needs_review to needs_revision

Ok, after implementing the periodic check that Roger suggested, and after much chutney testing and code spelunking, I changed strategies here. Instead of granting canonical status to more things, I decided to add some checks so that relays are more likely to agree on their canonical status (inspired in part by Roger's comment at https://trac.torproject.org/projects/tor/ticket/6799#comment:14). For this, I use NETINFO peer address information to compare against what we are advertising for our router address, and if they disagree, the other side probably won't think we are canonical.
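As a rough sketch of the NETINFO comparison described above: the struct and function names below are hypothetical, and addresses are reduced to IPv4 values in host order to keep the example short. The real code works with Tor's own address types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state for this toy model. */
typedef struct {
  uint32_t advertised_addr;   /* address we publish in our descriptor */
  uint32_t netinfo_seen_addr; /* address the peer's NETINFO cell says it
                               * sees us connecting from */
} toy_link_t;

/* If the address the peer observed for us differs from the one we
 * advertise, the peer will probably not consider this connection
 * canonical on its side. */
bool
peer_probably_thinks_we_are_canonical(const toy_link_t *link)
{
  return link->netinfo_seen_addr == link->advertised_addr;
}
```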

I then changed channel_is_better() to prefer not only older connections, but also connections where we think the peer will decide we are canonical. With these updates to channel_is_better(), connection_or_set_bad_connections() will mark all of these "half-canonical" orconns as bad for circs if we ever have a "full-canonical" option available for use instead. It will also mark younger orconns as bad for circs, since it is actually better to prefer old orconns when defending against Torscan attacks. Orconns will still live for a maximum of 1 week regardless; I did not change that.
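The preference order can be sketched roughly as follows. This is a toy model, not the actual channel_is_better(): the `toy_channel_t` fields are hypothetical, and ranking canonical status ahead of age is an assumption made for the example.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical, simplified channel state. */
typedef struct {
  bool is_canonical;          /* we consider the peer's address canonical */
  bool peer_thinks_canonical; /* we expect the peer to consider us canonical */
  time_t created;             /* creation time; smaller means older */
} toy_channel_t;

/* Return true if channel a should be preferred over channel b. */
bool
toy_channel_is_better(const toy_channel_t *a, const toy_channel_t *b)
{
  /* A canonical channel beats a non-canonical one. */
  if (a->is_canonical != b->is_canonical)
    return a->is_canonical;
  /* "Full-canonical" beats "half-canonical". */
  if (a->peer_thinks_canonical != b->peer_thinks_canonical)
    return a->peer_thinks_canonical;
  /* Otherwise prefer the older channel. */
  return a->created < b->created;
}
```

Under this ordering, a younger full-canonical channel wins over an older half-canonical one, so the half-canonical channel is the one that ends up marked bad for new circuits.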

Here is the commit: https://gitweb.torproject.org/mikeperry/tor.git/commit/?h=netflow_padding-v4&id=d0a3ddd7814745a0760cc38b7d86e113e9be8b51

Oh, it also turns out that we're already vulnerable to the attack in comment:1, because all a rogue node has to do is list its rogue address in its NETINFO cells, and it gets marked canonical. It is only non-canonical connections that get their real_addr checked by channel_tls_matches_target_method(). Do we care about that? I did not change that behavior in this patch at all. I merely noted the issue with an XXX in the source.

Trac:
Status: needs_revision to needs_review

This patch looks good overall.

Just a few questions:

channel_check_for_duplicates() says:

This function is similar to connection_or_set_bad_connections(),
and probably could be adapted to replace it, if it was modified to actually
take action on any of these connections.

Are we waiting to see what it logs before using it to replace connection_or_set_bad_connections()?

Replying to mikeperry:

Oh, it also turns out that we're already vulnerable to the attack in comment:1, because all a rogue node has to do is list its rogue address in its NETINFO cells, and it gets marked canonical. It is only non-canonical connections that get their real_addr checked by channel_tls_matches_target_method(). Do we care about that? I did not change that behavior in this patch at all. I merely noted the issue with an XXX in the source.

Can we check real_addr for all connections? Will it take a long time to code up? Does it impact performance?

And a nitpick:

In check_canonical_channels_callback:

I think public_server_mode(options) is the standard way of saying !options->BridgeRelay && server_mode(options). I think they do the same thing, but it might be worth checking.

How do these changes affect the attack described in #13155 (moved)? "I can use an extend cell to remotely determine whether two relays have a connection open"

Replying to teor:

This patch looks good overall.

Just a few questions:

channel_check_for_duplicates() says:

This function is similar to connection_or_set_bad_connections(),
and probably could be adapted to replace it, if it was modified to actually
take action on any of these connections.

Are we waiting to see what it logs before using it to replace connection_or_set_bad_connections()?

I think so. That or a switch to a datagram transport, or some other wider effort to completely remove all of the connection_or layer.

Replying to mikeperry:

Oh, it also turns out that we're already vulnerable to the attack in comment:1, because all a rogue node has to do is list its rogue address in its NETINFO cells, and it gets marked canonical. It is only non-canonical connections that get their real_addr checked by channel_tls_matches_target_method(). Do we care about that? I did not change that behavior in this patch at all. I merely noted the issue with an XXX in the source.

Can we check real_addr for all connections? Will it take a long time to code up? Does it impact performance?

I think the main problem is that if we don't allow this netinfo mechanism, we need to find a different way for IPv6 connections to become 'canonical'. If we do care about this (and maybe we do), I think it should probably be a different ticket to change this behavior. The right way to do it probably means checking that the netinfo cell stuff matches at least something from the descriptor. But maybe that will have other issues? Nick or Andrea probably need to chime in on that topic.
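To make the "matches at least something from the descriptor" idea concrete, here is a minimal sketch. The function name is hypothetical and addresses are compared as plain strings, which is an assumption for brevity. A real check would compare parsed addresses, since the same IPv6 address can be written in more than one way.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if at least one address claimed in the NETINFO cell also
 * appears in the relay's descriptor. Addresses are compared textually. */
bool
netinfo_matches_descriptor(const char *const *netinfo_addrs, size_t n_netinfo,
                           const char *const *desc_addrs, size_t n_desc)
{
  for (size_t i = 0; i < n_netinfo; i++) {
    for (size_t j = 0; j < n_desc; j++) {
      if (strcmp(netinfo_addrs[i], desc_addrs[j]) == 0)
        return true;
    }
  }
  return false;
}
```

With a check like this, an IPv6 connection could still become canonical when its address is listed in the descriptor, while an address that appears only in the NETINFO cell would not be enough.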

And a nitpick:

In check_canonical_channels_callback:

I think public_server_mode(options) is the standard way of saying !options->BridgeRelay && server_mode(options). I think they do the same thing, but it might be worth checking.

Fixed in another fixup commit.

Replying to mikeperry:

Replying to teor:

Replying to mikeperry:

Oh, it also turns out that we're already vulnerable to the attack in comment:1, because all a rogue node has to do is list its rogue address in its NETINFO cells, and it gets marked canonical. It is only non-canonical connections that get their real_addr checked by channel_tls_matches_target_method(). Do we care about that? I did not change that behavior in this patch at all. I merely noted the issue with an XXX in the source.

Can we check real_addr for all connections? Will it take a long time to code up? Does it impact performance?

I think the main problem is that if we don't allow this netinfo mechanism, we need to find a different way for IPv6 connections to become 'canonical'. If we do care about this (and maybe we do), I think it should probably be a different ticket to change this behavior. The right way to do it probably means checking that the netinfo cell stuff matches at least something from the descriptor. But maybe that will have other issues? Nick or Andrea probably need to chime in on that topic.

I'm working on IPv6 client support at the moment. These sorts of complexities are one of the reasons I don't even want to touch the IPv6 server code.

And a nitpick:

In check_canonical_channels_callback:

I think public_server_mode(options) is the standard way of saying !options->BridgeRelay && server_mode(options). I think they do the same thing, but it might be worth checking.

Fixed in another fixup commit.

The fixups all look good.

Are we going to merge this? (Pending comment from Nick or Andrea on canonical IPv6 connections.)

Oh crud, this wasn't on my review queue because it didn't have a milestone.

Trac:
Milestone: N/A to Tor: 0.2.8.x-final

I've squashed the three major commits here into a new "netflow_padding-v4_squashed" branch so I can review them one-by-one.

"Your relay has a very large number of connections" should probably emphasize that it has a large number of connections per relay.

Otherwise this looks pretty nice...

Though going to "prefer new connections" means that rotating connections is going to be nigh-impossible, right?

The log message and some comments for the "prefer new connection" connection rotation behavior are in 8736b01c7fe0d599214067e20fe00d78c9f9de81 of mikeperry/netflow_padding-v4_squashed+rebased.

These seem like features, or like other stuff unlikely to be possible this month. Bumping them to 0.2.9.

Trac:
Milestone: Tor: 0.2.8.x-final to Tor: 0.2.9.x-final

Trac:
Reviewer: N/A to N/A
Points: N/A to see-parent

Every postponed needs_review ticket should get a review in April

Trac:
Keywords: N/A deleted, TorCoreTeam201604 added

Trac:
Reviewer: N/A to nickm

I will not be getting these revised and reviewed this week. I hold out hope for May. Sorry mike. Please let me know whether you want to revise them wrt my handles/timing patches, or whether I should. I'm happy either way.

Trac:
Keywords: TorCoreTeam201604 deleted, TorCoreTeam-postponed-201604, TorCoreTeam201605 added

My current understanding here is that mike means to revise this branch based on other merges we're doing. Moving these to needs_revision in the meantime. Please let me know if I'm incorrect.

Trac:
Status: needs_review to needs_revision

Remove "TorCoreTeam201605" keyword. The time machine is broken.

Trac:
Keywords: TorCoreTeam201605 deleted, N/A added

Deferring many tickets that are in needs_revision with no progress noted for a while, where I think they could safely wait till 0.3.0 or later.

Please feel free to move these back to 0.2.9 if you finish the revisions soon.

Trac:
Milestone: Tor: 0.2.9.x-final to Tor: 0.2.???
Keywords: N/A deleted, nickm-deferred-20160905 added

Alright. I switched the code over to using the new handle, monotonic timer, and timer wheel abstractions. All unit tests pass without leaks from this code (though the unit tests have grown new memory leaks of their own).

mikeperry/netflow_padding-v6. The commit specific to this bug is 1ce02c50b5accd5059922e980745a43ca4ecfee5.

Trac:
Milestone: Tor: 0.2.??? to Tor: 0.2.9.x-final
Status: needs_revision to needs_review

Trac:
Keywords: N/A deleted, review-group-9 added

I've commented a little on https://gitlab.com/nickm_tor/tor/merge_requests/8 .

Trac:
Status: needs_review to needs_revision

Trac:
Status: needs_revision to needs_review

Deferring big/risky-feature things (even the ones I really love!) to 0.3.0. Please argue if I'm wrong.

Trac:
Keywords: N/A deleted, nickm-deferred-20161005 added
Milestone: Tor: 0.2.9.x-final to Tor: 0.3.0.x-final

Trac:
Keywords: N/A deleted, review-group-11 added

Trac:
Keywords: review-group-11 deleted, review-group-12 added

Trac:
Keywords: review-group-12 deleted, review-group-13 added

And that's all for review-group-13.

Trac:
Keywords: review-group-13 deleted, review-group-14 added

Trac:
Milestone: Tor: 0.3.0.x-final to Tor: 0.3.1.x-final

Trac:
Keywords: review-group-14 deleted, N/A added

Trac:
Keywords: N/A deleted, review-group-16 added

Trac:
Priority: Medium to High
Keywords: review-group-16 deleted, review-group-16 sponsor2 added

Trac:
Keywords: review-group-16 sponsor2 deleted, sponsor2 added

merging parent

Trac:
Resolution: N/A to fixed
Status: needs_review to closed

closed

mentioned in issue #19416 (moved)

mentioned in issue #24841 (moved)

mentioned in issue #33899 (moved)

moved to tpo/core/tor#17604 (closed)

mentioned in issue tpo/core/tor#33899 (closed)
