Unfortunately, this logic (in connection_or_set_canonical()) is kind of a mess. Relays and clients are treated the same, and client connections are also kept alive for an additional hour by predictive circuit building even when otherwise idle, whereas relay connections are not.
If we treat relays and clients separately for this timeout, we could reduce total client keep-alive time significantly (down to 30 minutes or so) while significantly increasing relay connection lifetime. This should reduce total connection counts on relays and improve defenses against Torscan.
This is also needed to put reasonable bounds on padding overhead in #16861 for mobile clients. Furthermore, aside from the operators running middle relays behind junky home routers who will complain about connection overflow, a better-connected Tor network is a good idea for many reasons (not just Torscan).
Ok, I think I want to combine CircuitIdleTimeout and PredictedPortsRelevanceTime into a single option (call it CircuitsAvailableTimeout?) and also randomize the value by some range when it is used.
Where CircuitIdleTimeout is currently used, I would sample a random timeout value on circuit creation and store it in origin_circuit_t. Where PredictedPortsRelevanceTime is used, I think the right thing to do is to sample a new value whenever the list of predicted ports is empty.
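Roughly, the per-circuit sampling could look something like the following sketch. The helper name, the ±50% range, and the use of a CircuitsAvailableTimeout field are assumptions for illustration, not the actual branch.

```c
/* Sketch only: sample a randomized timeout, in seconds, so that idle
 * circuits don't all expire in lockstep. The [base/2, 3*base/2] range
 * is illustrative. The sampled value would be stored on the
 * origin_circuit_t at creation time. */
static int
sample_circuits_available_timeout(const or_options_t *options)
{
  const int base = options->CircuitsAvailableTimeout;
  /* crypto_rand_int(n) returns a uniform value in [0, n). */
  return base/2 + crypto_rand_int(base + 1);
}
```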
For the TLS connection timeout, I want to explicitly separate canonical relay connections from client connections and non-canonical relay connections, and make the relay connection timeout be on the order of an hour (randomized +/- 25%) and be controlled by a consensus parameter. The client TLS connection timeout can be much shorter, since client TLS lifespan will be governed primarily by circuit activity (which will be controlled via CircuitsAvailableTimeout).
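As a rough illustration of that split (the consensus parameter name, the constants, and the helper are assumptions here, not the actual patch):

```c
/* Sketch only: pick a TLS connection timeout, in seconds, depending on
 * whether this is a canonical relay-to-relay connection. */
static int
get_conn_timeout(int is_canonical_relay_conn)
{
  if (is_canonical_relay_conn) {
    /* Canonical relay connections: ~1 hour, taken from a consensus
     * parameter and randomized +/- 25% so relays don't churn in sync. */
    const int32_t base =
      networkstatus_get_param(NULL, "nf_conntimeout_relays",
                              60*60 /* default */,
                              60 /* min */, 7*24*60*60 /* max */);
    return 3*base/4 + crypto_rand_int(base/2 + 1);
  } else {
    /* Clients and non-canonical connections: much shorter, since their
     * lifetime is governed mainly by circuit activity anyway. */
    const int base = 30*60; /* 30 minutes, illustrative */
    return base/2 + crypto_rand_int(base + 1);
  }
}
```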
With these two sets of changes, it will be much easier to control how long TLS connections live (and thus more easily control network activity and padding), for both relays and clients.
I've tested it a bit in chutney and it has unit tests. I'm posting it now for a quick review of any major refactoring/structural issues before I do more involved chutney/live network testing.
I like the change from port 80 as the default predicted port to port 443.
In options_validate:
Should we put minimum and maximum values on CircuitsAvailableTimeout, like the previous code for PredictedPortsRelevanceTime?
I think the only reason we had minimum and maximum values here was concern about a discrepancy between PredictedPortsRelevanceTime and CircuitIdleTimeout. CircuitIdleTimeout never had any ranges or limits. At least, that was my read of https://trac.torproject.org/projects/tor/ticket/9176#comment:7.
A nitpick:
In channelpadding_get_channel_idle_timeout:
I think server_mode() is the standard way of checking options->BridgeRelay || !options->ORPort_set. It does a few more checks, as far as I recall.
Ok, I fixed this and another instance in fixup commits atop netflow_padding-v4.
The documentation for channelpadding_get_channel_idle_timeout should describe units. It's seconds, right?
The client-or-noncanonical case uses magic numbers. Should it use a network parameter instead, like the other cases?
Let's document what a "circuit idle timeout" is.
How come this one doesn't look at any options like channelpadding_get_circuits_available_timeout does? Should it be looking at CircuitsAvailableTimeout?
channelpadding_get_circuits_available_timeout():
Let's also document in the function comment what exactly this timeout controls.
predicted_ports_prediction_time_remaining():
time_t differences are not guaranteed to fit into an int.
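For what it's worth, a clamped conversion is one way to handle that (sketch only; the helper name is made up):

```c
#include <limits.h>

/* Sketch only: clamp a time_t difference into an int number of seconds
 * instead of silently truncating it. */
static int
time_remaining_as_int(time_t expires, time_t now)
{
  if (expires <= now)
    return 0;
  if (expires - now > INT_MAX)
    return INT_MAX;
  return (int)(expires - now);
}
```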
General stuff:
Should we be looking at monotonic time for any of this?
These issues should all be fixed in e1c57775ed8a09a8767bf4d74db94d04c9fe1659 in mikeperry/netflow_padding-v4_squashed+rebased. I opted to explicitly check for time_t overflow/underflow and associated clock jumps rather than use monotonic time. I don't like the idea of waiting for the clock to catch up with itself here, either. I'd rather reset the timeout in that case.
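The clock-jump handling described there would be something in this spirit (sketch only; the field and helper names are invented for illustration):

```c
/* Sketch only: if the wall clock has jumped backwards (or the stored
 * timestamp is somehow in the future), restart the timeout rather than
 * waiting for the clock to catch up with itself. */
static int
padding_timeout_elapsed(time_t *last_activity, time_t timeout, time_t now)
{
  if (now < *last_activity) {
    /* Clock jumped backwards: reset and start the timeout over. */
    *last_activity = now;
    return 0;
  }
  return (now - *last_activity) >= timeout;
}
```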
I will not be getting these revised and reviewed this week. I hold out hope for May. Sorry, Mike. Please let me know whether you want to revise them wrt my handles/timing patches, or whether I should. I'm happy either way.
My current understanding here is that Mike means to revise this branch based on other merges we're doing. Moving these to needs_revision in the meantime. Please let me know if I'm incorrect.
Alright. I switched the code over to using the new handle, monotonic timer, and timer wheel abstractions. All unit tests pass without leaks from this code (though the unit tests have grown new memory leaks of their own).
The branch is mikeperry/netflow_padding-v6; the commit specific to this bug is 0e709402d9fe5cb48188bf65e335dcdedadb268b.
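For context, the timer-wheel abstraction that branch switches to is used roughly like this (a sketch under my reading of the timers API, not the code from that commit; the channel handling is elided):

```c
#include "compat_time.h" /* monotime_t */
#include "timers.h"      /* tor_timer_t, timer_new, timer_schedule */

/* Sketch only: a one-shot padding timer scheduled on Tor's timer wheel. */
static void
padding_cb(tor_timer_t *timer, void *arg, const monotime_t *now)
{
  (void)timer; (void)now;
  /* ... send a padding cell on the channel passed in as arg ... */
  (void)arg;
}

static tor_timer_t *
schedule_padding(void *chan, int delay_msec)
{
  struct timeval tv = { delay_msec / 1000, (delay_msec % 1000) * 1000 };
  tor_timer_t *timer = timer_new(padding_cb, chan);
  timer_schedule(timer, &tv);
  return timer;
}
```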
Trac: Milestone: Tor: 0.2.??? to Tor: 0.2.9.x-final; Status: needs_revision to needs_review
Okay, review done. I have probably misunderstood a few important things. Please feel free to focus on correcting my misunderstandings here, so that I can say smarter things about the code.