Opened 5 weeks ago

Last modified 4 weeks ago

#24228 new defect

Tor keeps on creating new circuits even when it's idle

Reported by: asn
Owned by:
Priority: Very High
Milestone: Tor: 0.3.3.x-final
Component: Core Tor/Tor
Version: Tor: 0.3.1.1-alpha
Severity: Normal
Keywords: tor-circuit
Cc: mikeperry
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Something is weird on latest master: Tor keeps creating new circuits even when it's completely idle, at a rate of about 10 circuits per minute. I added a hook in circuit_predict_and_launch_new() and here is what I see:

Nov 10 15:56:26.000 [info] circuit_predict_and_launch_new(): Have 0 clean circs (0 internal), need another exit circ.
Nov 10 15:56:27.000 [info] circuit_predict_and_launch_new(): Have 1 clean circs (0 internal), need another exit circ.
Nov 10 15:56:28.000 [info] circuit_predict_and_launch_new(): Have 2 clean circs (0 uptime-internal, 0 internal), need another hidden service circ.
Nov 10 15:56:29.000 [info] circuit_predict_and_launch_new(): Have 3 clean circs (1 uptime-internal, 1 internal), need another hidden service circ.
Nov 10 15:56:30.000 [info] circuit_predict_and_launch_new(): Have 4 clean circs (2 uptime-internal, 2 internal), need another hidden service circ.
Nov 10 15:56:31.000 [info] circuit_predict_and_launch_new(): Have 4 clean circs (2 uptime-internal, 2 internal), need another hidden service circ.
Nov 10 15:56:42.000 [info] circuit_predict_and_launch_new(): Have 5 clean circs need another buildtime test circ.
Nov 10 15:56:53.000 [info] circuit_predict_and_launch_new(): Have 6 clean circs need another buildtime test circ.
Nov 10 15:57:04.000 [info] circuit_predict_and_launch_new(): Have 7 clean circs need another buildtime test circ.
Nov 10 15:57:15.000 [info] circuit_predict_and_launch_new(): Have 8 clean circs need another buildtime test circ.
Nov 10 15:57:26.000 [info] circuit_predict_and_launch_new(): Have 9 clean circs need another buildtime test circ.
Nov 10 15:57:37.000 [info] circuit_predict_and_launch_new(): Have 10 clean circs need another buildtime test circ.
Nov 10 15:57:48.000 [info] circuit_predict_and_launch_new(): Have 11 clean circs need another buildtime test circ.
Nov 10 15:57:50.000 [info] circuit_predict_and_launch_new(): Have 6 clean circs (0 uptime-internal, 0 internal), need another hidden service circ.
Nov 10 15:57:51.000 [info] circuit_predict_and_launch_new(): Have 7 clean circs (1 uptime-internal, 1 internal), need another hidden service circ.
Nov 10 15:57:52.000 [info] circuit_predict_and_launch_new(): Have 8 clean circs (2 uptime-internal, 2 internal), need another hidden service circ.
Nov 10 15:58:03.000 [info] circuit_predict_and_launch_new(): Have 9 clean circs need another buildtime test circ.
Nov 10 15:58:14.000 [info] circuit_predict_and_launch_new(): Have 10 clean circs need another buildtime test circ.
Nov 10 15:58:25.000 [info] circuit_predict_and_launch_new(): Have 8 clean circs need another buildtime test circ.
Nov 10 15:58:36.000 [info] circuit_predict_and_launch_new(): Have 9 clean circs need another buildtime test circ.
Nov 10 15:58:47.000 [info] circuit_predict_and_launch_new(): Have 10 clean circs need another buildtime test circ.
Nov 10 15:58:52.000 [info] circuit_predict_and_launch_new(): Have 6 clean circs (1 uptime-internal, 1 internal), need another hidden service circ.
Nov 10 15:58:53.000 [info] circuit_predict_and_launch_new(): Have 6 clean circs (1 uptime-internal, 1 internal), need another hidden service circ.
Nov 10 15:58:54.000 [info] circuit_predict_and_launch_new(): Have 7 clean circs (2 uptime-internal, 2 internal), need another hidden service circ.

Seems like Tor continuously thinks it needs HS circs and build time testing circs. This probably causes lots of unneeded traffic on the network.

Not sure how far back in Tor releases this goes.
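
A minimal sketch for quantifying the launch rate from an info log like the one above. It assumes the exact line format shown in this ticket (some of those lines come from the reporter's locally added hook, so a stock Tor may not emit all of them). The names `LINE_RE`, `summarize` and `SAMPLE` are made up for this example.

```python
import re
from collections import Counter

# Matches lines of the form shown in the description, e.g.
# "Nov 10 15:56:42.000 [info] circuit_predict_and_launch_new(): Have 5 clean
#  circs need another buildtime test circ."
LINE_RE = re.compile(
    r"^\w{3} +\d+ (?P<minute>\d{2}:\d{2}):\d{2}\.\d+ \[info\] "
    r"circuit_predict_and_launch_new\(\): .*need another (?P<kind>.+?) circ\.$"
)


def summarize(lines):
    """Count launch decisions per stated reason and per wall-clock minute."""
    by_kind = Counter()
    by_minute = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a circuit_predict_and_launch_new() line
        by_kind[match.group("kind")] += 1
        by_minute[match.group("minute")] += 1
    return by_kind, by_minute


# A few lines copied from the log in the description.
SAMPLE = [
    "Nov 10 15:56:26.000 [info] circuit_predict_and_launch_new(): Have 0 clean circs (0 internal), need another exit circ.",
    "Nov 10 15:56:28.000 [info] circuit_predict_and_launch_new(): Have 2 clean circs (0 uptime-internal, 0 internal), need another hidden service circ.",
    "Nov 10 15:56:42.000 [info] circuit_predict_and_launch_new(): Have 5 clean circs need another buildtime test circ.",
    "Nov 10 15:57:04.000 [info] circuit_predict_and_launch_new(): Have 7 clean circs need another buildtime test circ.",
]

if __name__ == "__main__":
    kinds, minutes = summarize(SAMPLE)
    for kind, count in kinds.most_common():
        print(f"{kind}: {count}")
    for minute, count in sorted(minutes.items()):
        print(f"{minute}: {count} launches")
```

To run it against a real log, pass an open file object (for example the /tmp/iiinfo.log file from the torrc in comment:1) to `summarize` instead of `SAMPLE`.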

Change History (15)

comment:1 Changed 5 weeks ago by asn

This needs a fresh datadir. My torrc is simply:

DataDirectory /tmp/newdatadir/
SocksPort auto
Log notice stdout
Log info file /tmp/iiinfo.log

comment:2 Changed 5 weeks ago by dgoulet

Quick investigation and I think the place to look for a suspect is needs_hs_client_circuits() called by circuit_predict_and_launch_new()

Seems the CBT subsystem will set the circuit idle timeout to 60 seconds as long as we haven't observed at least CBT_DEFAULT_MIN_CIRCUITS_TO_OBSERVE = 100 circuits.

If you look at origin_circuit_new(), there is basically a big if/else condition where the first condition is: if CBT is not disabled and we have not yet observed enough circuits for tor to start using the predicted CBT, the idle timeout is set to 60 sec (IDLE_TIMEOUT_WHILE_LEARNING).

So far so good, but the problem comes with needs_hs_client_circuits(), which wants 3 internal circuits to be open at all times for HS "potential use". So every 60 sec or so (some randomness is added), tor will close at least one circuit and launch a new one because we need an HS circuit at that point.

That loops until we reach the 100 circuits that CBT needs to start using the predictable idle timeout. In other words, a tor client will always open new circuits after expiring idle ones. When a client boots up with a fresh state file, it is a loop of 100 of them ;).

For an active client, this is a bit different but for an idle client, it is a constant stream of new circuits until 100 is reached.
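
The loop described in this comment can be sketched as a toy model. This is not Tor code: `simulate_idle_churn` and its parameters are hypothetical names, and the model assumes circuits build instantly, expire exactly at the idle timeout (no randomness), and that only the 3-circuit HS pool drives relaunches. The separate build-time test circuits visible in the log are ignored, so a real client is expected to reach 100 circuits sooner than this model does.

```python
import heapq


def simulate_idle_churn(target_open=3, idle_timeout=60, circuits_to_observe=100):
    """Toy model: keep `target_open` idle circuits open, replace each one when
    it idles out, and stop once `circuits_to_observe` circuits were launched.

    Returns (seconds_elapsed, circuits_launched).
    """
    now = 0
    launched = 0
    expiries = []  # min-heap of expiry times of the currently open circuits

    # Fill the pool at startup.
    while len(expiries) < target_open:
        heapq.heappush(expiries, now + idle_timeout)
        launched += 1

    # Each time the oldest circuit idles out, launch a replacement.
    while launched < circuits_to_observe:
        now = heapq.heappop(expiries)
        heapq.heappush(expiries, now + idle_timeout)
        launched += 1

    return now, launched


if __name__ == "__main__":
    seconds, launched = simulate_idle_churn()
    print(f"{launched} circuits after {seconds / 60:.0f} minutes")
```

Under these assumptions the pool alone turns over 3 circuits per timeout period, so the time to reach the threshold scales linearly with `idle_timeout`.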

comment:3 Changed 5 weeks ago by dgoulet

Keywords: tor-circuit regression added
Milestone: Tor: 0.3.3.x-final → Tor: 0.3.2.x-final
Priority: Medium → Very High

comment:4 Changed 5 weeks ago by dgoulet

Keywords: backport-031 added

Ok this has been introduced with commit: d5a151a06788c28ac1c50398c6e571d484774f47 (tor-0.3.1.1-alpha).

Which adds this 60 sec idle timeout for the first 100 circuits to origin_circuit_new()

In a nutshell, every tor client >= 0.3.1 exhibits this behavior, and that puts intense pressure on the network. At least no data is going over those circuits.

comment:5 Changed 5 weeks ago by asn

Cc: mikeperry added

comment:6 Changed 5 weeks ago by asn

d5a151a067 seems to be from #17592.

I wonder why 60 secs was chosen in #17592 as the idle timeout while learning CBT, instead of something much greater like 12 hours or something...

comment:7 Changed 5 weeks ago by dgoulet

Continuing to dig here on why, this comes from #16861 (parent of #17592) and the changes file (changes/bug17592) reads:

+   - Increase the intial circuit build timeout testing frequency, to help
+     ensure that ReducedConnectionPadding clients finish learning a timeout
+     before their orconn would expire. The initial testing rate was set back
+     in the days of TAP and before the Tor Browser updater, when we had to be
+     much more careful about new clients making lots of circuits. With this
+     change, a circuit build time is learned in about 15-20 minutes, instead
+     of ~100-120 minutes.

Originally the timeout was 10 minutes:

+#define IDLE_TIMEOUT_WHILE_LEARNING (1*60)
-#define IDLE_TIMEOUT_WHILE_LEARNING (10*60)

Thus it seems that it *is* the intended behavior here. Are we sure about this? CBT will be happy after 15-20 minutes but at the cost of opening 100+ circuits when a fresh client starts up?

comment:8 Changed 5 weeks ago by asn

Milestone: Tor: 0.3.2.x-final → Tor: 0.3.1.x-final
Version: Tor: 0.3.1.1-alpha

comment:9 Changed 5 weeks ago by asn

Milestone: Tor: 0.3.1.x-final → Tor: 0.3.2.x-final

comment:10 Changed 5 weeks ago by mikeperry

The goal is to learn a circuit build timeout within 30 minutes, so that unused orconn connections aren't padded for too long while we learn this timeout (which wastes bandwidth for clients that want less padding). It sounds like we may actually learn it within 10. We could make this 3X slower I suppose.

But I don't really think new clients are going to put that much of a strain on the network with this. The ntor handshake completes in tens of microseconds, IIRC. And the rate of new clients arriving is not that high.

comment:11 in reply to: 10 Changed 5 weeks ago by catalyst

Replying to mikeperry:

> But I don't really think new clients are going to put that much of a strain on the network with this. The ntor handshake completes in tens of microseconds, IIRC. And the rate of new clients arriving is not that high.

Still probably not great for clients that are tightly constrained in terms of battery or network.

comment:12 in reply to: 11 Changed 5 weeks ago by mikeperry

Replying to catalyst:

> Replying to mikeperry:
>
> > But I don't really think new clients are going to put that much of a strain on the network with this. The ntor handshake completes in tens of microseconds, IIRC. And the rate of new clients arriving is not that high.
>
> Still probably not great for clients that are tightly constrained in terms of battery or network.

The battery and bandwidth cost from padding overhead for keeping these connections open longer while learning a timeout will be much worse.

comment:13 in reply to: 10 Changed 5 weeks ago by asn

Replying to mikeperry:

> The goal is to learn a circuit build timeout within 30 minutes, so that unused orconn connections aren't padded for too long while we learn this timeout (which wastes bandwidth for clients that want less padding). It sounds like we may actually learn it within 10. We could make this 3X slower I suppose.
>
> But I don't really think new clients are going to put that much of a strain on the network with this. The ntor handshake completes in tens of microseconds, IIRC. And the rate of new clients arriving is not that high.

Hmm, not sure if it's just new clients. IIRC, CBT is per-guard, so when a client switches to a new guard (or its current guard gets offline/unreachable), it will start learning CBT of its next guard, aka destroy and create tons of idle circs over time.

Why is it important to learn CBT fast? What would happen if we learned CBT over a longer period of time, and used a bigger idle timeout value so that we don't destroy so many idle circuits?

Alternatively, perhaps we could disable the predictive circuit building while we are learning CBT for a guard? Or is this too much effort?

comment:14 in reply to: 13 Changed 4 weeks ago by mikeperry

Replying to asn:

> Replying to mikeperry:
>
> > The goal is to learn a circuit build timeout within 30 minutes, so that unused orconn connections aren't padded for too long while we learn this timeout (which wastes bandwidth for clients that want less padding). It sounds like we may actually learn it within 10. We could make this 3X slower I suppose.
> >
> > But I don't really think new clients are going to put that much of a strain on the network with this. The ntor handshake completes in tens of microseconds, IIRC. And the rate of new clients arriving is not that high.
>
> Hmm, not sure if it's just new clients. IIRC, CBT is per-guard, so when a client switches to a new guard (or its current guard gets offline/unreachable), it will start learning CBT of its next guard, aka destroy and create tons of idle circs over time.

CBT is not per-guard. I first wrote it back when we used 3 guards, and it does not associate any state with a guard id. It is only reset if you time out 18 out of 20 circuits in a rolling window. Otherwise it just gradually adjusts to changes like this.

Maybe you were confusing it with path bias? That info is per guard.
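
The "18 out of 20 circuits in a rolling window" reset rule can be illustrated with a short sketch. Only the two numbers come from this comment. The class name `TimeoutWindow`, its methods, and the choice to clear the window after a reset are assumptions made for the example and are not taken from Tor's source.

```python
from collections import deque


class TimeoutWindow:
    """Rolling window over the most recent circuit outcomes (hypothetical)."""

    def __init__(self, size=20, reset_threshold=18):
        # deque(maxlen=...) silently drops the oldest outcome once full.
        self.recent = deque(maxlen=size)
        self.reset_threshold = reset_threshold

    def record(self, timed_out):
        """Record one circuit outcome.

        Returns True when the number of timeouts in the window reaches the
        threshold, meaning the learned timeout should be reset.
        """
        self.recent.append(bool(timed_out))
        if sum(self.recent) >= self.reset_threshold:
            self.recent.clear()  # assumption: start over with an empty window
            return True
        return False


if __name__ == "__main__":
    window = TimeoutWindow()
    outcomes = [False, False] + [True] * 18  # 2 successes, then 18 timeouts
    for index, timed_out in enumerate(outcomes, start=1):
        if window.record(timed_out):
            print(f"reset triggered at circuit {index}")
```

Because the state is a single window rather than a per-guard table, a guard switch by itself changes nothing here. Only a sustained run of timeouts does.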

> Why is it important to learn CBT fast? What would happen if we learned CBT over a longer period of time, and used a bigger idle timeout value so that we don't destroy so many idle circuits?

As I said to Catalyst, and in my previous comments, I lowered the CBT learning time so that we don't waste client battery and bandwidth on padding while keeping client connections open for huge amounts of time while building test circuits. We're talking about the cost of crypto ops that take microseconds to complete vs the overhead of radio activity, CPU wake time, and bandwidth costs for keeping padded connections open for *hours*.

> Alternatively, perhaps we could disable the predictive circuit building while we are learning CBT for a guard? Or is this too much effort?

I don't think this accomplishes what we want. Again, the point is to get the circuit building out of the way quickly, so we don't waste resources on keeping connections open forever (and needlessly padding them during that time).

That said, 10 minutes *is* 3X faster than we really need. We could slow this down by a factor of three and still get it done inside of the connection idle time for reduced padding clients.

comment:15 Changed 4 weeks ago by dgoulet

Keywords: regression backport-031 removed
Milestone: Tor: 0.3.2.x-final → Tor: 0.3.3.x-final

This seems to be neither a bug nor urgent, so postponing to 0.3.3. Removing misleading keywords as well.
