Opened 4 years ago

Last modified 10 months ago

#14006 needs_information defect

Hidden service error: "We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits..."

Reported by: asn Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-hs, circuit-management, scaling, 033-triage-20180320, 033-removed-20180320, 033-removed-20180403
Cc: fdsfgs@…, asn Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

The HS operator in https://lists.torproject.org/pipermail/tor-dev/2014-December/007956.html saw this Tor log:

Dec 11 13:08:59.000 [notice] We'd like to launch a circuit to handle a
connection, but we already have 32 general-purpose client circuits
pending. Waiting until some finish. [268 similar message(s) suppressed
in last 600 seconds]

His network seems to be flaky, so this might just be the result of a bad connection. However, we might want to investigate a bit further, since that message was suppressed 268 times in 600 seconds.

I can also imagine this happening on very busy hidden services: if 32 clients try to access the service at roughly the same time, Tor tries to establish 32 circuits at once, which would trip this limit.

Child Tickets

Ticket   Status             Owner   Summary                                                               Component
#24973   needs_information          Tor should be more gentle when launching dozens of circuits at once   Core Tor/Tor

Change History (25)

comment:1 Changed 4 years ago by dgoulet

I have a feeling it's also linked with this error (see the first log in the email).

Dec 12 18:10:27.000 [notice] Your Guard SECxFreeBSD64
($D7DB8E82604F806766FC3F80213CF719A0481D0B) is failing more circuits
than usual. Most likely this means the Tor network is overloaded.
Success counts are 199/285. Use counts are 101/101. 253 circuits
completed, 0 were unusable, 54 collapsed, and 15 timed out. For
reference, your timeout cutoff is 60 seconds.

I've run a perf experiment on an HS, bombarding the service with hundreds of circuits. I never saw that "Waiting until some finish" log, but the one above was there. See https://trac.torproject.org/projects/tor/ticket/8902#comment:10 .

If the guard is having trouble keeping up with the traffic, that could explain why the HS is stalled on circuits. Though the 600-second window is a bit worrying: 10 minutes seems like a long time for the Guard to keep failing...

comment:2 Changed 2 years ago by teor

Milestone: Tor: 0.2.??? → Tor: 0.3.???

Milestone renamed

comment:3 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.??? → Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:4 Changed 2 years ago by alecmuffett

Severity: Normal

Hi All!

I am using Donncha's OnionBalance to scrape the descriptors of 72x Tor Onion Services (spread over 6x machines) for a series of massive bandwidth experiments.

I, too, am getting this message, on a separate, standalone machine/daemon:

Dec 19 12:32:09.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [1675 similar message(s) suppressed in last 600 seconds]
Dec 19 12:42:10.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [1375 similar message(s) suppressed in last 600 seconds]
Dec 19 12:52:10.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [1256 similar message(s) suppressed in last 600 seconds]

Is there a number I can bump, please?

Last edited 2 years ago by alecmuffett (previous) (diff)

comment:6 in reply to:  4 Changed 2 years ago by asn

Replying to alecmuffett:

Hi All!

I am using Donncha's OnionBalance to scrape the descriptors of 72x Tor Onion Services (spread over 6x machines) for a series of massive bandwidth experiments.

I, too, am getting this message, on a separate, standalone machine/daemon:

Dec 19 12:32:09.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [1675 similar message(s) suppressed in last 600 seconds]
Dec 19 12:42:10.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [1375 similar message(s) suppressed in last 600 seconds]
Dec 19 12:52:10.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [1256 similar message(s) suppressed in last 600 seconds]

Is there a number I can bump, please?

Here is the code:

    const int n_pending = count_pending_general_client_circuits();

    /* Do we have too many pending circuits? */
    if (n_pending >= options->MaxClientCircuitsPending) {
      static ratelim_t delay_limit = RATELIM_INIT(10*60);
      char *m;
      if ((m = rate_limit_log(&delay_limit, approx_time()))) {
        log_notice(LD_APP, "We'd like to launch a circuit to handle a "
                   "connection, but we already have %d general-purpose client "
                   "circuits pending. Waiting until some finish.%s",
                   n_pending, m);
        tor_free(m);
      }
      return 0;
    }

You can try bumping MaxClientCircuitsPending from 32 to something bigger.

However, without understanding what these pending circuits are and why they are there, it's hard to fix the root cause of this issue. Perhaps with some tactical logging we can get more information about the nature of these circuits.
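For operators who just want a knob to experiment with, the option named above is a single torrc line. This is only a sketch: 64 is an arbitrary example value, and teor's warning in comment:13 is that raising it can make network overload worse.

```
# torrc (sketch) -- MaxClientCircuitsPending defaults to 32; 64 is an
# arbitrary example value, not a recommendation. See comment:13 before
# raising this on a real service.
MaxClientCircuitsPending 64
```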

comment:7 Changed 2 years ago by asn

We should also consider whether we want to teach count_pending_general_client_circuits() to ignore CIRCUIT_STATE_GUARD_WAIT circuits as well, since post-prop271 we might have a few of those lying around and I'm not sure if we want to consider them pending.
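A minimal sketch of what this comment proposes, with made-up stand-ins for Tor's circuit list and state enum (the real function is count_pending_general_client_circuits(); everything else here is illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for Tor's circuit states; the real definitions
 * live in Tor's source and have many more members. */
enum circ_state {
  CIRC_STATE_BUILDING,
  CIRC_STATE_GUARD_WAIT,
  CIRC_STATE_OPEN
};

struct circ {
  enum circ_state state;
  int is_general_client; /* stands in for purpose == CIRCUIT_PURPOSE_C_GENERAL */
};

/* Count "pending" general-purpose client circuits, skipping both OPEN
 * circuits and, per the suggestion above, circuits parked in GUARD_WAIT. */
static int
count_pending_skip_guard_wait(const struct circ *circs, size_t n)
{
  int pending = 0;
  for (size_t i = 0; i < n; i++) {
    if (!circs[i].is_general_client)
      continue;
    if (circs[i].state == CIRC_STATE_OPEN ||
        circs[i].state == CIRC_STATE_GUARD_WAIT)
      continue; /* post-prop271: GUARD_WAIT is not really "pending" */
    pending++;
  }
  return pending;
}
```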

comment:8 Changed 2 years ago by alecmuffett

However, without understanding what these pending circuits are and why they are there, it's hard to fix the root cause

I am using OnionBalance by Donncha to fetch the descriptors for $many Tor onion sites.

He's suggested that we can batch the fetches, which would perhaps help?

Also, if CIRCUIT_STATE_GUARD_WAIT is a kind of TIME_WAIT / waiting to die, then yes I would agree. :-)

Last edited 2 years ago by alecmuffett (previous) (diff)

comment:9 Changed 21 months ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:10 Changed 21 months ago by tokotoko

Cc: fdsfgs@… added

comment:11 Changed 21 months ago by nickm

Cc: asn dgoulet added
Keywords: circuit-management scaling added

comment:12 Changed 14 months ago by cstest

Just got the same error. Using Tor v0.3.1.9

"Dec 25 19:26:18.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [218775 similar message(s) suppressed in last 600 seconds]"

The service is running maybe 150 domains, is not being DDoSed, and has maybe 50-100 users; CPU usage is up to 5% per htop (10% per ps) and decreasing. This issue appeared after a Tor service restart. Some domains that worked fine before the restart are not available (yet?).

Last edited 14 months ago by cstest (previous) (diff)

comment:13 Changed 14 months ago by teor

This message is likely due to the network being overloaded. There's not much you can do about it; we're trying to fix it on the relay side over the next few weeks.

This kind of network overload is one reason people shouldn't increase MaxClientCircuitsPending.

comment:14 Changed 13 months ago by arma

I am guessing it has to do with the number of intro circuits we're trying to make.

comment:15 Changed 13 months ago by arma

Or oh hey, what about general-purpose circuits to upload new onion descriptors? We launch 6 or 8 of those at a time, and if there are several onion services being managed by this Tor... we can get to 32 right quick?

comment:16 Changed 13 months ago by cstest

Now on 0.3.2.9, the number of suppressed "we already have 32 general-purpose client circuits" messages is about twice what it was on 0.3.1.9.

Jan 21 10:27:42.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [82275 similar message(s) suppressed in last 600 seconds]
Jan 21 10:37:42.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [215959 similar message(s) suppressed in last 600 seconds]
Jan 21 10:47:53.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [173631 similar message(s) suppressed in last 600 seconds]
Jan 21 10:53:40.000 [warn] Giving up launching first hop of circuit to rendezvous point $9844B981A80B3E4B50897098E2D65167E6AEF127~$9844B981A80B3E4B50 at 62.138.7.171 for service eb3w4t.....
Jan 21 10:53:43.000 [warn] Giving up launching first hop of circuit to rendezvous point $ECDC405E49183B2EAF579ACD42B443AEA2CF3729~$ECDC405E49183B2EAF at 185.81.109.2 for service eb3w4t.....
Jan 21 10:53:48.000 [warn] Giving up launching first hop of circuit to rendezvous point $C6ED9929EBBD3FCDFF430A1D43F5053EE8250A9B~$C6ED9929EBBD3FCDFF at 188.214.30.126 for service eb3w4t.....
Jan 21 10:57:52.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [85573 similar message(s) suppressed in last 600 seconds]
Jan 21 11:00:04.000 [warn] Hidden service g4e42twrg... exceeded launch limit with 10 intro points in the last 206 seconds. Intro circuit launches are limited to 10 per 300 seconds. [350 similar message(s) suppressed in last 300 seconds]
Jan 21 11:00:11.000 [warn] Couldn't relaunch rendezvous circuit to '$AF1D8F02C0949E9755C0DF9C6761FBBF7AAB62C2~$AF1D8F02C0949E9755 at 178.62.33.87'.
Jan 21 11:06:01.000 [notice] Your network connection speed appears to have changed. Resetting timeout to 60s after 18 timeouts and 1000 buildtimes.
Jan 21 11:07:52.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [294348 similar message(s) suppressed in last 600 seconds]
Jan 21 11:16:54.000 [warn] Requested exit point '$9AF9554365A51E6CE0804C32C4C4DC513FBFEF4D' is not known. Closing.
Jan 21 11:16:54.000 [warn] Requested exit point '$9AFAD70A59C60A0CEB63E4344E429DB0415FE29C' is not known. Closing.
Jan 21 11:16:54.000 [warn] Requested exit point '$9B2298757C56305D875F24051461A177B542A286' is not known. Closing.
Jan 21 11:16:54.000 [warn] Requested exit point '$43B89E0565B1D628DACB862F99D85B95B43AEAB8' is not known. Closing.
......
Jan 21 11:17:52.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [442355 similar message(s) suppressed in last 600 seconds]
Jan 21 11:27:52.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [348069 similar message(s) suppressed in last 600 seconds]

comment:17 Changed 13 months ago by cypherpunks

Milestone: Tor: unspecified → Tor: 0.3.3.x-final

comment:18 Changed 12 months ago by dgoulet

Replying to arma:

Or oh hey, what about general-purpose circuits to upload new onion descriptors? We launch 6 or 8 of those at a time, and if there are several onion services being managed by this Tor... we can get to 32 right quick?

Yes, that is a problem. v2 uses 6 HSDirs per service, so with 6 configured HSes you quickly blow past 32 circuits (36 uploads). v3 uses hsdir_spread_store, which is currently 4, meaning 8 HSDirs (2 replicas × 4) for every service. Configure 4 services and boom, 32 circuits are launched.
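The arithmetic above can be sanity-checked trivially; the per-service HSDir counts come from the comment (v2: 6; v3: 2 replicas × hsdir_spread_store=4 = 8), while the function name is ours:

```c
#include <assert.h>

/* Upload circuits launched at service startup: one per responsible
 * HSDir, for every configured service. */
static int
startup_upload_circuits(int n_services, int hsdirs_per_service)
{
  return n_services * hsdirs_per_service;
}
```

Since the check in circuituse.c fires when n_pending >= MaxClientCircuitsPending (32), both configurations trip it.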

But bumping MaxClientCircuitsPending is not really a good idea just for services.

The thing is that once the services have bootstrapped (that is, the descriptor is uploaded), they re-upload at randomized times, spread out from one another. But that one time at startup, we need the services to upload en masse, since tor tries to make each service reachable as fast as possible.

So could we either:

1) Allow a burst at service startup if you have num_services * num_hsdirs > MaxClientCircuitsPending. I say service startup because one could do 10 ADD_ONION at once ;).

2) Have a special limit just for HS like MaxHSCircuitsPending and bump it to something bigger than 32.

3) Leave everything as is: after a while, once tor is able to launch circuits again, the descriptors will get uploaded. The operator just needs to deal with the delay.

4) <insert idea>

comment:19 Changed 12 months ago by dgoulet

Cc: dgoulet removed
Status: new → needs_information

comment:20 in reply to:  18 ; Changed 12 months ago by asn

Replying to dgoulet:

Replying to arma:

Or oh hey, what about general-purpose circuits to upload new onion descriptors? We launch 6 or 8 of those at a time, and if there are several onion services being managed by this Tor... we can get to 32 right quick?

Yes, that is a problem. v2 uses 6 HSDirs per service, so with 6 configured HSes you quickly blow past 32 circuits (36 uploads). v3 uses hsdir_spread_store, which is currently 4, meaning 8 HSDirs (2 replicas × 4) for every service. Configure 4 services and boom, 32 circuits are launched.

But bumping MaxClientCircuitsPending is not really a good idea just for services.

The thing is that once the services have bootstrapped (that is, the descriptor is uploaded), they re-upload at randomized times, spread out from one another. But that one time at startup, we need the services to upload en masse, since tor tries to make each service reachable as fast as possible.

So could we either:

1) Allow a burst at service startup if you have num_services * num_hsdirs > MaxClientCircuitsPending. I say service startup because one could do 10 ADD_ONION at once ;).

2) Have a special limit just for HS like MaxHSCircuitsPending and bump it to something bigger than 32.

3) Leave everything as is: after a while, once tor is able to launch circuits again, the descriptors will get uploaded. The operator just needs to deal with the delay.

4) <insert idea>

I think what I would prefer here is for Tor to rate-limit itself when building onion service circuits. Especially so when it has multiple onion services, but maybe even when it has only a single one. So instead of building all its onion circuits (IPs + hsdir circs) at once, it waits a randomized time (around a second?) before building each one.

That will slightly delay the bootup of HSes, but not by too much, and it's better for the health of the network. Not sure if this will be a PITA to engineer tho. I'm not sure if this is isomorphic to your (3) idea above, but if it is, then the warning message is not useful, since the wait is intended.
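The staggered-launch idea above could look something like the sketch below. The one-second spacing is asn's "around a second" and is an assumed value, not anything Tor actually does:

```c
#include <assert.h>
#include <stdlib.h>

/* "Around a second" per the comment above -- an assumed value. */
#define LAUNCH_SPACING_MSEC 1000L

/* Sketch: rather than launching all IP + HSDir circuits at once, give
 * the i-th circuit a staggered, jittered launch time so at most one
 * launch falls in each one-second slot. */
static long
scheduled_launch_msec(long now_msec, int circuit_index)
{
  long jitter = rand() % LAUNCH_SPACING_MSEC;
  return now_msec + circuit_index * LAUNCH_SPACING_MSEC + jitter;
}
```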

Last edited 12 months ago by asn (previous) (diff)

comment:21 in reply to:  20 Changed 12 months ago by dgoulet

Replying to asn:

I think what I would prefer here is for Tor to rate-limit itself when building onion service circuits. Especially so when it has multiple onion services, but maybe even when it has only a single one. So instead of building all its onion circuits (IPs + hsdir circs) at once, it waits a randomized time (around a second?) before building each one.

The problem with adding a random delay at startup is that it won't solve the "32 general-purpose circuits are pending" issue. If those circuits are really stuck being built, the delay won't help much, as they will all end up queued and stuck at some point.

A sensible rate limit is probably what we want, so that we never go above that 32 limit and thus never need a cryptic warning that leaves you nothing to do but wait and/or panic.

Now, looking a bit more closely at the logs above, notice:

Jan 21 10:53:40.000 [warn] Giving up launching first hop of circuit to rendezvous point $9844B981A80B3E4B50897098E2D65167E6AEF127~$9844B981A80B3E4B50 at 62.138.7.171 for service eb3w4t.....

The above is a service trying to open a circuit to a rendezvous point... So I think the bigger issue here is that we have 32 circuits stuck in a non-OPEN state that just never expire for some reason? Or they do expire, but we open 32 new ones very quickly and they get stalled again in a non-OPEN state.

My money is on the latter, due to the *amount* of suppressed logs (see below). This looks to me like a service getting a ridiculous number of rendezvous requests; the Guard is choking, so we keep hitting that 32 limit.

Jan 21 10:37:42.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 32 general-purpose client circuits pending. Waiting until some finish. [215959 similar message(s) suppressed in last 600 seconds]

From a quick skim, I don't see anything in circuit_expire_building() that would make a circuit in the GUARD_WAIT state be ignored, so in theory they should expire even while waiting for the guard to become usable?

Indeed, I'm getting more and more convinced that we need a rate limit on both the client and service side, a bit like the DoS mitigation we have now (#24902): some per-second rate with a burst allowance. A busy hidden service will suffer in reachability, but at least it won't break the network. The point is that the DoS mitigation prevents, as much as possible, a client DDoS towards a single service, and the service by itself avoids DDoSing the network.
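The per-second-rate-with-burst scheme described here is a classic token bucket. A toy sketch under assumed parameters (the real #24902 defenses have their own rates and consensus parameters):

```c
#include <assert.h>

/* Toy token bucket: refill `rate` tokens per second up to `burst`,
 * spend one token per circuit launch. All numbers are illustrative. */
struct bucket {
  long tokens;      /* currently available tokens */
  long rate;        /* tokens added per second */
  long burst;       /* maximum tokens */
  long last_refill; /* unix seconds of last refill */
};

static void
bucket_refill(struct bucket *b, long now)
{
  long elapsed = now - b->last_refill;
  if (elapsed <= 0)
    return;
  b->tokens += elapsed * b->rate;
  if (b->tokens > b->burst)
    b->tokens = b->burst;
  b->last_refill = now;
}

/* Returns 1 if a circuit launch is allowed now, 0 if it should be
 * deferred (rather than launched and left pending). */
static int
bucket_allow_launch(struct bucket *b, long now)
{
  bucket_refill(b, now);
  if (b->tokens <= 0)
    return 0;
  b->tokens--;
  return 1;
}
```

With a rate limit like this applied before circuit_launch(), Tor would defer launches instead of piling up pending circuits and emitting the warning.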

comment:22 Changed 11 months ago by nickm

Keywords: 033-triage-20180320 added

Marking all tickets reached by current round of 033 triage.

comment:23 Changed 11 months ago by nickm

Keywords: 033-removed-20180320 added

Mark all not-already-included tickets as pending review for removal from 0.3.3 milestone.

comment:24 Changed 11 months ago by asn

Keywords: 033-removed-20180403 added

comment:25 Changed 10 months ago by dgoulet

Milestone: Tor: 0.3.3.x-final → Tor: unspecified

This is not actionable for 0.3.3 because it seems the conclusion above is to rate-limit HS circuit creation instead of raising the limit of 32.
