Opened 4 months ago

Last modified 3 months ago

#27066 needs_information defect

circuit_build_times_update_alpha(): Bug: Could not determine largest build time

Reported by: cstest Owned by:
Priority: Medium Milestone: Tor: 0.3.5.x-final
Component: Core Tor/Tor Version: Tor: 0.3.3.9
Severity: Normal Keywords: 034-backport 033-backport
Cc: mikeperry Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

This bug occurs only when v3 hidden services are activated (4 hours after activation v3 services). This was my second test of v3 services and this issue repeated again. Server was running aprox.150 v3 domains.

After below warnings, all hidden services stoped working.

Could be related to #25733.

Aug 07 04:45:49.000 [warn] circuit_build_times_update_alpha(): Bug: Could not determine largest build time (0). Xm is 7525ms and we've abandoned 996 out of 1000 circuits. (on Tor 0.3.3.9 )
Aug 07 04:45:49.000 [warn] circuit_build_times_update_alpha(): Bug: Could not determine largest build time (0). Xm is 7525ms and we've abandoned 997 out of 1000 circuits. (on Tor 0.3.3.9 )
Aug 07 04:45:49.000 [warn] circuit_build_times_update_alpha(): Bug: Could not determine largest build time (0). Xm is 7525ms and we've abandoned 998 out of 1000 circuits. (on Tor 0.3.3.9 )
Aug 07 04:45:49.000 [warn] circuit_build_times_update_alpha(): Bug: Could not determine largest build time (0). Xm is 7525ms and we've abandoned 999 out of 1000 circuits. (on Tor 0.3.3.9 )
Aug 07 04:45:49.000 [warn] No valid circuit build time data out of 1000 times, 3 modes, have_timeout=1, 7521.000000ms
Aug 07 04:45:49.000 [warn] No valid circuit build time data out of 1000 times, 3 modes, have_timeout=1, 7521.000000ms
Aug 07 04:45:49.000 [warn] No valid circuit build time data out of 1000 times, 3 modes, have_timeout=1, 7521.000000ms
Aug 07 04:45:49.000 [warn] No valid circuit build time data out of 1000 times, 3 modes, have_timeout=1, 7521.000000ms
Aug 07 04:45:49.000 [warn] No valid circuit build time data out of 1000 times, 3 modes, have_timeout=1, 7521.000000ms
Aug 07 04:45:49.000 [warn] No valid circuit build time data out of 1000 times, 3 modes, have_timeout=1, 7521.000000ms
Aug 07 04:45:49.000 [warn] No valid circuit build time data out of 1000 times, 3 modes, have_timeout=1, 7521.000000ms
Aug 07 04:45:49.000 [warn] No valid circuit build time data out of 1000 times, 3 modes, have_timeout=1, 7521.000000ms
Aug 07 04:45:49.000 [warn] No valid circuit build time data out of 1000 times, 3 modes, have_timeout=1, 7521.000000ms
Aug 07 04:45:49.000 [warn] No valid circuit build time data out of 1000 times, 3 modes, have_timeout=1, 7521.000000ms
Aug 07 04:45:49.000 [warn] No valid circuit build time data out of 1000 times, 3 modes, have_timeout=1, 7521.000000ms
...........

Child Tickets

Change History (12)

comment:1 Changed 4 months ago by nickm

Cc: mikeperry added
Keywords: 034-backport 033-backport added
Milestone: Tor: 0.3.5.x-final

comment:2 Changed 4 months ago by mikeperry

Are there any other logs present? For example, "No circuits are opened" by any chance (#27049)? That is my best guess right now -- if every circuit is getting marked "relaxed" near that logline, no timeouts are getting stored properly.

That would explain the resulting CBT mess, but still does not explain why v3 services stop working, so there must be a second piece to this after CBT fails.

Is this Tor instance only running v3 services, or v2 services also?

comment:3 Changed 4 months ago by cstest

There were aprox. 4000 lines of "No valid circuit build time data out of 1000 times, 3 modes".

Every couple of mins script was sending hup signal to tor and in log file there are just 3 lines repeating:

Received reload signal (hup). Reloading config and resetting internal state.
Read configuration file ...........
Tor 0.3.3.9 (git-ca1a436fa8e53a32) opening log file.

And sometime this appears in random order:

Your network connection speed appears to have changed. Resetting timeout to 60s after 18 timeouts and 143 buildtimes.

but above line could be related v2 hidden service running because that line is regular in log file when v2 domains are active. Then almost 5 hours later new lines began to appear (hundreds probably due to large number of v3 domains in torrc):

Failed to launch rendezvous circuit to $835BDF974D1B6A324C15C48A95D49605AAAB51C1~ at ....
Giving up on launching a rendezvous circuit to $D405FCCF06ADEDF898DF2F29C9348DCB623031BA~ at 5.45.111.149 for hidden service v3onion....................
Giving up launching first hop of circuit to rendezvous point $A379AD9020BB119B53785888DACC63C7...........

and occasionaly

Extremely large value for circuit build timeout: 123s. Assuming clock jump. Purpose 14 (Measuring circuit timeout)
Couldn't relaunch rendezvous circuit to '$0299F268FCA9D55CD157976D39AE92B4B.........

Server was running only couple of v2 domains and they worked fine and there were also over 100 v3 domains which did not work at all. After restarting tor instance, v3 domains started to work fine (nothing else have been changed). All v3 domains were generated (added) by tor instance reload, not restart.

Have not found any "No circuits are opened" lines. At start, v3 domains worked fine for several hours.

Aug 07 00:02:15.000 [notice] Our hidden services received 41808 v2 and 18 v3 INTRODUCE2 cells and attempted to launch 42259 rendezvous circuits.
........
Aug 07 06:02:15.000 [notice] Our hidden services received 42361 v2 and 1813 v3 INTRODUCE2 cells and attempted to launch 45315 rendezvous circuits.
........
Aug 07 12:02:15.000 [notice] Our hidden services received 42549 v2 and 4072 v3 INTRODUCE2 cells and attempted to launch 48084 rendezvous circuits.
........
Aug 07 18:02:16.000 [notice] Our hidden services received 43651 v2 and 13413 v3 INTRODUCE2 cells and attempted to launch 60870 rendezvous circuits.
Last edited 4 months ago by cstest (previous) (diff)

comment:4 Changed 4 months ago by cstest

Last edited 4 months ago by cstest (previous) (diff)

comment:5 in reply to:  3 ; Changed 4 months ago by mikeperry

Replying to cstest:

And sometime this appears in random order:

Your network connection speed appears to have changed. Resetting timeout to 60s after 18 timeouts and 143 buildtimes.

This line is not consistent with the lines about "No valid circuit build time data". Unless you perhaps see several "Your network connection speed appears to have changed", and then they stop, and *then* you see the "No valid circuit build time data"?

Whatever you are doing differently for this to appear vs not appear (if anything) would be useful info.

but above line could be related v2 hidden service running because that line is regular in log file when v2 domains are active. Then almost 5 hours later new lines began to appear (hundreds probably due to large number of v3 domains in torrc):

Failed to launch rendezvous circuit to $835BDF974D1B6A324C15C48A95D49605AAAB51C1~ at ....
Giving up on launching a rendezvous circuit to $D405FCCF06ADEDF898DF2F29C9348DCB623031BA~ at 5.45.111.149 for hidden service v3onion....................
Giving up launching first hop of circuit to rendezvous point $A379AD9020BB119B53785888DACC63C7...........

These all indicate that you can't actually build paths, which is different than attempting circuits. This could be caused by the use of ExcludeNodes, HSLayer2/3, vanguards, or anything else that restricts paths. Are you using any options like that?

If not, it could be a result of n_circuit_failures being high (which could be caused by your timeout issues). But I don't know why your timeout failures are sometimes causing you to reset your timeout (the expected behavior) vs sometimes causing invalid data.

and occasionaly

Extremely large value for circuit build timeout: 123s. Assuming clock jump. Purpose 14 (Measuring circuit timeout)
Couldn't relaunch rendezvous circuit to '$0299F268FCA9D55CD157976D39AE92B4B.........

Now this is even weirder.. It means libevent didn't manage to call its seconds_elapsed_callback() for like 63 seconds, at least. Though I think a very large oneshot NTP update could do this.

Server was running only couple of v2 domains and they worked fine and there were also over 100 v3 domains which did not work at all. After restarting tor instance, v3 domains started to work fine (nothing else have been changed). All v3 domains were generated (added) by tor instance reload, not restart.

Does the issue only happen when the v2 domains are disabled/inactive?

Do the v3 domains have to be in use for this to happen, or can I reproduce by just making 100 of them on a random tor client and letting them sit mostly idle?

comment:6 in reply to:  5 Changed 4 months ago by cstest

Server was using v2 domains and because some attacker on tor level might be doing some tor wide non effective DDoS attack or trying to deanonymize users, during last couple of months I am finding lots of below log lines:

Hidden service ............. exceeded launch limit with 10 intro points in the last 11 seconds. Intro circuit launches are limited to 10 per 300 seconds. [461 similar message(s) suppressed in last 300 seconds]

That's why I have tried v3 services because v3 domains are not listed publicly.

In my torrc there are no ExcludeNodes or anything else that could have impact on path. Anyway, v2 domains were and are working fine. No configuration changes, just removed "HiddenServiceVersion 3" when generating new domains.

Almost all those weird log lines started to appear after CBT warning, except above mentioned "exceeded launch limit with 10 intro points" and "Your network connection speed appears to have changed." which is related to probable v2 attack as I have said above.

This issue happens only when v3 domains are activated, together with couple of v2 domains. There are no problems when only v2 domains are active. And it does not happen immediately but several hours after activating v3 domains. To repeat this it is not required only to generate 100 v3 domains but users must keep trying to access them.

As you can see in

Aug 07 00:02:15.000 [notice] Our hidden services received 41808 v2 and 18 v3 INTRODUCE2 cells and attempted to launch 42259 rendezvous circuits.
........
Aug 07 06:02:15.000 [notice] Our hidden services received 42361 v2 and 1813 v3 INTRODUCE2 cells and attempted to launch 45315 rendezvous circuits.
........
Aug 07 12:02:15.000 [notice] Our hidden services received 42549 v2 and 4072 v3 INTRODUCE2 cells and attempted to launch 48084 rendezvous circuits.
........
Aug 07 18:02:16.000 [notice] Our hidden services received 43651 v2 and 13413 v3 INTRODUCE2 cells and attempted to launch 60870 rendezvous circuits.

number of new v2 introduce cells was significantly low because during that period I have had only couple of v2 domains and number of v3 introduce cells increased even tho not a single user could connect through v3 domain after CBT issue. CBT issue appeared on "Aug 07 04:45:49.000".

System clock was and is correct.

Last edited 4 months ago by cstest (previous) (diff)

comment:7 Changed 4 months ago by mikeperry

Ok based on the above, my suspicions are accumulating around some kind of interaction between timed out circuits and the circuit failure limit, which would explain why your services stop working. I do not know where this CBT error is coming from though. #27049 is one way that it can happen, but if there are no loglines related to that, then something else must be happening.

I also do not know why this limit is only happening when v3 services are enabled. Perhaps there is some v3-specific code that counts failures differently than v2.

Are you able to run branches with additional logging and/or potential fixes? I have been unable to reproduce this on a v3-only service that is under continual use via synthetic wget activity.

Last edited 4 months ago by mikeperry (previous) (diff)

comment:8 Changed 4 months ago by mikeperry

After digging a bit more, I have to admit I'm still at a loss here as to the root cause. I have one last question: Did you see any log lines with "Tor has not observed any network activity"?

In the meantime, I can do the following things in a new branch if you can run it in an environment that triggers the bug:

  1. Ensures that when that "Bug: Could not determine largest build times" line happens, we reset back to the default timeout and clear our timeout history.
  2. Emits log lines when Tor has decided not to build circuits.

The first one should address your symptoms and allow v3 circuits to continue (I think), but the root cause of however we're mishandling v3 circuit timeout record keeping will still be there.

By itself, that first one may be something we want to backport anyway, since who knows what other deep mysteries surround CBT record keeping, or might in the future.

comment:9 Changed 4 months ago by cstest

Prior to CBT bug line here are warning lines that have happened after enabling v3 domains.

Aug 07 00:02:15.000 [notice] Our hidden services received 41808 v2 and 18 v3 INTRODUCE2 cells and attempted to launch 42259 rendezvous circuits.
(this is approx time when I have enabled v3 domains and for several hours both v3 and v2 domains were working together)


Aug 07 03:01:23.000 [warn] Error launching circuit to node $E0EAA536856FBD3C3AB46C2BAA398E7CDFDAFF5D~as44194a36s01 at 77.87.50.6 for service v2domain...
Aug 07 03:01:23.000 [warn] Error launching circuit to node $B247BA9E0AEA93E6D7BF4080CFBB964034AF2B28~kitten2 at 163.172.218.10 for service v2domain...
Aug 07 03:01:23.000 [warn] Error launching circuit to node $C793A33F28826D1F184504312C668EA8157AEF28~Unnamed at 217.182.95.221 for service v2domain...
Aug 07 03:01:23.000 [warn] Error launching circuit to node $71DA45C2AF1551F32CA83E2AD2D5ADA684505391~Unnamed at 34.241.138.212 for service v2domain...
Aug 07 03:01:23.000 [warn] Error launching circuit to node $F3FA0D45D35E987484848345F8D92AB1D883889B~Hermes at 86.29.190.220 for service v2domain...
Aug 07 03:01:23.000 [warn] Error launching circuit to node $4204C1B166CCE784CF38B02E4A0C94640A2BECA2~TheEpTic2 at 51.15.51.36 for service v2domain...
Aug 07 03:01:23.000 [warn] Error launching circuit to node $DB510E2AA40A462AD77DB83617F6F340E234D952~Unnamed at 206.189.24.66 for service v2domain...
Aug 07 03:01:23.000 [warn] Error launching circuit to node $74D6D1E8BF5159066768F48C1EA4951C9C97BEF9~lara at 148.251.42.164 for service v2domain...
(above lines are just for one v2 domain but they have been repeated for other v2 domains too)


Aug 07 03:01:41.000 [notice] Your network connection speed appears to have changed. Resetting timeout to 60s after 18 timeouts and 409 buildtimes.
Aug 07 03:01:43.000 [warn] Failed to find node for hop #1 of our path. Discarding this circuit.
Aug 07 03:01:44.000 [notice] Our circuit 0 (id: 641032) died due to an invalid selected path, purpose Hidden service: Establishing introduction point. This may be a torrc configuration issue, or a bug.
(all lines in torrc are simple except one line "MaxClientCircuitsPending 256")


Aug 07 03:12:39.000 [notice] Your Guard loki ($5A6451D4E4B4FFDE0B2682D8D8DAA0D10A500066) is failing more circuits than usual. Most likely this means the Tor network is overloaded. Success counts are 184/267. Use counts are 132/132. 184 circuits completed, 0 were unusable, 0 collapsed, and 2 timed out. For reference, your timeout cutoff is 463 seconds.
Aug 07 03:13:43.000 [warn] Giving up launching first hop of circuit to rendezvous point $5FA794BF301AF35E4AED185E939CBA9D2CFC2360~$5FA794BF301AF35E4A at 192.42.132.106 for service v2....
Aug 07 03:13:43.000 [warn] Giving up launching first hop of circuit to rendezvous point $7961C9991F022C8A363FD440CA395D47DB5D44D5~$7961C9991F022C8A36 at 51.254.35.151 for service v2..
Aug 07 03:13:43.000 [warn] Giving up on launching a rendezvous circuit to $DBAD17D706E2B6D5D917C2077961750513BDF879~ at 109.236.90.209 for hidden service v3....
Aug 07 03:13:43.000 [warn] Giving up on launching a rendezvous circuit to $C13D13DE33C2DAB4AF849014ECAB962A328D6806~ at 85.10.216.118 for hidden service v3....
Aug 07 03:13:44.000 [warn] Giving up on launching a rendezvous circuit to $372082F3E01DE6A6333D30329C6903A19D0E8E87~ at 176.193.113.234 for hidden service v3....
("giving up" lines were repeating aprox 50 times for the same 4-5 v3 domains and 2 v2 domains)


Aug 07 03:15:31.000 [warn] Failed to find node for hop #1 of our path. Discarding this circuit.
Aug 07 03:15:37.000 [notice] Extremely large value for circuit build timeout: 157s. Assuming clock jump. Purpose 14 (Measuring circuit timeout)
Aug 07 03:15:37.000 [notice] Extremely large value for circuit build timeout: 147s. Assuming clock jump. Purpose 14 (Measuring circuit timeout)
Aug 07 03:15:37.000 [notice] Extremely large value for circuit build timeout: 157s. Assuming clock jump. Purpose 14 (Measuring circuit timeout)
Aug 07 03:15:37.000 [notice] Extremely large value for circuit build timeout: 154s. Assuming clock jump. Purpose 14 (Measuring circuit timeout)
(105 lines of "Extremely large value for circuit build timeout")
Aug 07 03:15:37.000 [notice] Your network connection speed appears to have changed. Resetting timeout to 60s after 18 timeouts and 1000 buildtimes.


Aug 07 04:45:49.000 [warn] circuit_build_times_update_alpha(): Bug: Could not determine largest build time (0). Xm is 7525ms and we've abandoned 996 out of 1000 circuits. (on Tor 0.3.3.9 )

There are no log lines as in #27049. No "Tor has not observed any network activity" lines.

I do not know what triggers this bug so I can't repeat it but this happened two times.

Could that "MaxClientCircuitsPending 256" in torrc have any influence on this?

Last edited 4 months ago by cstest (previous) (diff)

comment:10 Changed 4 months ago by mikeperry

Ok I have a new theory, writing it here for the record:

  1. Circuit build times are being recorded while the default high timeout value (and high close timeout) is used. Lots of circuits take a while to complete, but successfully do so.
  2. Eventually, a low timeout value is learned.
  3. A SIGHUP arrives, which causes CBT to reload from the state file. While doing this, old compatibility code activates, and converts many/most of your previous timeout values to "abandoned" values.
  4. Finally, enough circuits time out again such that everything in the array is "abandonded".

This set of events would be consistent with your loglines, especially if you have no lines from #27049.

This still does not explain why this is v3 only, though.

Is the SIGHUP that you are sending only being done when v3 is enabled? Can you not SIGHUP Tor and see if the issue persists, or is there some reason you need to HUP it?

If for some reason you must SIGHUP Tor, I have a branch that disables this compatibility code upon SIGHUP, as well as resets CBT when these warns happen. The CBT reset *should* allow things to continue to work in that case (unless the connectivity/hang bug is not related to CBT at all):

https://github.com/mikeperry-tor/tor/tree/bug27066-cbtreset
https://github.com/mikeperry-tor/tor/archive/bug27066-cbtreset.zip

comment:11 Changed 4 months ago by mikeperry

Status: newneeds_information

I need to know if not sending regular SIGHUPs stops this bug from happening. If it does, then that branch is a fix. Otherwise it is just a stopgap.

comment:12 Changed 3 months ago by cstest

SIGHUPs are sent always due to required new domain generations.

I have just reported Tor bug causing v3 service to stop working.
https://trac.torproject.org/projects/tor/ticket/27410

That Tor instance (server running 3 Tor instances) is also using SIGUPs but I will disable it.

Note: See TracTickets for help on using tickets.