Opened 7 years ago

Closed 2 years ago

#3825 closed defect (user disappeared)

HS intro points overloaded with CREATE cells cause connectivity failures

Reported by:  atoruser           Owned by:       rransom
Priority:     High               Milestone:      Tor: unspecified
Component:    Core Tor/Tor       Version:        Tor: unspecified
Severity:     Normal             Keywords:       tor-hs
Cc:                              Actual Points:
Parent ID:                       Points:
Reviewer:                        Sponsor:        None

Description

I run several onion sites on various different dedicated servers.
These hidden services are all using Tor 0.2.1.30 (which I believe is the latest stable release at the time of writing).
They have been using this tor version for many months without issue.

Recently myself and others have had spotty results trying to access these services.
It can take me 10-20 attempts, always getting the error "[Notice] Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)".
If I watch the Vidalia Tor network map while it's trying to connect, it's building and closing circuits like crazy.
At the time I receive these errors, I am directly connected (no Tor) to the hidden service server via SSH and I can see that it is indeed up and running, and many people are accessing the site through Tor, but it's not working for me.

Usually restarting the Tor process on the client makes it instantly work again, but other times it just won't work unless I restart Tor on the hidden service server.

So this appears to be a problem that develops over time, with a long-running Tor process.

These hidden services are all pretty well known and popular, so it's possible this is some kind of denial-of-service attack through the Tor protocol, but I suspect it may be a bug in one of the recent Tor beta versions that many relays are running, since it's only been in the last 2-3 weeks that I've noticed this problem.

I am not the only one to have this problem; many of my site visitors are reporting the site being "down" during periods I know it's up, and I can see from the access logs that people were accessing it, so it wasn't down for everyone.

Sorry for the vagueness of this report; if there is something I can do to get further details, please tell me.

Thanks.

Child Tickets

Change History (41)

comment:1 in reply to:  description Changed 7 years ago by rransom

Status: new → needs_information

Replying to atoruser:

I run several onion sites on various different dedicated servers.
These hidden services are all using Tor 0.2.1.30 (which I believe is the latest stable release at the time of writing).
They have been using this tor version for many months without issue.

Recently myself and others have had spotty results trying to access these services.
It can take me 10-20 attempts, always getting the error "[Notice] Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)".
If I watch the Vidalia Tor network map while it's trying to connect, it's building and closing circuits like crazy.
At the time I receive these errors, I am directly connected (no Tor) to the hidden service server via SSH and I can see that it is indeed up and running, and many people are accessing the site through Tor, but it's not working for me.

This sounds like the client-side part of #1297, which seems to have been fixed. Are you using a version of Tor 0.2.2.x before 0.2.2.28-beta? If so, upgrade to 0.2.2.31-rc or 0.2.2.32.

comment:2 Changed 7 years ago by atoruser

The hidden service server only uses 0.2.1.30.
For the client I have tried 0.2.1.30 and 0.2.2.30-rc and both have this problem.

I will try 0.2.2.32, but it's strange that I have been using 0.2.1.30 for months (since its release) and this problem has only happened recently.
Well, to be specific, it has always occurred once in a while, but recently it's almost every time; it's become unbearable.

comment:3 Changed 6 years ago by rransom

Owner: set to rransom
Priority: normal → major
Status: needs_information → assigned

Upgrading the hidden service server to 0.2.2.32 may help.

comment:4 Changed 6 years ago by atoruser

Even after upgrading the hidden service and the client to 0.2.2.32 this problem continues to happen.

Here is the log of my application trying to connect to the hidden service which I know is up.

Sep 11 08:31:05.805 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 08:34:41.946 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 08:35:01.956 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 08:35:50.986 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 08:35:58.991 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 08:39:01.096 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 08:43:01.233 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 08:44:29.284 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 08:44:43.293 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 08:49:01.440 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 08:49:25.454 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 08:49:55.472 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 09:07:01.094 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 09:25:01.919 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 09:25:46.337 [notice] Based on 1000 circuit times, it looks like we don't need to wait so long for circuits to finish. We will now assume a circuit is too slow to use after waiting 10 seconds.
Sep 11 09:26:22.987 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 09:27:01.373 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 09:43:20.397 [notice] Based on 1000 circuit times, it looks like we need to wait longer for circuits to finish. We will now assume a circuit is too slow to use after waiting 11 seconds.
Sep 11 09:45:02.015 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 10:03:02.690 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 10:03:37.708 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 10:05:02.764 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)

At 10:06 I restarted the tor process and then it connected fine.

comment:5 in reply to:  4 Changed 6 years ago by rransom

Status: assigned → needs_review

Replying to atoruser:

Even after upgrading the hidden service and the client to 0.2.2.32 this problem continues to happen.

We already knew that Tor clients would have trouble connecting to hidden services which can't connect to the client's rendezvous point before the client times out (see #1297). I suggested upgrading the hidden service server to Tor 0.2.2.32 in the hope that, if that had been the problem, the adaptive circuit-build-timeout code on the 0.2.2.x branch would allow your hidden service to reach more of its clients more quickly. But it looks like that wasn't your problem at all.

Here is the log of my application trying to connect to the hidden service which I know is up.

Sep 11 08:31:05.805 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)

(13 lines deleted)

Sep 11 09:25:46.337 [notice] Based on 1000 circuit times, it looks like we don't need to wait so long for circuits to finish. We will now assume a circuit is too slow to use after waiting 10 seconds.
Sep 11 09:26:22.987 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 09:27:01.373 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Sep 11 09:43:20.397 [notice] Based on 1000 circuit times, it looks like we need to wait longer for circuits to finish. We will now assume a circuit is too slow to use after waiting 11 seconds.
Sep 11 09:45:02.015 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)

(3 lines deleted)

At 10:06 I restarted the tor process and then it connected fine.

It looks like a relay serving as one of your hidden service's introduction points when your client first fetched the HS descriptor at around 08:31 became unreachable. Tor clients currently do not detect that they are failing to use an intro point unless the failure occurs after the client has finished building a circuit to it; if they cannot build a circuit to the intro point at all, they will never give up on using that intro point to connect to the hidden service. See bug3825a-v4 ( https://git.torproject.org/rransom/tor.git bug3825a-v4 ) for a fix for this client-side bug.
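Roughly, the client-side idea on that branch is something like the following (a simplified, standalone sketch of the policy, not the branch itself; the struct, enum, and function names here are hypothetical):

  /* Simplified sketch: drop an intro point from the client's view of the
   * descriptor when an intro circuit to it fails for any reason other than
   * the client's own circuit-build timeout; timeouts stay retryable. */
  #include <stdio.h>

  enum intro_failure { FAIL_TIMEOUT, FAIL_DESTROYED, FAIL_UNREACHABLE };

  typedef struct {
    const char *nickname;
    int unreachable;            /* 1 = stop using this intro point */
  } client_intro_point_t;

  static void
  note_intro_circ_failure(client_intro_point_t *ip, enum intro_failure why)
  {
    if (why == FAIL_TIMEOUT) {
      /* Our own circuit-build timeout is treated as transient. */
      printf("intro point %s timed out; keep it for a later attempt\n",
             ip->nickname);
      return;
    }
    /* Anything else suggests the relay is overloaded or gone. */
    ip->unreachable = 1;
    printf("giving up on intro point %s\n", ip->nickname);
  }

  int
  main(void)
  {
    client_intro_point_t a = { "relayA", 0 }, b = { "relayB", 0 };
    note_intro_circ_failure(&a, FAIL_TIMEOUT);   /* kept */
    note_intro_circ_failure(&b, FAIL_DESTROYED); /* dropped */
    return 0;
  }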

comment:6 Changed 6 years ago by rransom

bug3825a-v4 needs to be tested on 0.2.3.x before it can be merged to 0.2.2.x, but I expect 0.2.3.x to stay broken enough in other ways that we will want to apply this to 0.2.2.x after a month or so of testing.

comment:7 Changed 6 years ago by nickm

Hm. So walk me through the new timeout logic? If an intro point always times out, we could try it every single time that we get a request to connect to a hidden service, and never conclude that it's broken. Is that right?

The rest of this looks right to me.

comment:8 in reply to:  7 Changed 6 years ago by rransom

Milestone: Tor: 0.2.2.x-final

Replying to nickm:

Hm. So walk me through the new timeout logic? If an intro point always times out, we could try it every single time that we get a request to connect to a hidden service, and never conclude that it's broken. Is that right?

Yes. I assume that timeouts are almost always transient failures, usually on the client side, and that the client's circuit-build timeout is short enough that if every intro point really is broken, the client will try all of them and either successfully connect to the service or fail and fetch a new descriptor for it.

If you don't approve of that timeout behaviour, there are at least two other options (not mutually exclusive) for handling introduction circuit timeouts:

  • Count the number of times we have timed out while trying to reach an intro point, and remove the intro point after three timeouts while trying to reach it. This change would fit right in on this branch, but I don't expect it to help clients. (A minimal sketch of this option follows this list.)
  • Allow an introduction circuit to continue trying to reach and use an intro point even if the circuit times out. This change would belong on the #1297 branch, because it involves mucking with circuit purposes, and so will the other remaining changes for #1297.
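A minimal standalone sketch of the first option (hypothetical struct and threshold, not actual Tor code):

  #include <stdbool.h>
  #include <stdio.h>

  #define MAX_INTRO_TIMEOUTS 3

  typedef struct {
    const char *nickname;
    unsigned timeout_count;
    bool unreachable;
  } client_intro_point_t;

  /* Record one circuit-build timeout for this intro point; returns true
   * once the intro point has timed out often enough to be removed. */
  static bool
  intro_point_note_timeout(client_intro_point_t *ip)
  {
    if (++ip->timeout_count >= MAX_INTRO_TIMEOUTS)
      ip->unreachable = true;
    return ip->unreachable;
  }

  int
  main(void)
  {
    client_intro_point_t ip = { "relayC", 0, false };
    for (int i = 0; i < 4; i++) {
      if (intro_point_note_timeout(&ip)) {
        printf("removing %s after %u timeouts\n",
               ip.nickname, ip.timeout_count);
        break;
      }
    }
    return 0;
  }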

The rest of this looks right to me.

It wasn't. rend_client_refetch_v2_renddesc assumed that any HS descriptor a client had in its cache was still usable, and this branch invalidated that assumption. (I've checked the other callers of rend_cache_lookup_entry now, and they didn't make that assumption.)

See hs-fixes-2011-09-29-01 ( https://git.torproject.org/rransom/tor.git hs-fixes-2011-09-29-01 ) for a fixed #3825 and #3335 branch; see bug3825a-v5 ( https://git.torproject.org/rransom/tor.git bug3825a-v5 ) for a #3825-only branch with ‘Refetch an HS's desc if we don't have a usable one’ rebased to the beginning of the branch.

I still think we will need to apply this branch to 0.2.2.x after it has been tested for a while on 0.2.3.x -- currently, when one of atoruser's hidden services attracts too many circuits to one of its intro points, its would-be clients DoS that intro-point relay with a flood of EXTEND cells until their users either give up on connecting to the HS or flush their Tor clients' HS descriptor cache.

comment:9 Changed 6 years ago by nickm

Merged to master. Please double-check the merge commit and the follow-up "make it compile again" commits.

comment:10 in reply to:  9 Changed 6 years ago by rransom

Replying to nickm:

Merged to master. Please double-check the merge commit and the follow-up "make it compile again" commits.

Looks good. Thanks!

comment:11 Changed 6 years ago by rransom

See also #4212 (which became an actual problem due to the changes made to fix this bug).

comment:12 Changed 6 years ago by rransom

Replying to rransom:

Oct 09 19:13:41.000 [info] circuit_expire_building(): Abandoning circ [redacted] (state 3:open, purpose 6)
Oct 09 19:13:41.000 [info] internal (high-uptime) circ (length 4): [redacted](open) [redacted](open) [redacted](open) [redacted](open)
Oct 09 19:13:41.000 [err] _circuit_mark_for_close(): Bug: circuitlist.c:1222: _circuit_mark_for_close: Assertion ocirc->rend_data failed; aborting.

circuit_expire_building can only close a circuit with this purpose with reason == END_CIRC_REASON_TIMEOUT. Which seemed odd...

The assertion occurred in the following chunk of code, added to help fix #3825:

 } else if (circ->purpose == CIRCUIT_PURPOSE_C_INTRODUCING &&
             reason != END_STREAM_REASON_TIMEOUT) {

... but END_CIRC_REASON_TIMEOUT is not equal to END_STREAM_REASON_TIMEOUT. Oooops.

comment:13 in reply to:  12 Changed 6 years ago by rransom

Replying to rransom:

... but END_CIRC_REASON_TIMEOUT is not equal to END_STREAM_REASON_TIMEOUT. Oooops.

See bug3825c ( https://git.torproject.org/rransom/tor.git bug3825c ) for a fix for this.
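Roughly, the issue is that circuit-close reasons and stream-end reasons are separate sets of constants, so comparing a circuit-close reason against END_STREAM_REASON_TIMEOUT can never recognize our own circuit-build timeout. A standalone toy of the distinction (the numeric values below are placeholders, not Tor's real constants, and this is not the actual bug3825c diff):

  #include <stdio.h>

  #define END_STREAM_REASON_TIMEOUT  7   /* placeholder value */
  #define END_CIRC_REASON_TIMEOUT   10   /* placeholder value */

  /* Decide whether a closed introduction circuit counts as a real
   * intro-point failure (anything other than our own build timeout). */
  static int
  is_real_intro_failure(int circ_close_reason)
  {
    /* Buggy form: compares a circuit reason against a stream constant,
     * so a timed-out intro circuit is miscounted as a hard failure. */
    int buggy   = (circ_close_reason != END_STREAM_REASON_TIMEOUT);
    /* Fixed form: compare against a constant from the same set. */
    int correct = (circ_close_reason != END_CIRC_REASON_TIMEOUT);
    printf("reason=%d buggy=%d correct=%d\n",
           circ_close_reason, buggy, correct);
    return correct;
  }

  int
  main(void)
  {
    is_real_intro_failure(END_CIRC_REASON_TIMEOUT); /* a timeout: not a failure */
    return 0;
  }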

comment:14 Changed 6 years ago by atoruser

I tried with Tor 0.2.3.5-alpha and the problem still exists.

I don't think the explanation that one of the hidden service intro points is unreachable is correct; this is happening way too often, and sometimes it takes a few restarts before it will connect. I use hidden services a lot, and it seems too much of a coincidence that all hidden services suddenly started picking unreliable intro points (this problem only started about 3 months ago).
Anyway, doesn't Tor try to connect to two of the intro points simultaneously?

comment:15 in reply to:  14 Changed 6 years ago by rransom

Summary: Hidden service unavailable even though I know it's up → HS intro points overloaded with CREATE cells cause connectivity failures

Replying to atoruser:

I tried with Tor 0.2.3.5-alpha and the problem still exists.

I'm not surprised. Tor 0.2.3.5-alpha doesn't even have the client-side changes that might start to fix this bug.

I don't think the explanation that one of the hidden service intro points is unreachable is correct; this is happening way too often, and sometimes it takes a few restarts before it will connect.

If one of a hidden service's intro points is overloaded with CREATE cells, the client-side fixes I wrote for this ticket will cause the client to give up on that intro point permanently, but consider retrying intro points that failed due to the client's circuit-build timeout. That should at least give a second attempt to connect to the HS a chance of working.

Other issues that we know cause hidden service connectivity problems include:

  • #1297 (service-side)
  • #3335 (now fixed in Git master; the fix needs to be shipped in a 0.2.3.x release for testing, then shipped on 0.2.2.x, ASAP)
  • the service-side part of #3825 (popular hidden services should open more intro points, to spread the load a bit better)

See also #3460 (the most common cause of user HS-unreachability complaints); we need to fix it on the service side, and my plan for how to fix it determines how #3825 (service-side) should be fixed.

I use hidden services a lot, and it seems too much of a coincidence that all hidden services suddenly started picking unreliable intro points (this problem only started about 3 months ago).

It's not a matter of hidden services picking unreliable relays for use as intro points. The problem is that being chosen as one of a popular hidden service's intro points makes a relay unreliable, because clients start extending many more circuits to HS intro points, and due to the client-side part of this bug, when the relay starts to become overloaded, the clients respond by overloading it harder.

Currently, I'm working on code that will (hopefully) help popular HSes spread their intro points (and the flood of client requests they produce) across more relays.

Anyway, doesn't Tor try to connect to two of the intro points simultaneously?

No. You may be mistaking the client's rendezvous circuit for another intro circuit.

comment:16 Changed 6 years ago by nickm

Merged bug3825c to master.

comment:17 Changed 6 years ago by rransom

Status: needs_review → assigned

Moving back to ‘assigned’.

comment:18 Changed 6 years ago by rransom

Milestone: Tor: 0.2.2.x-final → Tor: 0.2.3.x-final

comment:19 Changed 6 years ago by rransom

Status: assigned → needs_review

See bug3825b-v8 ( https://git.torproject.org/rransom/tor.git bug3825b-v8 ) for a service-side change that may help fix this bug.

comment:20 Changed 6 years ago by nickm

re bug3825b-v8 :

The only documented error I can find that you can get from time() is EFAULT, which shouldn't be possible for time(NULL).

Never float; always double.

In a (double / int) calculation, you shouldn't need to cast the int to double; see the C standard, or K&R, or wherever.

The fractional_n_intro_points_wanted_to_replace_this_one calculation is tricky, and the extra parens for the casts make it harder to read. It really needs a comment to explain what it's supposed to be calculating and why.

comment:21 in reply to:  20 Changed 6 years ago by rransom

Replying to nickm:

re bug3825b-v8 :

The only documented error I can find that you can get from time() is EFAULT, which shouldn't be possible for time(NULL).

Fixup pushed.

Never float; always double.

Fixup pushed.

In a (double / int) calculation, you shouldn't need to cast the int to double; see the C standard, or K&R, or wherever.

Fixup pushed.

The fractional_n_intro_points_wanted_to_replace_this_one calculation is tricky, and the extra parens for the casts make it harder to read. It really needs a comment to explain what it's supposed to be calculating and why.

Multiple fixups pushed for this one.

comment:22 Changed 6 years ago by nickm

Squashed and merged! Can we close this one now?

comment:23 Changed 6 years ago by rransom

There is one more change that might help fix this issue: we could make relays which are overloaded with CREATE cells close some intro-point circuits (specifically, any intro point which a client introduces to while the relay is overloaded). I'm pretty sure that this change would not make any hidden service less reachable, but it could put significantly more load on the HSDir relays for a hidden service.
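A rough standalone sketch of that idea (hypothetical names and load check, not a patch):

  #include <stdbool.h>
  #include <stdio.h>

  typedef struct {
    int id;
    bool open;
  } intro_circuit_t;

  /* In a real relay this would come from whatever CREATE-cell load
   * monitoring the relay already does; here it is just a flag. */
  static bool relay_overloaded_with_creates = true;

  static void
  handle_introduction_on(intro_circuit_t *circ)
  {
    if (relay_overloaded_with_creates) {
      /* Close the intro circuit instead of continuing to attract clients
       * to an overloaded relay; the service will pick a replacement intro
       * point and republish its descriptor. */
      circ->open = false;
      printf("closing intro circuit %d due to CREATE-cell overload\n", circ->id);
      return;
    }
    printf("relaying the introduction on circuit %d\n", circ->id);
  }

  int
  main(void)
  {
    intro_circuit_t c = { 42, true };
    handle_introduction_on(&c);
    return 0;
  }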

comment:24 Changed 6 years ago by hellais

Taking a look at commit dae000735e75b178cdf27000d316f6504bf61373, I am a bit unsure about the reasoning behind the number of intro points to open once it realizes that one should be torn down.

Let me try and explain how I understand this new Tor behavior:

n is the original number of IPs

If a Tor HS detects that an intro-point circuit is being overloaded by lots of CREATE cells, it will close that intro point.

At this point it will have n-1 IPs active, and I need to determine the x that should be added to n-1 to get the new number of intro points.

constants:
IP_MIN_LT = minimum lifetime in seconds of an IP (18 hours = 18*60*60)
IP_CON_LT = number of INTRODUCTION2 connections before the IP should die (16384)

variables:
time_since_publishing = time in seconds since the HS has been published to the DA

x = ((time_since_publishing/IP_MIN_LT)*(accepted_ip_connection)/(IP_CON_LT))*1.5

I have some doubts about this choice, since the two factors of this function converge to 4/3 and 1, and therefore x -> 2. Since you are assigning a double to an int:

     n_intro_points_wanted_to_replace_this_one =
       fractional_n_intro_points_wanted_to_replace_this_one;

This will always be equal to one, so there will never be more than one new intro point.
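For illustration (a standalone C snippet, not Tor code): assigning a double to an int truncates toward zero, so any value below 2.0 yields a single replacement:

  #include <stdio.h>

  int
  main(void)
  {
    double fractional_wanted = 1.9;    /* near the limit argued above */
    int n_wanted = fractional_wanted;  /* implicit conversion truncates to 1 */
    printf("%f -> %d\n", fractional_wanted, n_wanted);
    return 0;
  }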

Another funny thing is that you are checking if (fractional_n_intro_points_wanted_to_replace_this_one < 0). When will this occur? In the previous formula you are dealing with integers that are strictly greater than 0, so when will this ever happen?
It is not a bug, but it is strange to do such a thing.

I will split this issue into a few new tickets with some proposed solutions and better numbers for INTRO_POINT_LIFETIME_INTRODUCTIONS, INTRO_POINT_LIFETIME_MIN_SECONDS, INTRO_POINT_LIFETIME_MAX_SECONDS and NUM_INTRO_POINTS_MAX.

comment:25 in reply to:  24 ; Changed 6 years ago by rransom

Replying to hellais:

Taking a look at commit dae000735e75b178cdf27000d316f6504bf61373, I am a bit unsure about the reasoning behind the number of intro points to open once it realizes that one should be torn down.

Let me try and explain how I understand this new Tor behavior:

n is the original number of IPs

If a Tor HS detects that an intro-point circuit is being overloaded by lots of CREATE cells, it will close that intro point.

At this point it will have n-1 IPs active, and I need to determine the x that should be added to n-1 to get the new number of intro points.

constants:
IP_MIN_LT = minimum lifetime in seconds of an IP (18 hours = 18*60*60)
IP_CON_LT = number of INTRODUCTION2 connections before the IP should die (16384)

variables:
time_since_publishing = time in seconds since the HS has been published to the DA

x = ((time_since_publishing/IP_MIN_LT)*(accepted_ip_connection)/(IP_CON_LT))*1.5

I have some doubts about this choice, since the two factors of this function converge to 4/3 and 1, and therefore x -> 2. Since you are assigning a double to an int:

What does “converge” mean?

     n_intro_points_wanted_to_replace_this_one =
       fractional_n_intro_points_wanted_to_replace_this_one;

This will always be equal to one, so there will never be more than one new intro point.

No.

Another funny thing is that you are checking if (fractional_n_intro_points_wanted_to_replace_this_one < 0). When will this occur? In the previous formula you are dealing with integers that are strictly greater than 0, so when will this ever happen?

The difference between the current time and the time at which an introduction point was first published may be negative.

comment:26 in reply to:  25 ; Changed 6 years ago by hellais

Replying to rransom:

Replying to hellais:

Taking a look at commit dae000735e75b178cdf27000d316f6504bf61373, I am a bit unsure about the reasoning behind the number of intro points to open once it realizes that one should be torn down.

Let me try and explain how I understand this new Tor behavior:

n is the original number of IPs

If a Tor HS detects that an intro-point circuit is being overloaded by lots of CREATE cells, it will close that intro point.

At this point it will have n-1 IPs active, and I need to determine the x that should be added to n-1 to get the new number of intro points.

constants:
IP_MIN_LT = minimum lifetime in seconds of an IP (18 hours = 18*60*60)
IP_CON_LT = number of INTRODUCTION2 connections before the IP should die (16384)

variables:
time_since_publishing = time in seconds since the HS has been published to the DA

x = ((time_since_publishing/IP_MIN_LT)*(accepted_ip_connection)/(IP_CON_LT))*1.5

I have some doubts about this choice, since the two factors of this function converge to 4/3 and 1, and therefore x -> 2. Since you are assigning a double to an int:

What does “converge” mean?

Correct me if I am wrong, but from what I understand time_since_publishing -> INTRO_POINT_LIFETIME_MAX_SECONDS (24*3600) and accepted_ip_connection -> INTRO_POINT_LIFETIME_INTRODUCTIONS.

Is there any circumstance when accepted_ip_connection is >> INTRO_POINT_LIFETIME_INTRODUCTIONS AND time_since_publishing >> INTRO_POINT_LIFETIME_MAX_SECONDS?

     n_intro_points_wanted_to_replace_this_one =
       fractional_n_intro_points_wanted_to_replace_this_one;

This will always be equal to one, so there will never be more than one new intro point.

No.

If my argument above is incorrect then maybe it might also be greater than 1, but it would still not be greater than 2. Is there a case in which this would not happen?

Another funny thing is that you are checking if (fractional_n_intro_points_wanted_to_replace_this_one < 0). When will this occur? In the previous formula you are dealing with integers that are strictly greater than 0, so when will this ever happen?

The difference between the current time and the time at which an introduction point was first published may be negative.

comment:27 in reply to:  26 ; Changed 6 years ago by rransom

Replying to hellais:

Replying to rransom:

Replying to hellais:

Taking a look at commit dae000735e75b178cdf27000d316f6504bf61373, I am a bit unsure about the reasoning behind the number of intro points to open once it realizes that one should be torn down.

Let me try and explain how I understand this new Tor behavior:

n is the original number of IPs

If a Tor HS detects that an intro-point circuit is being overloaded by lots of CREATE cells, it will close that intro point.

At this point it will have n-1 IPs active, and I need to determine the x that should be added to n-1 to get the new number of intro points.

constants:
IP_MIN_LT = minimum lifetime in seconds of an IP (18 hours = 18*60*60)
IP_CON_LT = number of INTRODUCTION2 connections before the IP should die (16384)

variables:
time_since_publishing = time in seconds since the HS has been published to the DA

x = ((time_since_publishing/IP_MIN_LT)*(accepted_ip_connection)/(IP_CON_LT))*1.5

I have some doubts about this choice, since the two factors of this function converge to 4/3 and 1, and therefore x -> 2. Since you are assigning a double to an int:

Sorry. That isn't the formula I meant to use there.

See my bug3825c branch for a fix.

comment:28 in reply to:  27 ; Changed 6 years ago by hellais

Replying to rransom:

Replying to hellais:

Replying to rransom:

Replying to hellais:

Taking a look at commit dae000735e75b178cdf27000d316f6504bf61373, I am a bit unsure about the reasoning behind the number of intro points to open once it realizes that one should be torn down.

Let me try and explain how I understand this new Tor behavior:

n is the original number of IPs

If a Tor HS detects that an intro-point circuit is being overloaded by lots of CREATE cells, it will close that intro point.

At this point it will have n-1 IPs active, and I need to determine the x that should be added to n-1 to get the new number of intro points.

constants:
IP_MIN_LT = minimum lifetime in seconds of an IP (18 hours = 18*60*60)
IP_CON_LT = number of INTRODUCTION2 connections before the IP should die (16384)

variables:
time_since_publishing = time in seconds since the HS has been published to the DA

x = ((time_since_publishing/IP_MIN_LT)*(accepted_ip_connection)/(IP_CON_LT))*1.5

I have some doubts about this choice, since the two factors of this function converge to 4/3 and 1, and therefore x -> 2. Since you are assigning a double to an int:

Sorry. That isn't the formula I meant to use there.

See my bug3825c branch for a fix.

OK. That makes *much* more sense. Though I am still dubious about the 1.5 factor and whether it is a good usage factor.

Another thing that might be a problem is that NUM_INTRO_POINTS_MAX is not being used to compute the *total* maximum number of IPs, but rather the maximum number of IPs to be rebuilt once one dies.

This means that I could potentially overload a set of IPs and force them to always recreate 10 new ones, and therefore end up with a very large number of IPs (theoretically infinite).

Is this your intention? I believe we should pick 60 as the maximum number of IPs that an HS is connected to (for the reasoning behind this choice, see #4862).

line 1001 should read instead:

- if (n_intro_points_wanted_now < NUM_INTRO_POINTS_DEFAULT) {
+ if ((current_intro_point_countr + n_intro_points_wanted_now) < NUM_INTRO_POINTS_DEFAULT) {

Do you agree?

comment:29 in reply to:  28 ; Changed 6 years ago by rransom

Replying to hellais:

Sorry. That isn't the formula I meant to use there.

See my bug3825c branch for a fix.

OK. That makes *much* more sense. Though I am still dubious about the 1.5 factor and whether it is a good usage factor.

The 1.5 is meant to bias it toward more intro points rather than less.

Another thing that might be a problem is that NUM_INTRO_POINTS_MAX is not being used to compute the *total* maximum number of IPs, but rather the maximum number of IPs to be rebuilt once one dies.

This means that I could potentially overload a set of IPs and force them to always recreate 10 new ones, and therefore end up with a very large number of IPs (theoretically infinite).

I clamp n_intro_points_wanted_to_replace_this_one to a maximum of NUM_INTRO_POINTS_MAX only to avoid integer overflow when n_intro_points_wanted_now is updated. n_intro_points_wanted_now is also clamped to between NUM_INTRO_POINTS_DEFAULT and NUM_INTRO_POINTS_MAX, so no HS will establish more than a total of NUM_INTRO_POINTS_MAX intro points at a time.
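As a rough standalone illustration of that clamping (not the actual Tor code; NUM_INTRO_POINTS_MAX matches the value discussed in this ticket, and the default of 3 is an assumption here):

  #include <stdio.h>

  #define NUM_INTRO_POINTS_DEFAULT 3   /* assumed default */
  #define NUM_INTRO_POINTS_MAX 10

  static unsigned
  clamp_intro_points_wanted(unsigned wanted)
  {
    if (wanted < NUM_INTRO_POINTS_DEFAULT)
      wanted = NUM_INTRO_POINTS_DEFAULT;
    if (wanted > NUM_INTRO_POINTS_MAX)
      wanted = NUM_INTRO_POINTS_MAX;
    return wanted;
  }

  int
  main(void)
  {
    unsigned n_wanted_now = 8;
    unsigned replacements = 7;   /* itself already capped elsewhere */
    /* The running total can never exceed NUM_INTRO_POINTS_MAX. */
    n_wanted_now = clamp_intro_points_wanted(n_wanted_now - 1 + replacements);
    printf("service will now try to keep %u intro points\n", n_wanted_now);
    return 0;
  }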

Is this your intention? I believe we should pick 60 as the maximum number of IPs that an HS is connected to (for the reasoning behind this choice, see #4862).

60 is too many. The point of using 10 intro points is to ensure that the load is spread out enough to not overload the intro-point relays and cause buggy misdesigned clients to keep pounding the intro-point relays so they stay overloaded. If that client handle-congestion-by-DDoSing bug didn't exist, 6 intro points would be enough for any HS, because any client load capable of overloading the intro points would also overload the service's 6 HSDir nodes.

(If a service maintains too many intro points, it will load its HSDir nodes more by republishing its descriptor more often. In theory, we could put only a subset of a service's intro points into each of its 6 descriptors, but that would involve far more extensive changes than I want to implement unless we have evidence that they're really necessary.)

line 1001 should read instead:

- if (n_intro_points_wanted_now < NUM_INTRO_POINTS_DEFAULT) {
+ if ((current_intro_point_countr + n_intro_points_wanted_now) < NUM_INTRO_POINTS_DEFAULT) {

Do you agree?

No. n_intro_points_wanted_now is the new total number of intro points for the service to maintain:

    log_info(LD_REND, "Replacing closing intro point for service %s "
             "with %d new intro points (wanted %g replacements); "
             "service will now try to have %u intro points",
             rend_service_describe_for_log(service),
             n_intro_points_really_replacing_this_one,
             fractional_n_intro_points_wanted_to_replace_this_one,
             n_intro_points_really_wanted_now);

    service->n_intro_points_wanted = n_intro_points_really_wanted_now;

comment:30 in reply to:  29 ; Changed 6 years ago by hellais

Replying to rransom:

Replying to hellais:

Sorry. That isn't the formula I meant to use there.

See my bug3825c branch for a fix.

OK. That makes *much* more sense. Though I am still dubious about the 1.5 factor and whether it is a good usage factor.

The 1.5 is meant to bias it toward more intro points rather than less.

Another thing that might be a problem is that NUM_INTRO_POINTS_MAX is not being used to compute the *total* maximum number of IPs, but rather the maximum number of IPs to be rebuilt once one dies.

This means that I could potentially overload a set of IPs and force them to always recreate 10 new ones, and therefore end up with a very large number of IPs (theoretically infinite).

I clamp n_intro_points_wanted_to_replace_this_one to a maximum of NUM_INTRO_POINTS_MAX only to avoid integer overflow when n_intro_points_wanted_now is updated. n_intro_points_wanted_now is also clamped to between NUM_INTRO_POINTS_DEFAULT and NUM_INTRO_POINTS_MAX, so no HS will establish more than a total of NUM_INTRO_POINTS_MAX intro points at a time.

Is this your intention? I believe we should pick 60 as the maximum number of IPs that an HS is connected to (for the reasoning behind this choice, see #4862).

60 is too many. The point of using 10 intro points is to ensure that the load is spread out enough to not overload the intro-point relays and cause buggy misdesigned clients to keep pounding the intro-point relays so they stay overloaded. If that client handle-congestion-by-DDoSing bug didn't exist, 6 intro points would be enough for any HS, because any client load capable of overloading the intro points would also overload the service's 6 HSDir nodes.

(If a service maintains too many intro points, it will load its HSDir nodes more by republishing its descriptor more often. In theory, we could put only a subset of a service's intro points into each of its 6 descriptors, but that would involve far more extensive changes than I want to implement unless we have evidence that they're really necessary.)

OK, I guess this makes sense, though I would also like some evidence to support this number, rather than it just being something that we think is good enough.

What tests do you think we could do to understand this? We can run them on tor2web with real users.

line 1001 should read instead:

- if (n_intro_points_wanted_now < NUM_INTRO_POINTS_DEFAULT) {
+ if ((current_intro_point_countr + n_intro_points_wanted_now) < NUM_INTRO_POINTS_DEFAULT) {

Do you agree?

No. n_intro_points_wanted_now is the new total number of intro points for the service to maintain:

    log_info(LD_REND, "Replacing closing intro point for service %s "
             "with %d new intro points (wanted %g replacements); "
             "service will now try to have %u intro points",
             rend_service_describe_for_log(service),
             n_intro_points_really_replacing_this_one,
             fractional_n_intro_points_wanted_to_replace_this_one,
             n_intro_points_really_wanted_now);

    service->n_intro_points_wanted = n_intro_points_really_wanted_now;

You are correct.

Though I still have some doubts about the formula:

(1.5 * ((intro_point_accepted_intro_count(intro) /
         (double)INTRO_POINT_LIFETIME_INTRODUCTIONS) /
        (((double)now - intro->time_published) /
         INTRO_POINT_LIFETIME_MIN_SECONDS)));

By considering:

x = intro num
y = intro secs

usage = (x/(2^14))/((y)/(18*3600))

And these are the domains of this function:
x = (0, 2^14)
y = (0, 18*3600)

By plotting it (http://www.wolframalpha.com/input/?i=1.5+*+%28x%2F%282%5E14%29%29%2F%28%28y%29%2F%2818*3600%29%29+x+from+0+to+2%5E14+and+y+from+0+to+18*3600) I see that as x gets bigger (more intro2 cells), the usage factor does not get bigger, but smaller.

We want usage to get bigger as x gets bigger and y gets smaller, correct?

Using that formula that is not the case.

comment:31 in reply to:  30 Changed 6 years ago by hellais

By plotting it (http://www.wolframalpha.com/input/?i=1.5+*+%28x%2F%282%5E14%29%29%2F%28%28y%29%2F%2818*3600%29%29+x+from+0+to+2%5E14+and+y+from+0+to+18*3600) I see that as x gets bigger (more intro2 cells), the usage factor does not get bigger, but smaller.

We want usage to get bigger as x gets bigger and y gets smaller, correct?

Sorry, but I believe I still need to recover from post-CCC; usage does in fact get bigger as x gets bigger.
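For example, a quick standalone check with the simplified formula above (usage = 1.5 * (x/2^14) / (y/(18*3600))) shows usage growing with x when y is held fixed:

  #include <stdio.h>

  int
  main(void)
  {
    const double intro_budget = 16384.0;      /* 2^14 INTRODUCTION2 cells */
    const double min_lifetime = 18.0 * 3600;  /* 18 hours in seconds */
    const double y = 6.0 * 3600;              /* example: 6 hours elapsed */
    for (double x = 1000; x <= 16000; x += 5000) {
      double usage = 1.5 * (x / intro_budget) / (y / min_lifetime);
      printf("x=%5.0f  usage=%.2f\n", x, usage);
    }
    return 0;
  }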

comment:32 Changed 6 years ago by nickm

Can somebody summarize what I should be reviewing at this point, and for what? I'm a bit lost in the discussion above.

comment:33 in reply to:  32 Changed 6 years ago by rransom

Replying to nickm:

Can somebody summarize what I should be reviewing at this point, and for what? I'm a bit lost in the discussion above.

My bug3825c branch needs review; it fixes a bug I introduced in my last fixup commit on my bug3825b-v8 branch.

comment:34 Changed 6 years ago by nickm

Okay. I'd like to request the use of named intermediate temporary variables here; this expression is very hard to read.

Also, 1.5*(a/B)/(c/D) for B and D constant reduces nicely to 1.5*(a/c)*(the constant D/B). Is there a reason not to think of this expression this way?
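For instance, something along these lines (hypothetical names, illustrative only, not a patch):

  #include <stdio.h>

  #define INTRO_POINT_LIFETIME_INTRODUCTIONS 16384
  #define INTRO_POINT_LIFETIME_MIN_SECONDS   (18*60*60)

  static double
  fractional_replacements_wanted(unsigned accepted_intro_count,
                                 double seconds_since_published)
  {
    /* Fraction of the introduction budget this intro point has used up. */
    double intro_budget_used =
      accepted_intro_count / (double)INTRO_POINT_LIFETIME_INTRODUCTIONS;
    /* Fraction of the minimum lifetime that has elapsed so far. */
    double lifetime_elapsed =
      seconds_since_published / INTRO_POINT_LIFETIME_MIN_SECONDS;
    /* The 1.5 biases the result toward opening more intro points. */
    return 1.5 * intro_budget_used / lifetime_elapsed;
  }

  int
  main(void)
  {
    /* An intro point that used its whole budget in a third of its minimum
     * lifetime would want about 4.5 replacements. */
    printf("%.2f\n", fractional_replacements_wanted(16384, 6.0*3600));
    return 0;
  }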

comment:35 Changed 6 years ago by nickm

Resolution: fixed
Status: needs_review → closed

Tweaked, countertweaked, squashed, merged. Thanks!

comment:36 Changed 6 years ago by atoruser

Resolution: fixed
Status: closed → reopened

This is still not fixed.
Happened again today with 0.2.3.12-alpha.

[cropped 4 hours of this error repeated]
Apr 09 04:08:14.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Apr 09 04:08:33.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Apr 09 04:08:55.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Apr 09 04:09:29.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Apr 09 04:10:15.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:25. Giving up. (waiting for circuit)
Apr 09 04:14:39.000 [notice] Removed 7/9048 microdescriptors as old.
Apr 09 04:19:37.000 [notice] Interrupt: exiting cleanly.
Apr 09 04:19:55.000 [notice] Tor 0.2.3.12-alpha (git-800942b4176ca31c) opening log file.

I restarted Tor and it connected fine.

If it helps, sending the new identity signal with Vidalia also fixes the problem, instead of needing to restart Tor completely.

comment:37 Changed 6 years ago by nickm

Milestone: Tor: 0.2.3.x-final → Tor: unspecified
Status: reopened → needs_information

Requires more investigation. If there's a bug, we can try to do the fix in 0.2.3, but if it's a hidden-service-side robustness feature we need, it'll have to wait for 0.2.4.

comment:38 in reply to:  36 Changed 6 years ago by rransom

Replying to atoruser:

This is still not fixed.
Happened again today with 0.2.3.12-alpha.

I restarted Tor and it connected fine.

If it helps, sending the new identity signal with vidalia also fixes the problem instead of needing to restart Tor completely.

Did you need to restart the hidden-service server, or was sending NEWNYM to the client sufficient?

If you needed to restart the HS server, the most likely remaining cause of this is the adaptive CBT code setting a timeout so low that the server cannot build circuits to HSDir nodes. Nick has decided not to remove or disable the adaptive circuit-build timeout, so if this is the problem, it won't be fixed.

If you only needed to NEWNYM the client, that's a different bug, and potentially fixable.

comment:39 Changed 5 years ago by nickm

Keywords: tor-hs added

comment:40 Changed 5 years ago by nickm

Component: Tor Hidden Services → Tor

comment:41 Changed 2 years ago by dgoulet

Resolution: user disappeared
Severity: Normal
Sponsor: None
Status: needs_information → closed

Three years with no activity. Also, most of the code being discussed here has been removed (the adaptive intro point algorithm). I'm closing this. Please open a new ticket with more details if this could still be an issue.
