Opened 5 years ago

Closed 5 years ago

#11553 closed enhancement (fixed)

Be more verbose when warning about not finding a circuit ID

Reported by: ra Owned by:
Priority: High Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version: Tor: 0.2.5.3-alpha
Severity: Keywords: tor-relay, 024-backport, 023-backport, andrea-review-0254, 025-triaged, nickm-backport-02422
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

On IRC:

"[WARN] failed to get unique circID. [99 duplicates hidden]
[WARN] No unused circ IDs. Failing.

24/7 for days now. filling up the logs quite fast. Deosnt seem to have any effect on the relays speed. didnt find any solid explanation for what causes this error.

[WARN] failed to get unique circID.
then this
[WARN] No unused circ IDs. Failing."

It is hard to say why this error occurs without more verbose logging output.

Child Tickets

Attachments (1)

no_circuit_ids_left_verbose.patch (1009 bytes) - added by ra 5 years ago.
More verbose warning messages.

Download all attachments as: .zip

Change History (32)

Changed 5 years ago by ra

More verbose warning messages.

comment:1 Changed 5 years ago by ra

Patch has not been tested!

comment:2 Changed 5 years ago by cypherpunks

Patch has not been tested!

"(Possible DoS from client)."

More likely "to" client, if someone tries to DDoS some old HS. eh. wrong. Don't know how it can be triggered for chan->is_client either.

This warnings need to be verbose only if you want to verify code bugs. If you sure no more bugs than evil attacker then better to mute this warns.

Last edited 5 years ago by cypherpunks (previous) (diff)

comment:3 Changed 5 years ago by nickm

Keywords: tor-relay 024-backport added
Milestone: Tor: 0.2.5.x-final
Priority: trivialmajor

Agreed; my best guess for explanation here would be that on some relay-to-relay connection, you're getting close to the the limit of 215 circuit IDs in each direction that Tor 0.2.3 and earlier had.

We should never be calling this function on a connection with a client, since we should never be trying to extend a circuit towards a client.

Also I bet this is a very expensive check if we're really exhausting the circID space here. So, upgrading this to "major" and marking 024-backportable.

comment:4 Changed 5 years ago by nickm

Keywords: 023-backport added

comment:5 Changed 5 years ago by nickm

The other possibility here is that our circuit_id_in_use code that we added to solve #7912 has a flaw somewhere, and circuitIDs are under some circumstances being marked as unusable but never getting made available again.

comment:6 Changed 5 years ago by nickm

Status: newneeds_review

My branch bug11553_024 has an improved warning, plus an improved (?) randomized algorithm to handle exhaustion scenarios. This algorithm makes its users distinguishable by guards from older clients, so if it goes in a stable release, it should go in the same release as something like #11438.

My branch bug11553_025 merges that forward to 0.2.5, and includes and attempt at diagnosing whether the #7912 fix could be responsible.

comment:7 Changed 5 years ago by cypherpunks

Btw, proposal 214 or/and #7351 ticket never was mentioned in Changelog nor ReleaseNotes for new 0.2.4.x. Silently implemented.

comment:8 Changed 5 years ago by nickm

(Whoops. Looks like we forgot a changes file? Opening another ticket for that.)

comment:9 Changed 5 years ago by cypherpunks

My branch bug11553_025

What about warn:

log_warn(LD_CIRC,"failed to get unique circID.")

comment:10 Changed 5 years ago by nickm

Thanks; added a rate-limiter.

comment:11 Changed 5 years ago by ra

A few comments/questions from my side:
a) The new log message should be fine.
b) Limiting the number of iterations to find a usable circuit ID in the worst case to 64 instead of 215 or 231 sounds also good. Having non-predictable circuit IDs sounds like a wise move to me anyway. Nevertheless, although the chance of not finding a usuable circuit ID for a specific channel within 64 tries is quite small, it will eventually happen and people will end up having warning messages in their log files; which is perfectly fine, as long as this happens very infrequently. I think it is really bad to have ignorable warning messages, but unfortunately I have no better suggestion at this time on how to handle this but stating in the log message that the warning can be ignored if it happens only very rarely.
c) Defining MAX_CIRCID_ATTEMPTS 64 seems like a good choice to me. But there is a comment missing in the code explaining why the constant is 64 and not 23, 42, 128 or 1024.
d) Is a warning only printed once for a channel (chan->warned_circ_ids_exhausted)? If so, why?

Last edited 5 years ago by ra (previous) (diff)

comment:12 Changed 5 years ago by nickm

b) I think that if this does turn out to be ignorable, we can downrate it even further.

c) Hm. I don't have a great justification. For any good N, N/2 and N*2 are probably good choices too. Basically, I wanted a number that wasn't too high, but which still gave a pretty small chance of failure when the space started to fill up. For N=64, there's a one-in-a-million failure chance when the space is 80% full, and a one-in-850 chance when the space is 90% full, and a one-in-26 chance when the space is 95% full. That seemed okay.

d) So as to avoid spamming the log with junk. I guess we could refine it to rate-limit it per channel.

ra, are you able to test this patch over the weekend? I'd like to know what kind of results it reports, and whether it indicates a possible bug in the #7351 code.

comment:13 Changed 5 years ago by nickm

ra, are you able to test this patch over the weekend?

And by "this patch" I mean the bug11553_025 version of it.

comment:14 Changed 5 years ago by andrea

Begin code review:

  • New logging in bb9b4c37f8e7f5cf78918f382e90d8b11ff42551 looks okay, but perhaps we should consider resetting warn_circ_ids_exhausted if the number of circuits drops or if sufficient time elapses since it was set, so we don't block warnings on that channel forever even if the exhaustion condition is relieved later?
  • 0d75344b0e0eafc89db89a974e87b16564cd8f0a:
    • This code looks okay, but the possibility of occasional false positives on circuit ID exhaustion amplifies my above-stated concern about never letting the blockage on the log message age out.
    • We could be smarter about this with better data structures; suppose, for example, the circuit IDs in use per channel lived in a balanced tree with each node knowing how many nodes its subtree had. Then we could know how many circuit IDs are available to assign (N = max_circ_id - num_circ_ids from the root), pick a random i with 0 <= i < n, and then walk down the tree adjusting i as appropriate for which circuit IDs are already in use. This would assign a uniformly distributed random unused circuit ID in deterministic log time in the number of circuit IDs already assigned, and never fail.
  • 985deaaaf7b7397857e02206e89392e0ee101077 looks okay to me

comment:15 Changed 5 years ago by cypherpunks

What chance to fill test_circ_id with zeroes by crypto_rand? Non zero.
Non zero chance to get zero circ_id and to violate specification.

comment:16 in reply to:  15 Changed 5 years ago by andrea

Replying to cypherpunks:

What chance to fill test_circ_id with zeroes by crypto_rand? Non zero.
Non zero chance to get zero circ_id and to violate specification.

Good near-catch. Not actually going to cause it to try to use a zero circ ID since get_unique_circ_id_by_chan() returns zero to indicate failure, but that amounts to a lucky coincidence. Hitting the 2-15 or 2-31 probability to zero circ_id would cause it to exit the loop without actually making MAX_CIRCID_ATTEMPTS tries, so it increases the probability to fail to find a circuit ID. For the narrow circuit ID case, this is the dominant source of failures to find a circuit ID when N/max_range is less than about 0.85, if my analysis is correct. This code should check for the zero case in the loop condition, I think.

comment:17 Changed 5 years ago by nickm

(See #11553 too.)

comment:18 in reply to:  12 Changed 5 years ago by ra

Replying to nickm:

ra, are you able to test this patch over the weekend?

I am only the messenger and did not encounter this problem myself (Trac account registration is still broken/disabled). Unfortunately compiling and running git branches does not seem to be an option. Sorry. Would providing a log file help in this case?

Last edited 5 years ago by ra (previous) (diff)

comment:19 Changed 5 years ago by nickm

I've added fixes for the above-noted issues to the branches. Better now?

We could be smarter about this with better data structures [...]

I agree that better data structures could solve this, but I think the current approach is better for 0.2.4 and 0.2.5; implementing an appropriate tree structure seems like it would be more time-consuming and error-prone than we'd want for something we intend to backport.

comment:20 Changed 5 years ago by nickm

Keywords: andrea-review-0254 added

comment:21 Changed 5 years ago by nickm

Keywords: 025-triaged added

comment:22 Changed 5 years ago by andrea

This branch looks fine to merge for 0.2.4/0.2.5 now. I'll create a new ticket for 0.2.6 for the balanced tree thing.

comment:23 Changed 5 years ago by nickm

That "better data structures" ticket is #11595.

comment:24 Changed 5 years ago by nickm

Milestone: Tor: 0.2.5.x-finalTor: 0.2.4.x-final

Merged to 0.2.5; marking backportable.

comment:25 Changed 5 years ago by nickm

Recommendation: backport if 0.2.5.4 turns out okay ; this is a potential CPU-eating and CPU-dos issue.

comment:26 Changed 5 years ago by nickm

Keywords: nickm-backport-02422 added

Adding a tag for tickets I think we should backport into 0.2.4.22. Omitting ones where I said "unsure"

comment:27 Changed 5 years ago by arma

Are you proposing to backport bug11553_024, i.e. with the new randomized circid selection? That sure seems like a feature to me ("it uses less cpu, especially in edge cases").

That said, I could get behind the very precision oriented change:

-      log_warn(LD_CIRC,"No unused circ IDs. Failing.");
+      log_warn(LD_CIRC,"No unused circIDs found on channel %s wide "
+               "circID support, with %u inbound and %u outbound circuits. "
+               "Failing a circuit.",
+               chan->wide_circ_ids ? "with" : "without",
+               chan->num_p_circuits, chan->num_n_circuits);

That way, if an 0.2.4.x relay operator hits this case again, we'll have more details than we had before. What would you think of that?

comment:28 Changed 5 years ago by nickm

The rationale for backporting the randomized CircID code it enables the "don't check 215 times in order to confirm that we're out of circuit IDs!" fix, which is a huge slowdown on connections that are, or have been maliciously made, heavily loaded.

comment:29 Changed 5 years ago by nickm

Current status: let's backport port that log message for 0.2.4.22

comment:30 Changed 5 years ago by nickm

I merged just bb9b4c37f8e7f5cf78918f382e90d8b11ff42551 into 0.2.4, for the logging tweak. Further backports are possible in later versions if warranted.

comment:31 Changed 5 years ago by arma

Resolution: fixed
Status: needs_reviewclosed

I'm closing, since this feature is in 0.2.5.x and maintaining three trees at once is no fun. Onward and upward!

Note: See TracTickets for help on using tickets.