Opened 2 years ago

Closed 2 years ago

Last modified 2 years ago

#23097 closed defect (fixed)

The circuit timeout prediction is not working properly

Reported by: dgoulet Owned by:
Priority: Very High Milestone: Tor: 0.3.2.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: 031-backport, review-group-22
Cc: mikeperry Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by dgoulet)

During prop224 service testing (#20657), I've encountered a weird behavior. Tor will start closing circuits with circuit_expire_old_circuits_clientside() and then open new ones for internal use (prediction) but set their timeout to 1 sec which leads to closing the circuit 30 sec later (NewCircuitPeriod defaults to 30 sec).

Condensed log info:

Aug 03 19:59:30.000 [info] circuit_expire_old_circuits_clientside(): Closing circuit 320 that has been unused for 27997 msec.
Aug 03 19:59:30.000 [info] circuit_expire_old_circuits_clientside(): Closing circuit 318 that has been unused for 29997 msec.
Aug 03 19:59:30.000 [info] circuit_expire_old_circuits_clientside(): Closing circuit 319 that has been unused for 28979 msec.
Aug 03 19:59:30.000 [info] circuit_predict_and_launch_new(): Have 0 clean circs (0 internal), need another internal circ for my hidden service.
Aug 03 19:59:30.000 [info] origin_circuit_new(): Circuit 321 chose an idle timeout of 1 based on 0 seconds of predictive building remaining.
...
Aug 03 20:00:01.000 [info] circuit_expire_old_circuits_clientside(): Closing circuit 321 that has been unused for 30991 msec.

So notice the circuit 321 with a timeout of 1 sec and then being closed 30 sec later... Basically, it's looping like that non stop. What I think is happening is this:

predicted_ports_prediction_time_remaining() returns 0 so the following computation always results in 1 sec (in origin_circuit_new()):

    int prediction_time_remaining =
      predicted_ports_prediction_time_remaining(time(NULL));
    circ->circuit_idle_timeout = prediction_time_remaining+1+
        crypto_rand_int(1+prediction_time_remaining/20);

We call rep_hist_note_used_internal() in the hs subsystem when we launch intro point and rendezvous circuits. That function sets last_prediction_add_time = now. Which is what we want because then predicted_ports_prediction_time_remaining() computes a idle_delta that is below the prediction_timeout (set between 30 and 60 mins by default, see channelpadding_get_circuits_available_timeout()).

Which means that after let's say 61 min for a prediction_timeout = 60min, idle_delta becomes bigger than prediction_timeout and thus returning 0 everytime ultimately making every new circuit timeout to 1 sec. See why below:

int
predicted_ports_prediction_time_remaining(time_t now)
{
  time_t idle_delta = now - last_prediction_add_time;

  /* Protect against overflow of return value. This can happen if the clock
   * jumps backwards in time. Update the last prediction time (aka last
   * active time) to prevent it. This update is preferable to using monotonic
   * time because it prevents clock jumps into the past from simply causing
   * very long idle timeouts while the monotonic time stands still. */
  if (last_prediction_add_time > now) {
    last_prediction_add_time = now;
    idle_delta = 0;
  }

  /* Protect against underflow of the return value. This can happen for very
   * large periods of inactivity/system sleep. */
  if (idle_delta > prediction_timeout)
    return 0;
    [RIGHT HERE, we return 0]
...

This is pretty bad because it means that every HS that sees no client for at least 30 to 60 min, will enter this loop of create/close circuits raising flags at the Guard but also putting pressure on the network.

Seems that this was introduced with the channel padding in tor-0.3.1.1-alpha.

Child Tickets

Change History (12)

comment:1 Changed 2 years ago by dgoulet

Component: - Select a componentCore Tor/Tor
Description: modified (diff)
Keywords: 031-backport added
Milestone: Tor: 0.3.2.x-final
Priority: MediumVery High

comment:2 Changed 2 years ago by dgoulet

Cc: mikeperry added

comment:3 Changed 2 years ago by mikeperry

Ok. I think the core problem is that the decision to build a "predicted circuit" for hidden services is actually independent from the per-port and internal prediction code. This is not the case for any other circuits launched from circuit_predict_and_launch_new().

One option is to call rep_hist_note_used_internal() from the needs_hs_server_circuits() block in circuit_predict_and_launch_new(), since it seems like we always want to be predicting hidden service circuits continuously if an onion service is configured?

Another option is to define a base minimum circuit idle timeout, and use it every single time in origin_circuit_new().

Do we have a preference?

(I'm also noticing that for prop247, we're going to want to change how circuit_predict_and_launch_new() works anyway, since we'll need to predict hidden service circuits with their specific purpose ahead of time, instead of just making CIRCUIT_PURPOSE_C_GENERAL. Cannibalization is not going to work for us anymore).

comment:4 in reply to:  3 Changed 2 years ago by dgoulet

Replying to mikeperry:
[...]

One option is to call rep_hist_note_used_internal() from the needs_hs_server_circuits() block in circuit_predict_and_launch_new(), since it seems like we always want to be predicting hidden service circuits continuously if an onion service is configured?

I'm actually puzzled. Right now, we assume that we need to have 3 internal circuits at all time for the service (SUFFICIENT_UPTIME_INTERNAL_HS_SERVERS 3) but I don't think that is optimal.

First, we now have to consider 2 HS subsystem (v2 and v3). Second, a service doesn't really know how many circuits it needs because the number of clients connecting is arbitrary that is can be 1 once a day or 1000 every minute. In terms of IP circuit, we open 3 early and keep them for a while so prediction won't help us much imo.

Bottom line, the current value of 3 is kind of "whatever why?" and we should only care about RP circuits meaning need_capacity = 1 always.

That being said, for now and short term, I think we should do what you propose, in needs_hs_server_circuits() and always consider that we need capacity meaning: rep_hist_note_used_internal(now, 1, 1).

comment:5 Changed 2 years ago by mikeperry

Status: newneeds_review

Ok, the simple fix we discussed is in mikeperry/bug23097. Is there an easy way to repro this?

In the meantime I can let an idle HS run overnight with this code, and see if it still builds circuits 1x/sec or not. Will that do it?

comment:6 Changed 2 years ago by mikeperry

Ok I ran that branch (off of main-0.3.1) overnight with an idle hidden service and did not observe that logline from the patch, nor 1x/sec circuit building. Is that enough?

comment:7 Changed 2 years ago by nickm

Keywords: review-group-22 added

comment:8 Changed 2 years ago by dgoulet

Oh hey! So I had a commit in the #20657 branch that got merged two weeks ago that actually kind of does what you propose. See:

400ba2f6 N 2017-08-04 prop224: Always note down the use of internal circuit [David Goulet]

The note internal has been put in needs_hs_server_circuits().

Ok? Bad? And probably worth a change file if we keep it that way!

Version 0, edited 2 years ago by dgoulet (next)

comment:9 Changed 2 years ago by dgoulet

Status: needs_reviewneeds_information

comment:10 Changed 2 years ago by mikeperry

dgoulet - your version of the patch looks fine. I have no idea how to tell if the behavior is OK for v3 services though. I'm also not sure what you're saying. Is prediction disabled for them?

comment:11 Changed 2 years ago by dgoulet

Resolution: fixed
Status: needs_informationclosed

Ack. We are done here then.

comment:12 Changed 2 years ago by mikeperry

Ok, dgoulet says this should be fine for v3, too.

Note: See TracTickets for help on using tickets.