Opened 7 years ago

Closed 7 years ago

#7291 closed enhancement (fixed)

Investigate making MaxOnionsPending time-based

Reported by: mikeperry Owned by:
Priority: High Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-relay
Cc: nickm, mo Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

A lot of relays seem to be failing circuits due to too many onionskins right now. Are there really too many? A machine with NumCPUs 8 might be able to churn through 100 onionskins pretty quickly. Perhaps we should consider making the queue limited based on time-to-process rather than a fixed count.

Nick says the first step is altering the "Your computer is too slow" log message to output a queue processing time to see how long most nodes are spending on their queues, to decide if this is worth the effort to change.

Child Tickets

Change History (11)

comment:1 Changed 7 years ago by mikeperry

Milestone: Tor: 0.2.4.x-final

I think I want to keep an eye on this and at least get the logline change deployed on the 0.2.4.x-alpha timescale. Is this small-feature or big-feature?

comment:2 Changed 7 years ago by nickm

I can't call anything small-feature definitively without seeing how the code is, but I don't think that it would take longer than an hour or two to get the logline written, assuming a sane simple implementation.

comment:3 Changed 7 years ago by mo

Cc: mo added

comment:4 Changed 7 years ago by nickm

Status: newneeds_review

I've done a branch "onionqueue_timer" to keep track of how long onions have been in the queue for before they get handed off to a cpuworker. If as I suspect the answer is something like (150 msec / NumCPUs), then we should consider a new approach.

A more reasonable approach is something like: Let T_max be the longest amount of time we're willing to queue an onionskin for. Let T_o be the time it takes to process an onionskin. Don't queue more than NumCPUs * (T_max / T_o) onionskins. Assuming a modest server like my laptop, setting T_max == 1 sec, NumCPUs == 4, and T_o == 2.2 msec, that gives something more like 1800 as a good limit for a modest server.

Having both ntor and TAP onionskins will complicate this a little, but not by too much.

If we start to recommend high values for MaxOnionsPending, we should look closely at the code for onion_pending_remove: It does a linear search over the queue, which is probably not so good if we're looking forward to a time of super-fast servers with leneant timeouts and a large volume of ntor traffic. (T_max = 2 s, NumCPUS == 16, T_o == 150 usec => Wow, would we even ever want to allow that?)

comment:5 Changed 7 years ago by nickm

Priority: normalmajor

Calling this major, because ntor servers will be very sad without it.

comment:6 Changed 7 years ago by mo

tor0 (notices0): MaxOnionsPending 250
tor1 (notices1): MaxOnionsPending 1000
tor2 (notices2): MaxOnionsPending 5000

Xeon E3 1230v2 Quad 3.3 Ghz (NumCPU 8), AES-NI, 8GB RAM, 1Gbit/s connection, at the moment pushing ~500 Mbit/s per direction (in sum of all three Tor instances)

notices0.log:Dec 17 10:25:03.000 [notice] Heartbeat: We have enqueued 1600521 onions, and rejected 130999 onions because our queue was too full. Of the enqueued onions, 1579288 were given to cpuworker after an average of 1510 msec, and 22015 expired after an average of 5020 msec.
notices1.log:Dec 17 10:25:04.000 [notice] Heartbeat: We have enqueued 1234061 onions, and rejected 0 onions because our queue was too full. Of the enqueued onions, 1173897 were given to cpuworker after an average of 1694 msec, and 63648 expired after an average of 5014 msec.
notices2.log:Dec 17 10:25:03.000 [notice] Heartbeat: We have enqueued 1441904 onions, and rejected 0 onions because our queue was too full. Of the enqueued onions, 1290844 were given to cpuworker after an average of 2231 msec, and 150854 expired after an average of 5013 msec.

comment:7 in reply to:  4 Changed 7 years ago by mikeperry

Replying to nickm:

If we start to recommend high values for MaxOnionsPending, we should look closely at the code for onion_pending_remove: It does a linear search over the queue, which is probably not so good if we're looking forward to a time of super-fast servers with leneant timeouts and a large volume of ntor traffic. (T_max = 2 s, NumCPUS == 16, T_o == 150 usec => Wow, would we even ever want to allow that?)

FWIW, on a good day, the circuit build timeout can get as low as 2-3 seconds. I don't think I've ever had a client get above 10 seconds. This makes me think a queue of 2 seconds is probably too long, especially if it means a ton more additional onionskins queued anyway.

Another way to approach this might be to work backwards and determine how long 100 onionskins would take on various NumCPU=1 servers today (or just estimate). Then you can set the limits for T_max to one of these values, which would leave things unchanged for today's handshake on single core servers but improve things for ntor and multicore servers?

comment:8 Changed 7 years ago by nickm

I've got an incomplete patch series in time_based_onionqueue, in need of review. It is less huge than it seems, since it is based on the ntor branch. See comments and commit messages there for more info.

The "incomplete" part is:

  • We need to handle jumping forward in time more gracefully. In 0.2.5 or something we might grow a monotonic clock, but right now, jumping in time really needs to keep us from believing that NOW - TIME_WHEN_WE_STARTED_X is the amount of time that X took. This problem is especially ouchy for the onionskin timing code.
  • I want a message that measures overhead -- how much time is used up by sending and receiving an onionskin from a cpuworker on top of its real processing time. I hope this is low, but you never know.

comment:9 Changed 7 years ago by nickm

Now that the ntor code is merged, this patch series now lives in a branch "time_based_onionqueue_v2" based on master.

I have added code to measure and report cpuworker overhead. Still need a fix for the clock-jumping thing.

comment:10 Changed 7 years ago by nickm

Okay, I added a brute-force solution for the clock-jumping thing. I'm now happy with this branch.

comment:11 Changed 7 years ago by andrea

Resolution: fixed
Status: needs_reviewclosed

This branch looks fine to me; I'll go ahead and merge.

Note: See TracTickets for help on using tickets.