Opened 8 months ago

Closed 7 months ago

Last modified 7 months ago

#9574 closed enhancement (implemented)

Process ntor create cells before tap create cells?

Reported by: arma Owned by:
Priority: normal Milestone: Tor: 0.2.4.x-final
Component: Tor Version:
Keywords: tor-relay, maybe-proposal Cc: iang, nickm, gordon@…, isis@…, tor-admin@…, norman@…
Actual Points: Parent ID: #9657
Points:

Description

Motivated by http://nsl.cs.columbia.edu/papers/2013/cellflood.raid13.pdf

A) In the attack case: If we're under attack by somebody flooding us with tap create cells, it would be nice if the ntor creates get processed before this flood. If this strategy forces them to flood us with ntor create cells instead, that raises the expense.

B) In the normal case: Since handling ntor create cells is faster than handling tap create cells anyway, we could get them out of the way earlier and improve performance even more for folks using ntor-based circuit handshakes.

We already prioritize create-fast cells in exactly this way, though the implementation here will probably be different: maybe we'll be happiest just keeping two onionskin queues, one for each type.

The only downside I can see is that it'll be harder to measure how much of a performance improvement we get from ntor creates, since we'd now be speeding them up in two ways that are hard to separate.
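To make the two-queue idea concrete, here's a minimal sketch; the queue helpers and type names are purely illustrative assumptions, not the actual onion.c code:

#include <stddef.h>

/* Illustrative sketch only -- these names are assumptions, not Tor's code. */
typedef struct onion_queue_t onion_queue_t;        /* FIFO of pending creates */
typedef struct create_request_t create_request_t;  /* one queued onionskin */

/* Hypothetical queue helpers, declared here only for the sketch. */
int queue_is_empty(const onion_queue_t *q);
create_request_t *queue_pop(onion_queue_t *q);

static onion_queue_t *ntor_queue;   /* queued ntor create requests */
static onion_queue_t *tap_queue;    /* queued TAP create requests */

/* Pick the next onionskin to hand to a free cpuworker. */
static create_request_t *
onion_next_pending(void)
{
  if (!queue_is_empty(ntor_queue))   /* ntor creates get priority... */
    return queue_pop(ntor_queue);
  if (!queue_is_empty(tap_queue))    /* ...then TAP creates, best-effort */
    return queue_pop(tap_queue);
  return NULL;
}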

Child Tickets

Attachments (3)

TICKET-9574-log (120.5 KB) - added by isis 8 months ago.
TICKET-9574-torrc (2.3 KB) - added by isis 8 months ago.
2013-09-02.tor.info.log (104.7 KB) - added by isis 8 months ago.


Change History (38)

comment:1 follow-up: Changed 8 months ago by mikeperry

  • Cc nickm added

We've begun to see this in the form of a botnet/popular software update that has decided to use Tor: https://metrics.torproject.org/users.html. Many relays are already at 100% CPU and are dropping onionskins as a result. Based on the lack of increase in users in Iran and other places where 0.2.3.x is blocked but 0.2.4.x is not, it is likely that the botnet is using 0.2.3.x and TAP.

I'm also wondering if it wouldn't be simpler just to throw our hands up on 0.2.3.x and write a patch to disable TAP entirely via a consensus parameter. I am not sure how much longer the Tor network will remain functional under this load.

comment:2 in reply to: ↑ 1 Changed 8 months ago by rransom

Replying to mikeperry:

I'm also wondering if it wouldn't be simpler just to throw our hands up on 0.2.3.x and write a patch to disable TAP entirely via a consensus parameter. I am not sure how much longer the Tor network will remain functional under this load.

That would (a) make Debian unhappy, and (b) completely break the hidden service protocol.

comment:3 Changed 8 months ago by mikeperry

Well, we would still get a considerable benefit from only applying the TAP queuing/disabling changes to CREATE cells, without hurting hidden service INTRO cells. But yeah, the Linux distros will all be sad. Pretty sure none of them ship 0.2.4.x as a stable package yet. It wouldn't be the first time we've had to make them sad, but this one would be even more sudden and unexpected for them...

I suppose it depends on if that steep growth rate continues, and what that does to the Tor network over the next few days. It will make everybody sad if circuit success rates fall to like 10% (and that itself will also start to have negative synergistic effects on the CPU overload condition, because all the clients will just keep trying to make more circuits). At that point, forcing the Linux distributions to upgrade might start to look like a favorable choice when compared to no Tor at all.

comment:4 Changed 8 months ago by arma

Please start your own ticket if you want to discuss a "disable TAP" feature/bugfix. That isn't this ticket. :)

Getting back to the original ticket, it might be interesting to only give people the "your computer is too slow!" warning when their ntor queue gets too full. Then we do tap cells in a best-effort way, in separate cpuworkers from the main thread, maybe even (in a future ticket) rate limiting how many tap cells we're willing to handle per unit time.
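Roughly, the idea would be something like this sketch of the accept/refuse decision inside onion.c (the queue-length variables and limits are assumptions, not actual code):

/* Hypothetical sketch: warn only when the ntor queue is backed up,
 * and quietly refuse excess TAP creates.  Limits are illustrative. */
static int
queue_or_refuse(uint16_t type, int ntor_queue_len, int tap_queue_len)
{
  const int max_ntor_queue = 1000;   /* illustrative limits, not Tor's */
  const int max_tap_queue = 100;

  if (type == ONION_HANDSHAKE_TYPE_NTOR && ntor_queue_len >= max_ntor_queue) {
    log_warn(LD_GENERAL, "Your computer is too slow to handle this many "
             "circuit creation requests!");
    return -1;                       /* refuse, and complain */
  }
  if (type == ONION_HANDSHAKE_TYPE_TAP && tap_queue_len >= max_tap_queue)
    return -1;                       /* best effort only: refuse quietly */
  return 0;                          /* accept: add to the appropriate queue */
}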

comment:5 Changed 8 months ago by gmorehouse

  • Cc gordon@… added

comment:6 follow-up: Changed 8 months ago by iang

I also noticed a little while ago that the way the TAP cells are currently farmed out to cpuworkers may not be the most efficient way we could do things. There are quite a few system calls happening for each TAP cell right now; something more clever could surely be done. With ntor, the system call overhead would likely be a substantial fraction of the total processing time. But, as above, this would be a separate ticket.

comment:7 Changed 8 months ago by gmorehouse

I'd just like to chime in with the fact that, running 0.2.3.x a couple months ago on a Raspberry Pi, I'd see transient "circuit creation storms" characterized by several thousand "your computer is too slow to handle this many circuit creation attempts" messages suppressed as duplicates within one second in the logs. The Pi is a low-resource machine with a slow processor. After later upgrading to Tor 0.2.4.x, this decreased considerably and Tor used less CPU in general; but since this DDoS-like activity started, the Pi has been acting like a canary in a coal mine. It actually crashed for the first time (out-of-memory killed) last night on 0.2.4.16-rc. Meanwhile I've barely seen a ripple (at least big enough to warrant any logging; circuits are up) on my VPS relays.

This makes me wonder two things.

  1. Was I seeing a "test run" a couple months back on the Pi running 0.2.3.x? Or was that "normal" activity?
  2. Wouldn't thousands of "too slow" messages per second, if occurring under "normal" (though suboptimal) network conditions and with a reasonable MaxAdvertisedBandwidth on a 700MHz ARM chip, be considered a bug in its own right? I bring it up because Roger responded to my original questions[1] and suggested it was a known issue with the normal (though in this case suboptimal) operation of the Tor network. None of the tickets he mentioned[2][3][4] dealt with huge numbers of "too slow to handle this many creation requests" messages, though; I wonder whether any of them are points that the DDoS may be exploiting? Food for thought.

[1] https://lists.torproject.org/pipermail/tor-relays/2013-June/002184.html
[2] https://trac.torproject.org/projects/tor/ticket/3825
[3] https://trac.torproject.org/projects/tor/ticket/4862
[4] https://trac.torproject.org/projects/tor/ticket/8950

comment:8 Changed 8 months ago by arma

See my feature9574 branch for some dabbling.

It's based on maint-0.2.4, on the theory that the hypothetical DoS attack from this ticket is effectively happening right now via the botnet, and it's growing increasingly urgent to rescue our relays from CPU overload.

The branch doesn't have a changes file, could use some more refactoring, and it leaks everything at the end of the unit tests; but hopefully it is useful for somebody working on this ticket.

I'm running it on moria5 right now.

comment:9 Changed 8 months ago by arma

  • Status changed from new to needs_review

comment:10 Changed 8 months ago by arma

See my branch feature9574-with-logs for one that has info-level logs to help you track onion queue sizes.

moria5 doesn't attract enough create cells to ever queue anything, so it would be great if somebody could test this branch on a busy (cpu-overloaded) relay.

comment:11 Changed 8 months ago by arma

Ok, Sathya tested it on his relay and showed me the logs. Looks like it's behaving as expected.

So what remains is to clean up the patches as needed (including freeing memory in the unit tests), and then convince Nick that it should go into 0.2.4.

comment:12 follow-ups: Changed 8 months ago by nickm

Hi, I'm not fully here till Tuesday, but this looks important, so I'll look at this quick.

Some points from this patch:

  • I would like some extra paranoia in every function that indexes ol_list, to make sure that the list index is in range. (Log an LD_BUG and return, in other words.)
  • The code in onion_next_task is too aggressive: it does "never answer a TAP request while any ntor request is pending", which means that in practice I doubt we'll answer TAP requests at all on a busy node. Here are some other ideas we could take:
    • Always answer at least N ntor requests for every 1 TAP request, if we have both. (N=5? 10?)
    • When we have both ntor and TAP requests, choose an ntor request with probability P. (P=.8? P=.9?)
    • When we have both ntor and TAP requests, choose an ntor request unless the oldest pending TAP request is N msec older than the oldest pending ntor request. (N=???)
    • What else?


Also, does this imply that we ought to start designing a handshake with scalable client proof-of-something?


comment:13 in reply to: ↑ 12 Changed 8 months ago by isis

  • Cc isis@… added

I tested this and it seems to be working; I also briefly reviewed the patches. Attached are the torrc for that relay and some info logs, plus the events from SETEVENTS EXTENDED CIRC CLIENTS_SEEN INFO DESCCHANGED BUILDTIMEOUT_SET.

CPU went from wavering around 400% (with NumCPUs 4) to 80%-140%. I had one failure before this patch from lack of available fds, which had been ~3900 out of 4096 for several days -- now fd levels are at ~1200/4096. It's still getting CIRC FAILEDs, but I think this is from the other relays not being able to handle more circuits.

Replying to nickm:

  • The code in onion_next_task is too aggressive: it does "never answer a TAP request while any ntor request is pending", which means that in practice I doubt we'll answer TAP requests at all on a busy node.

Mine isn't the biggest exit relay (I think it's only 0.15% of the total exit capacity), but the TAP queues have stayed pretty low so far. That might change if more relays start running with these patches.

Here are some other ideas we could take:

  • Always answer at least N ntor requests for every 1 TAP request, if we have both. (N=5? 10?)
  • When we have both ntor and TAP requests, choose an ntor request with probability P. (P=.8? P=.9?)
  • When we have both ntor and TAP requests, choose an ntor request unless the oldest pending TAP request is N msec older than the oldest pending ntor request. (N=???)
  • What else?


Also, does this imply that we ought to start designing a handshake with scalable client proof-of-something?

I wanted PoW for BridgeDB, researched it a bit (see #7520:comment14), wasn't very hopeful about finding a working PoW scheme, and then phw convinced me that anything we expect a Tor client to do, an adversary can certainly do. Though I would really love to see this proven wrong.

Changed 8 months ago by isis

Changed 8 months ago by isis

comment:14 Changed 8 months ago by torland

  • Cc tor-admin@… added

comment:15 Changed 8 months ago by isis

To update: I'm now not sure how well this is working. It's definitely not doing worse, and it might be doing better. But I am once again getting warning messages on my test relay that tor's fd usage is at 90% of its maximum. The NTOR queue is pretty consistently empty, and the TAP queue seems to fluctuate around the 300-400 range. One problem which seems to have surfaced is that the EXTENDCIRCUIT cells have expired by the time they reach the front of the TAP queue, so Nick's suggestion of some sort of probabilistic prioritization of NTOR requests is probably a good idea.

9/2/2013 17:09:33 [INFO] onion_pending_add(): Circuit create request is too old; canceling due to overload.
9/2/2013 17:09:33 [INFO] onion_pending_add(): Circuit create request is too old; canceling due to overload.
9/2/2013 17:09:33 [INFO] onion_pending_add(): New create (tap). Queues now ntor=0 and tap=323.
9/2/2013 17:09:33 [INFO] channel_register(): Channel 0xa9551388 (global ID 321112) in state opening (1) registered with no identity digest
9/2/2013 17:09:33 [INFO] command_process_created_cell(): (circID 39642) unknown circ (probably got a destroy earlier). Dropping.
9/2/2013 17:09:33 [INFO] onion_pending_add(): New create (tap). Queues now ntor=0 and tap=322.
9/2/2013 17:09:33 [INFO] onion_pending_add(): New create (tap). Queues now ntor=0 and tap=321.
9/2/2013 17:09:33 [INFO] channel_tls_process_versions_cell(): Negotiated version 3 with [scrubbed]:15944; Sending cells: VERSIONS CERTS AUTH_CHALLENGE NETINFO
9/2/2013 17:09:33 [INFO] connection_edge_process_relay_cell(): end cell (misc error) dropped, unknown stream.
9/2/2013 17:09:33 [INFO] onion_pending_add(): New create (tap). Queues now ntor=0 and tap=320.
9/2/2013 17:09:33 [INFO] onion_pending_add(): New create (tap). Queues now ntor=0 and tap=319.
9/2/2013 17:09:33 [INFO] onion_pending_add(): New create (tap). Queues now ntor=0 and tap=318.

More info logs attached.

Changed 8 months ago by isis

comment:16 in reply to: ↑ 12 ; follow-up: Changed 8 months ago by mikeperry

Replying to nickm:

  • The code in onion_next_task is too aggressive: it does "never answer a TAP request while any ntor request is pending", which means that in practice I doubt we'll answer TAP requests at all on a busy node. Here are some other ideas we could take:
    • Always answer at least N ntor requests for every 1 TAP request, if we have both. (N=5? 10?)

I think this one is my favorite for KISS reasons. We should have a consensus parameter for it too, of course. I kind of think we should kill TAP with fire and set this at like 100 though, and maybe even provide a -1 or other magic value that always services ntor first.

  • When we have both ntor and TAP requests, choose an ntor request with probability P. (P=.8? P=.9?)

I loves me some stochastic codez, but this seems overkill?

  • When we have both ntor and TAP requests, choose an ntor request unless the oldest pending TAP request is N msec older than the oldest pending ntor request. (N=???)

I don't like this one because it will be hard to tune and may end up still letting through a lot of TAP cells over ntor, especially under conditions like we have now (where we have very little ntor traffic, and a ton of TAP traffic).

  • What else?

Should we add extra-info lines for how many CREATE_FAST, TAP and ntor cells each node is processing? Or is that too sensitive?

Also, does this imply that we ought to start designing a handshake with scalable client proof-of-something?

Seems like it. :/

comment:17 Changed 8 months ago by weuxel

  • Cc norman@… added

comment:18 follow-up: Changed 8 months ago by nickm

I'm also concerned that this code:

+  if (type == ONION_HANDSHAKE_TYPE_NTOR &&
+      ntor_usec / 1000 > (uint64_t)options->MaxOnionQueueDelay)
+    return 0;
+
+  if (type == ONION_HANDSHAKE_TYPE_TAP &&
+      (tap_usec + ntor_usec) / 1000 > (uint64_t)options->MaxOnionQueueDelay)
     return 0;

will lead to us just not accepting any TAP onionskins beyond the 50 that we always accept for each type.

comment:19 in reply to: ↑ 12 Changed 8 months ago by arma

Replying to nickm:

  • I would like some extra paranoia in every function that indexes ol_list, to make sure that the list index is in range. (Log an LD_BUG and return, in other words.)

Done in git commit 18ca5c52 (branch still feature9574-with-logs)

  • Always answer at least N ntor requests for every 1 TAP request, if we have both. (N=5? 10?)

Done in git commit 10ed23e9. (Ok, I didn't quite do that -- I did "pick the tiebreaker in favor of ntor 5 times out of 6". Hopefully that's close enough to the same idea.)

Also, does this imply that we ought to start designing a handshake with scalable client proof-of-something?

Maybe, if you can figure out what to prove. These clients are normal clients, so they have CPU to burn.

comment:20 Changed 8 months ago by nickm

Those look okay, I think, though 10ed23e9 needs some tests. And that 5 should really be a networkstatus parameter or something, IMO.

What do you think about the concern that I raised in comment:18 above?

comment:21 in reply to: ↑ 16 ; follow-ups: Changed 8 months ago by arma

Replying to mikeperry:

  • Always answer at least N ntor requests for every 1 TAP request, if we have both. (N=5? 10?)

I think this one is my favorite for KISS reasons. We should have a consensus parameter for it too, of course. I kind of think we should kill TAP with fire and set this at like 100 though, and maybe even provide a -1 or other magic value that always services ntor first.

Added the consensus param in git commit 2d942f348. I went with 10 ntors per tap as the default, but I'd be amenable to having us set a bigger number like 100 in the consensus for now.

I didn't implement the 'magic value' part. But values like 10000 should be plenty big.
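For illustration, the tiebreaker plus consensus parameter could look roughly like this; the parameter name comes from the branch, but the bounds, defaults, and function names here are assumptions rather than the exact code in 2d942f348:

/* Illustrative sketch of an "N ntors per tap" tiebreaker driven by a
 * consensus parameter.  Bounds and helper names below are assumptions. */
static int recently_chosen_ntors = 0;

static int
num_ntors_per_tap(void)
{
  /* Tor's consensus-parameter lookup; default/min/max here are illustrative. */
  return networkstatus_get_param(NULL, "NumNTorsPerTAP",
                                 10 /* default */, 1, 100000);
}

/* Return 1 to answer an ntor create next, 0 to answer a tap create. */
static int
prefer_ntor(int have_ntor, int have_tap)
{
  if (have_ntor &&
      (!have_tap || recently_chosen_ntors < num_ntors_per_tap())) {
    ++recently_chosen_ntors;
    return 1;
  }
  recently_chosen_ntors = 0;   /* we're about to answer a tap create */
  return 0;
}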

comment:22 in reply to: ↑ 18 Changed 8 months ago by arma

Replying to nickm:

I'm also concerned that this code:

+  if (type == ONION_HANDSHAKE_TYPE_NTOR &&
+      ntor_usec / 1000 > (uint64_t)options->MaxOnionQueueDelay)
+    return 0;
+
+  if (type == ONION_HANDSHAKE_TYPE_TAP &&
+      (tap_usec + ntor_usec) / 1000 > (uint64_t)options->MaxOnionQueueDelay)
     return 0;

will lead to us just not accepting any TAP onionskins beyond the 50 that we always accept for each type.

If our ntor queue remains substantially empty, then we'll accept tap cells about like we did before.

If our ntor queue gets pretty big, and our processing estimates indicate it will take a long time to answer them, then we *should* refuse tap cells quickly, since we're unlikely to get around to them anytime soon.

This reasoning is complicated by the new NumNTorsPerTAP consensus param though. Maybe we should do some complex math to reason about the sum of the queued onionskins times the fraction of each we plan to get to?

comment:23 follow-up: Changed 8 months ago by arma

Ok, I futzed with the math for a bit. For the TAP expected delay, assuming num_ntors_per_tap() is 5 or 10, I think it is approximately the case that you *will* have to wait for all the queued ntor cells and all the queued tap cells ahead of you to be processed before you'll be processed. If we have too many of both of them queued already, we *should* preemptively tell you to give up.

Are you thinking of the distant future where we're overwhelmed by ntor cells and the few remaining tap cells starve?

And for the NTor expected delay, we should now expect it to be the delay from queued NTor cells, plus the delay for the fraction 1/num_ntors_per_tap of the queued tap cells, assuming there are that many queued.
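Something like this sketch, assuming the interleaving stays at roughly one TAP answer per num_ntors_per_tap() ntor answers (names and the exact formula are illustrative, not the actual commit):

#include <stdint.h>

/* Hypothetical sketch: expected wait for a newly queued ntor onionskin,
 * counting the TAP cells we expect to interleave while draining the ntor
 * queue.  All names here are assumptions for illustration. */
static uint64_t
estimated_ntor_wait_usec(int ntor_queued, int tap_queued,
                         uint64_t usec_per_ntor, uint64_t usec_per_tap,
                         int ntors_per_tap)
{
  /* While answering ntor_queued ntor requests, we also expect to answer
   * about one tap request per ntors_per_tap ntor requests. */
  int taps_interleaved = ntor_queued / ntors_per_tap;
  if (taps_interleaved > tap_queued)
    taps_interleaved = tap_queued;
  return (uint64_t)ntor_queued * usec_per_ntor +
         (uint64_t)taps_interleaved * usec_per_tap;
}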

comment:24 in reply to: ↑ 21 Changed 8 months ago by mikeperry

Replying to arma:

Replying to mikeperry:

  • Always answer at least N ntor requests for every 1 TAP request, if we have both. (N=5? 10?)

I think this one is my favorite for KISS reasons. We should have a consensus parameter for it too, of course. I kind of think we should kill TAP with fire and set this at like 100 though, and maybe even provide a -1 or other magic value that always services ntor first.

Added the consensus param in git commit 2d942f348. I went with 10 ntors per tap as the default, but I'd be amenable to having us set a bigger number like 100 in the consensus for now.

I didn't implement the 'magic value' part. But values like 10000 should be plenty big.

In this commit, you actually did #define MAX_NUM_NTORS_PER_TAP 1000. If ntor is only 10X faster than tap, this is still 10/1010 = 1% of the CPU wasted on TAP if we really want to kill it. If we optimize ntor further, to say 50X faster than TAP, this is 50/1050 or 4.7% CPU wasted on TAP, even after we want it to die.
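(In general: if a TAP answer costs k times an ntor answer and we interleave one TAP per N ntors while both queues stay full, the fraction of CPU spent on TAP is roughly k / (N + k), which is where the 10/1010 and 50/1050 figures above come from.)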

comment:25 in reply to: ↑ 21 Changed 8 months ago by arma

Replying to arma:

Added the consensus param in git commit 2d942f348. I went with 10 ntors per tap as the default, but I'd be amenable to having us set a bigger number like 100 in the consensus for now.

I rebased it to 7263d9c3 to a) fix the actual default, and b) raise the max to a much bigger number.

comment:26 in reply to: ↑ 23 Changed 8 months ago by arma

Replying to arma:

And for the NTor expected delay, we should now expect it to be the delay from queued NTor cells, plus the delay for the fraction 1/num_ntors_per_tap of the queued tap cells, assuming there are that many queued.

See git commit 533adef1d:

    Be more general in calculating expected onion queue processing time
    
    Now we consider the TAP cells we'll process while draining the NTor
    queue, and vice versa.

I think for the upcoming cases we expect to see in the wild, it does simplify to "refuse a lot of tap requests if there are many ntor requests". And I'm ok with that.

comment:27 Changed 8 months ago by arma

I pushed a changes stanza too.

Nick, would you kindly evaluate this whole branch for merging now, and help make it into something you're willing to merge (you mentioned wanting unit tests for some things)?

comment:28 Changed 8 months ago by arma

  • Milestone changed from Tor: 0.2.5.x-final to Tor: 0.2.4.x-final

comment:29 in reply to: ↑ 6 ; follow-up: Changed 8 months ago by arma

Replying to iang:

I also noticed a little while ago that the way the TAP cells are currently farmed out to cpuworkers may not be the most efficient way we could do things. There are quite a few system calls happening for each TAP cell right now; something more clever could surely be done. With ntor, the system call overhead would likely be a substantial fraction of the total processing time. But, as above, this would be a separate ticket.

Ian, can you explain in some more detail? We should explore doing this in 0.2.5. If you have enough details to have a concrete suggestion, please make a new ticket out of it.

Thanks!

comment:30 Changed 8 months ago by arma

  • Parent ID set to #9657

comment:31 in reply to: ↑ 29 Changed 8 months ago by iang

Replying to arma:

Replying to iang:

I also noticed a little while ago that the way the TAP cells are currently farmed out to cpuworkers may not be the most efficient way we could do things. There are quite a few system calls happening for each TAP cell right now; something more clever could surely be done. With ntor, the system call overhead would likely be a substantial fraction of the total processing time. But, as above, this would be a separate ticket.

Ian, can you explain in some more detail? We should explore doing this in 0.2.5. If you have enough details to have a concrete suggestion, please make a new ticket out of it.

Thanks!

It's been a little bit since I last looked at that code, but from what I remember, the create cells are written to one of n sockets, where n is the number of cpuworkers. Each cpuworker reads its end of the socket to get the data out, processes it, and writes the response back to the socket.

I have a feeling the system calls involved in the write/select/read/write/select/read may be taking a nontrivial amount of time. (But I emphasize that I did not measure this in any way.) If we're threading (as opposed to forking), we'll be in the same memory space, and something with semaphore-controlled processing of a shared data structure (or something like that) is likely to be faster (but perhaps not simpler).
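For illustration, here is a minimal pthreads sketch of that shared-data-structure approach; none of these names are Tor's, it's just the mutex/condvar idea in miniature:

/* Very rough sketch: the main thread pushes onionskin jobs onto an
 * in-memory list and worker threads pull them off, so no write()/read()
 * system calls are needed per job.  Purely illustrative, not Tor code. */
#include <pthread.h>
#include <stddef.h>

typedef struct job_t {
  struct job_t *next;
  /* ... the onionskin request would live here ... */
} job_t;

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_nonempty = PTHREAD_COND_INITIALIZER;
static job_t *pending = NULL;

/* Main thread: hand a job to the workers (no syscalls on the fast path). */
static void
enqueue_job(job_t *job)
{
  pthread_mutex_lock(&queue_lock);
  job->next = pending;
  pending = job;
  pthread_cond_signal(&queue_nonempty);
  pthread_mutex_unlock(&queue_lock);
}

/* Worker thread: block until a job is available, then take it. */
static job_t *
dequeue_job(void)
{
  pthread_mutex_lock(&queue_lock);
  while (!pending)                   /* loop handles spurious wakeups */
    pthread_cond_wait(&queue_nonempty, &queue_lock);
  job_t *job = pending;
  pending = job->next;
  pthread_mutex_unlock(&queue_lock);
  return job;
}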

comment:32 follow-ups: Changed 7 months ago by nickm_mobile

Okay, I'm mostly satisfied with this as of the latest commit (e427b8368445 I think).

One issue: In the "no taps? Try ntor" case, I believe we should increment recently_chosen_ntors for correctness. Other than that, I call this merge-okay.

Iang: once I am back on the real internet and can search around better to make sure such a ticket doesn't already exist, I want to open a new ticket for that stuff. Andrea's work-in-progress code for her parallel_relay_crypto_rebase_2 branch might have something repurposable. So might some code Mark Ellzey did for his libevhtp helper lib. I think the overhead in cycles is still minor in comparison to the handshakes, but the latency could matter.

comment:33 in reply to: ↑ 32 Changed 7 months ago by arma

Replying to nickm_mobile:

In the "no taps? Try ntor" case, I believe we should increment recently_chosen_ntors for correctness. Other than that, I call this merge-okay.

I pushed to the feature9574-with-logs branch again with this fix. (Mike and I disagree that we should do it this way, but I think it doesn't matter now, and who knows, maybe Nick's distant-future crystal ball is better than mine.)

I'm going to merge it now.

comment:34 Changed 7 months ago by arma

  • Resolution set to implemented
  • Status changed from needs_review to closed

Merged. Hope I didn't screw up resolving the conflict in master.

comment:35 in reply to: ↑ 32 Changed 7 months ago by nickm

Replying to nickm_mobile:

Iang: once I am back on the real internet and can search around better to make sure such a ticket doesn't already exist, I want to open a new ticket for that stuff. Andrea's work-in-progress code for her parallel_relay_crypto_rebase_2 branch might have something repurposable. So might some code Mark Ellzey did for his libevhtp helper lib. I think the overhead in cycles is still minor in comparison to the handshakes, but the latency could matter.

Added this as #9682.
