A) In the attack case: If we're under attack by somebody flooding us with TAP create cells, it would be nice if the ntor creates got processed before that flood. If this strategy forces the attacker to flood us with ntor create cells instead, that raises the cost of the attack.
B) In the normal case: Since handling ntor create cells is faster than handling TAP create cells anyway, we could get them out of the way earlier and improve performance even more for folks using ntor-based circuit handshakes.
We already prioritize create-fast cells in exactly this way, though the implementation here will probably look different: maybe we'll be happiest just keeping two onionskin queues, one for each handshake type.
The only downside I can see is that it'll be harder to measure how much of a performance improvement we get from ntor creates, since now we speed it up in two ways that are hard to separate.
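For concreteness, here is a minimal sketch of what "two onionskin queues, one per handshake type" could look like. The names, struct layout, and strict ntor-first pop are illustrative assumptions, not the actual onion.c code:

```c
/* Hypothetical sketch of per-handshake-type onionskin queues; names and
 * types are illustrative only, not the tor source. */
#include <stddef.h>

enum handshake_type { HANDSHAKE_TAP = 0, HANDSHAKE_NTOR = 1, HANDSHAKE_MAX = 2 };

typedef struct pending_onionskin {
  struct pending_onionskin *next;
  enum handshake_type type;
  /* ... opaque create-cell payload would live here ... */
} pending_onionskin_t;

typedef struct onionskin_queue {
  pending_onionskin_t *head, *tail;
  size_t len;
} onionskin_queue_t;

static onionskin_queue_t queues[HANDSHAKE_MAX];

/* Append a request to the queue matching its handshake type. */
static void
queue_push(pending_onionskin_t *req)
{
  onionskin_queue_t *q = &queues[req->type];
  req->next = NULL;
  if (q->tail)
    q->tail->next = req;
  else
    q->head = req;
  q->tail = req;
  q->len++;
}

/* Strict ntor-first policy: drain the ntor queue before touching TAP. */
static pending_onionskin_t *
queue_pop_next(void)
{
  onionskin_queue_t *q =
    queues[HANDSHAKE_NTOR].len ? &queues[HANDSHAKE_NTOR] : &queues[HANDSHAKE_TAP];
  pending_onionskin_t *req = q->head;
  if (!req)
    return NULL;
  q->head = req->next;
  if (!q->head)
    q->tail = NULL;
  q->len--;
  return req;
}
```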
We've begun to see this in the form of a botnet/popular software update that has decided to use Tor: https://metrics.torproject.org/users.html. Many relays are already at 100% CPU and are dropping onionskins as a result. Based on the lack of increase in users in Iran and other places where 0.2.3.x is blocked but 0.2.4.x is not, it is likely that the botnet is using 0.2.3.x and TAP.
I'm also wondering if it wouldn't be simpler just to throw our hands up on 0.2.3.x and write a patch to disable TAP entirely via a consensus parameter. I am not sure how much longer the Tor network will remain functional under this load.
That would (a) make Debian unhappy, and (b) completely break the hidden service protocol.
Well, we would still get a considerable benefit from applying the TAP queuing/disabling changes only to CREATE cells, without hurting hidden service INTRO cells. But yeah, the Linux distros will all be sad. Pretty sure none of them ship 0.2.4.x as a stable package yet. Wouldn't be the first time we've had to make them sad, but this one would be even more sudden and unexpected for them...
I suppose it depends on whether that steep growth rate continues, and what that does to the Tor network over the next few days. It will make everybody sad if circuit success rates fall to something like 10% (and that will itself compound the CPU overload, because all the clients will just keep trying to make more circuits). At that point, forcing the Linux distributions to upgrade might start to look like a favorable choice compared to no Tor at all.
Please start your own ticket if you want to discuss a "disable TAP" feature/bugfix. That isn't this ticket. :)
Getting back to the original ticket, it might be interesting to only give people the "your computer is too slow!" warning when their ntor queue gets too full. Then we handle TAP cells in a best-effort way, in separate cpuworkers from the main thread, maybe even (in a future ticket) rate-limiting how many TAP cells we're willing to handle per unit time.
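If we ever do that rate-limiting, a simple token bucket would probably suffice. The sketch below is a self-contained illustration with made-up names and a made-up per-second budget, not a proposed patch:

```c
/* Hypothetical token-bucket limiter for TAP onionskin processing;
 * the names and the 100/sec figure are assumptions for illustration. */
#include <time.h>

#define TAP_TOKENS_PER_SEC 100   /* assumed budget of TAP handshakes per second */
#define TAP_BUCKET_MAX     200   /* allow a short burst */

static double tap_tokens = TAP_BUCKET_MAX;
static time_t tap_last_refill = 0;

/* Return 1 if we may process another TAP create cell now, 0 to defer it. */
static int
tap_rate_limit_ok(time_t now)
{
  if (tap_last_refill == 0)
    tap_last_refill = now;
  if (now > tap_last_refill) {
    tap_tokens += (double)(now - tap_last_refill) * TAP_TOKENS_PER_SEC;
    if (tap_tokens > TAP_BUCKET_MAX)
      tap_tokens = TAP_BUCKET_MAX;
    tap_last_refill = now;
  }
  if (tap_tokens >= 1.0) {
    tap_tokens -= 1.0;
    return 1;
  }
  return 0;
}
```

A relay would check something like tap_rate_limit_ok(time(NULL)) before dequeueing a TAP onionskin, and otherwise leave the request queued (or drop it if the queue is too old or too long).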
I also noticed a little while ago that the way the TAP cells are currently farmed out to cpuworkers may not be the most efficient way we could do things. There are quite a few system calls happening for each TAP cell right now; something more clever could surely be done. With ntor, the system call overhead would likely be a substantial fraction of the total processing time. But, as above, this would be a separate ticket.
I'd just like to chime in with the fact that, running 0.2.3.x a couple of months ago on a Raspberry Pi, I'd see transient "circuit creation storms": several thousand "your computer is too slow to handle this many circuit creation attempts" messages per second suppressed as duplicates in the logs. The Pi is a low-resource machine with a slow processor. After upgrading to Tor 0.2.4.x, this decreased a lot and Tor used less CPU in general; but since this DDoS-like activity started, the Pi has been acting like a canary in a coal mine. It actually crashed for the first time (out-of-memory killed) last night on 0.2.4.16-rc. Meanwhile I've barely seen a ripple (at least not one big enough to warrant any logging; circuits are up) on my VPS relays.
This makes me wonder two things:
* Was I seeing a "test run" a couple of months back on the Pi running 0.2.3.x? Or was that "normal" activity?
* Wouldn't thousands of "too slow" messages per second, if they occur under "normal" (though suboptimal) network conditions with a reasonable MaxAdvertisedBandwidth on a 700MHz ARM chip, be considered a bug in their own right? I bring it up because Roger responded to my original questions[1] and suggested it was a known issue with the normal (though in this case suboptimal) operation of the Tor network. None of the tickets he mentioned dealt with huge numbers of "too slow to handle this many creation requests" messages, though; I do wonder whether any of those tickets[2][3][4] touch on weaknesses that the DDoS may be exploiting. Food for thought.
It's based on maint-0.2.4, on the theory that the hypothetical DoS attack from this ticket is occurring right now via the botnet, and it's growing increasingly urgent to rescue our relays from CPU overload.
The branch doesn't have a changes file, could use some more refactoring, and it leaks everything at the end of the unit tests; but hopefully it is useful for somebody working on this ticket.
See my branch feature9574-with-logs for one that has info-level logs to help you track onion queue sizes.
moria5 doesn't attract enough create cells to ever queue anything, so it would be great if somebody could test this branch on a busy (cpu-overloaded) relay.
Hi, I'm not fully here till Tuesday, but this looks important, so I'll take a quick look.
Some points on this patch:
I would like some extra paranoia in every function that indexes ol_list, to make sure that the list index is in range. (Log an LD_BUG and return, in other words.)
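A self-contained sketch of the kind of range check being asked for, assuming ol_list is an array indexed by handshake type as in the branch under review; the constant, the opaque queue type, and the logging macro below are stand-ins, where the real code would use log_warn(LD_BUG, ...):

```c
/* Sketch only: the MAX_ONION_HANDSHAKE_TYPE value, the opaque type, and
 * LOG_BUG are assumptions for illustration, not the actual tor source. */
#include <stdio.h>
#include <stdint.h>

#define MAX_ONION_HANDSHAKE_TYPE 1  /* assumed: 0 = TAP, 1 = ntor */

typedef struct onion_queue_t onion_queue_t;  /* opaque for this sketch */
static onion_queue_t *ol_list[MAX_ONION_HANDSHAKE_TYPE + 1];

/* Stand-in for Tor's log_warn(LD_BUG, ...). */
#define LOG_BUG(fmt, ...) fprintf(stderr, "BUG: " fmt "\n", __VA_ARGS__)

/* Every function that indexes ol_list would go through a check like this,
 * logging a bug and bailing out instead of reading out of range. */
static onion_queue_t *
ol_list_get(uint16_t handshake_type)
{
  if (handshake_type > MAX_ONION_HANDSHAKE_TYPE) {
    LOG_BUG("handshake type %u out of range for ol_list",
            (unsigned)handshake_type);
    return NULL;
  }
  return ol_list[handshake_type];
}
```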
The code in onion_next_task is too aggressive: it does "never answer a TAP request while any ntor request is pending", which means that in practice I doubt we'll answer TAP requests at all on a busy node. Here are some other ideas we could take:
* Always answer at least N ntor requests for every 1 TAP request, if we have both. (N=5? 10?)
* When we have both ntor and TAP requests, choose an ntor request with probability P. (P=.8? P=.9?)
* When we have both ntor and TAP requests, choose an ntor request unless the oldest pending TAP request is N msec older than the oldest pending ntor request. (N=???)
* What else?
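The first option in that list (serve at least N ntor requests per TAP request) is easy to express. This is only a sketch with an assumed N and hypothetical names, not a patch:

```c
/* Sketch of the "answer N ntor requests for every 1 TAP request" policy;
 * NTOR_PER_TAP is an assumed placeholder value. */
#include <stddef.h>

#define NTOR_PER_TAP 10  /* assumed ratio: up to 10 ntor answers per 1 TAP */

static unsigned ntor_served_since_tap = 0;

/* Decide which queue to draw from next. Returns 1 for ntor, 0 for TAP.
 * Unlike strict ntor priority, a waiting TAP request gets a turn after
 * NTOR_PER_TAP ntor answers in a row, so TAP is never fully starved. */
static int
choose_ntor_next(size_t ntor_queue_len, size_t tap_queue_len)
{
  if (ntor_queue_len == 0)
    return 0;                       /* only TAP work available */
  if (tap_queue_len == 0)
    return 1;                       /* only ntor work available */
  if (ntor_served_since_tap >= NTOR_PER_TAP) {
    ntor_served_since_tap = 0;      /* TAP's turn */
    return 0;
  }
  ntor_served_since_tap++;
  return 1;
}
```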
Also, does this imply that we ought to start designing a handshake with scalable client proof-of-something?
I tested this and it seems to be working; I also briefly reviewed the patches. Attached are the torrc for that relay, some info logs, and the events from SETEVENTS EXTENDED CIRC CLIENTS_SEEN INFO DESCCHANGED BUILDTIMEOUT_SET.
CPU went from wavering around 400% (with NumCPUs 4) to 80%-140%. Before this patch I had one failure from lack of available fds, which had been at ~3900 out of 4096 for several days; now fd levels are at ~1200/4096. The relay is still getting CIRC FAILED events, but I think that's because the other relays can't handle more circuits.
> The code in onion_next_task is too aggressive: it does "never answer a TAP request while any ntor request is pending", which means that in practice I doubt we'll answer TAP requests at all on a busy node.
Mine isn't the biggest exit relay (I think it's only 0.15% of the total exit capacity), but the TAP queues so far have stayed pretty low. That might change if more relays start running with these patches.
> Here are some other ideas we could take:
> * Always answer at least N ntor requests for every 1 TAP request, if we have both. (N=5? 10?)
> * When we have both ntor and TAP requests, choose an ntor request with probability P. (P=.8? P=.9?)
> * When we have both ntor and TAP requests, choose an ntor request unless the oldest pending TAP request is N msec older than the oldest pending ntor request. (N=???)
> * What else?
>
> Also, does this imply that we ought to start designing a handshake with scalable client proof-of-something?
I wanted PoW for BridgeDB, researched it a bit (see #7520:comment14), wasn't very hopeful about finding a working PoW scheme, and then phw convinced me that anything we expect a Tor client to do, an adversary can certainly do. Though I would really love to see this proven wrong.
To update: now I'm not sure how well this is working. It's definitely not doing worse, and it might be doing better. But I am once again getting warning messages on my test relay that tor's fd usage is at 90% of its maximum. The ntor queue is pretty consistently empty, and the TAP queue seems to fluctuate around the 300-400 range. One problem that seems to have surfaced is that the EXTENDCIRCUIT cells have expired by the time they reach the front of the TAP queue, so Nick's idea of some sort of probabilistic prioritization of ntor requests is probably a good one.
```
9/2/2013 17:09:33 [INFO] onion_pending_add(): Circuit create request is too old; canceling due to overload.
9/2/2013 17:09:33 [INFO] onion_pending_add(): Circuit create request is too old; canceling due to overload.
9/2/2013 17:09:33 [INFO] onion_pending_add(): New create (tap). Queues now ntor=0 and tap=323.
9/2/2013 17:09:33 [INFO] channel_register(): Channel 0xa9551388 (global ID 321112) in state opening (1) registered with no identity digest
9/2/2013 17:09:33 [INFO] command_process_created_cell(): (circID 39642) unknown circ (probably got a destroy earlier). Dropping.
9/2/2013 17:09:33 [INFO] onion_pending_add(): New create (tap). Queues now ntor=0 and tap=322.
9/2/2013 17:09:33 [INFO] onion_pending_add(): New create (tap). Queues now ntor=0 and tap=321.
9/2/2013 17:09:33 [INFO] channel_tls_process_versions_cell(): Negotiated version 3 with [scrubbed]:15944; Sending cells: VERSIONS CERTS AUTH_CHALLENGE NETINFO
9/2/2013 17:09:33 [INFO] connection_edge_process_relay_cell(): end cell (misc error) dropped, unknown stream.
9/2/2013 17:09:33 [INFO] onion_pending_add(): New create (tap). Queues now ntor=0 and tap=320.
9/2/2013 17:09:33 [INFO] onion_pending_add(): New create (tap). Queues now ntor=0 and tap=319.
9/2/2013 17:09:33 [INFO] onion_pending_add(): New create (tap). Queues now ntor=0 and tap=318.
```
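For reference, the probabilistic variant suggested earlier (pick ntor with probability P when both queues are non-empty) could look roughly like this; P and the use of rand() are placeholders, since real tor code would use its own randomness helpers:

```c
/* Sketch of the probabilistic selection option; the value of P and the
 * RNG are assumptions for illustration only. */
#include <stddef.h>
#include <stdlib.h>

#define NTOR_CHOICE_PROBABILITY 0.9  /* assumed value of P */

/* Returns 1 to serve an ntor request next, 0 to serve a TAP request. */
static int
choose_ntor_probabilistically(size_t ntor_queue_len, size_t tap_queue_len)
{
  if (ntor_queue_len == 0)
    return 0;
  if (tap_queue_len == 0)
    return 1;
  return ((double)rand() / (double)RAND_MAX) < NTOR_CHOICE_PROBABILITY;
}
```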