Use a better set of comparison/evaluation functions for deciding which connections to kill when OOS

changed milestone to %Tor: unspecified

added 034-removed-20180328 034-triage-20180328 component::core tor/tor milestone::Tor: unspecified points::2 priority::high severity::normal sockets status::new tor-dos type::defect labels

Triaged out in December 2016 from 030 to 031.

Trac:
Keywords: N/A deleted, triage-out-030-201612 added
Milestone: Tor: 0.3.0.x-final to Tor: 0.3.1.x-final

Trac:
Owner: N/A to nickm
Status: new to accepted

Trac:
Points: 1 to 2

Lower priority on some of my assigned tickets

Trac:
Priority: Medium to Low

So, what's the best strategy here? We'd like to emphasize connections that are getting lots of usage, but only real usage. The existing code kills whatever OR connections have the fewest circuits, and leaves everything else alone. But if DirPort is open, or if we're an exit, that can be really bad.

My first thought was to treat directory server connections and exit connections as if they had one circuit each, and then to rank them by number of circuits along with the OR connections. But maybe that's vulnerable too? An attacker could just start a bunch of clients, open two circuits from each, and get an exit to kill off all its exit connections. Probably not so good.
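To make the weakness concrete, here is a minimal sketch of that ranking. The `Conn` record, its fields, and `pick_victims` are hypothetical names for illustration only and are not Tor's actual data structures or OOS code.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    # Hypothetical stand-in for a connection record; not Tor's connection_t.
    name: str
    kind: str            # "or", "dir", or "exit"
    n_circuits: int = 0  # only meaningful for OR connections


def effective_circuits(conn):
    # The idea from the comment: dir and exit connections count as one circuit.
    if conn.kind in ("dir", "exit"):
        return 1
    return conn.n_circuits


def pick_victims(conns, n_to_kill):
    # Fewest (effective) circuits die first. sorted() is stable, so ties
    # keep their original relative order.
    return sorted(conns, key=effective_circuits)[:n_to_kill]


conns = [
    Conn("or-busy", "or", 5),
    Conn("or-attacker-1", "or", 2),
    Conn("or-attacker-2", "or", 2),
    Conn("exit-1", "exit"),
    Conn("dir-1", "dir"),
    Conn("or-idle", "or", 0),
]
victims = pick_victims(conns, 3)
print([c.name for c in victims])
```

In this toy input the attacker's two-circuit OR connections outrank the exit and directory connections, so the latter are killed before any attacker connection is touched.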

Should we look at last-written time, or queue age, or something else? There may be cleverness needed.

Trac:
Keywords: N/A deleted, 031-reach added

I don't think these are happening in 0.3.1

Trac:
Milestone: Tor: 0.3.1.x-final to Tor: 0.3.2.x-final

Trac:
Keywords: triage-out-030-201612 deleted, N/A added

Trac:
Keywords: 031-reach deleted, N/A added

Trac:
Sponsor: N/A to SponsorV-can

Trac:
Milestone: Tor: 0.3.2.x-final to Tor: 0.3.3.x-final

normally, one would use IP reputation to deal with spamming attacks. however, for obvious reasons, I can see why that might be frowned upon in these circles.

therefore, some other unfalsifiable proof of work is required. one could implement a custom proof-of-work protocol, but it seems more useful to me to measure the bandwidth used. this incurs negligible overhead for legitimate users, but has the added benefit that attackers are forced to encrypt their data in order to increase their bandwidth usage. additionally, if attackers have vastly more bandwidth than you, they can simply mount a traditional DoS attack anyways.

tl;dr just sort connections by recently used valid data traffic.
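A rough sketch of that ordering, assuming each connection carries a counter of valid bytes seen in some recent window plus a last-activity timestamp. Both fields, and the names `ConnStats` and `oos_kill_order`, are assumptions for illustration and not existing Tor fields or functions.

```python
from dataclasses import dataclass


@dataclass
class ConnStats:
    name: str
    recent_valid_bytes: int  # hypothetical: valid traffic in a recent window
    last_activity: float     # hypothetical: timestamp of last valid traffic


def oos_kill_order(conns):
    # Connections with the least recent valid traffic come first;
    # ties are broken by the oldest last-activity time.
    return sorted(conns, key=lambda c: (c.recent_valid_bytes, c.last_activity))


conns = [
    ConnStats("client-a", 48_000, 1000.0),
    ConnStats("idle-1", 0, 900.0),
    ConnStats("idle-2", 0, 950.0),
    ConnStats("client-b", 512, 990.0),
]
order = [c.name for c in oos_kill_order(conns)]
print(order)
```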

directory connections fare poorly under this metric, but:

- if the connection is legitimate, there will be data flowing down it soon after fetching directory information anyways. works better with the new ORPort-only architecture, but for legacy clients I guess we could just sum together the bandwidth used by an IP address and use that somehow.
- AIUI directory connections are only absolutely necessary during the very first startup. at any later time, if a directory connection cannot be made or is suddenly terminated, cached data can be temporarily used until a connection can be re-established. therefore, prematurely terminating directory connections is not a huge problem, and is much better than rejecting new connections which may require relay service.

raising priority as discussed on IRC.

Trac:
Priority: Low to High

I wonder if we should try for a load-balancing metric instead.

A recent tor-relays thread discovered that limiting the connections from each IPv4 /16 resolves this issue: https://lists.torproject.org/pipermail/tor-relays/2017-December/013776.html

The equivalent IPv6 netblock would be a /32, the minimum regional internet registry allocation block size.

We could identify the /16s or /32s with the largest numbers of connections, and kill those first, using one of the other "usefulness" heuristics.
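As an illustration of the grouping step, the sketch below buckets peer addresses into IPv4 /16s and IPv6 /32s using Python's standard `ipaddress` module and lists the busiest blocks first. The function names and sample addresses are made up for the example; how to pick victims within a block is left to whichever "usefulness" heuristic is chosen.

```python
import ipaddress
from collections import Counter


def netblock(addr):
    # Map an address to its /16 (IPv4) or /32 (IPv6) network.
    ip = ipaddress.ip_address(addr)
    prefix = 16 if ip.version == 4 else 32
    # strict=False masks off the host bits instead of raising ValueError.
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)


def busiest_netblocks(addrs):
    # Returns (network, connection_count) pairs, largest count first.
    return Counter(netblock(a) for a in addrs).most_common()


peers = [
    "198.51.100.7",
    "198.51.7.9",
    "198.51.200.1",
    "203.0.113.5",
    "2001:db8::1",
    "2001:db8:ffff::2",
]
top = busiest_netblocks(peers)
for net, count in top:
    print(net, count)
```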

Replying to teor:

> I wonder if we should try for a load-balancing metric instead.
>
> A recent tor-relays thread discovered that limiting the connections from each IPv4 /16 resolves this issue: https://lists.torproject.org/pipermail/tor-relays/2017-December/013776.html
>
> The equivalent IPv6 netblock would be a /32, the minimum regional internet registry allocation block size.
>
> We could identify the /16s or /32s with the largest numbers of connections, and kill those first, using one of the other "usefulness" heuristics.

I think that might work, but I don't see why that would be any better than using only bandwidth consumed. In fact, I think that would have the same issue that I mentioned on the list of overkilling NATed clients, potentially the ones most in need of anonymity! IIRC, Cleanfeed is known to proxy all connections through a small number of IPs; I wouldn't be surprised if China, Iran and company did the same.

Filtering based on bandwidth used is reputation-neutral, has zero false positives, and has near-zero added cost.

in other words, using netblocks alone, it's impossible to distinguish between "real" clients behind some mobile network's carrier-grade NAT and a bunch of regular clients on a VPS somewhere.

hm... upon further consideration though, perhaps it would be possible to use a memory-hard proof of work algorithm here. even phones under $100 have at least 2 GB of RAM, so completing an occasional 1 GB POW should only momentarily slow the device. should be easy on battery life too, unlike a CPU POW. I did a quick calculation: an attacker would need s = c * f * n, where s is the required server RAM, c is the challenge difficulty (memory held times the time it is held), f is the challenge frequency, and n is the number of connections to be held. if c = 1 GB * 3 sec, f = 1 per 10 min, and n = 200, then s = 1 GB, or around $5/month per 200 connections, which seems sufficiently expensive to deter this particular attack.

however, there are a number of downsides to this plan. it requires additional protocol design (time which could be spent doing something else, like IPv6 support), and I hear the iOS Tor people are limited to 15 MB, so even if the device has 10 GB of RAM that won't help. I figure "you must reopen the Tor app every ten minutes to maintain your connection" is not a good solution.
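The RAM estimate in this comment can be checked with a few lines of arithmetic. The function name and parameter names below are made up for the example; the numbers are the ones given in the comment.

```python
def attacker_ram_gb(challenge_gb, challenge_seconds, interval_seconds, n_connections):
    # s = c * f * n, with c = challenge_gb * challenge_seconds (GB-seconds)
    # and f = 1 / interval_seconds. Written as one division to keep the
    # floating-point arithmetic simple.
    return challenge_gb * challenge_seconds * n_connections / interval_seconds


# 1 GB held for 3 seconds, once every 10 minutes, for 200 connections.
s = attacker_ram_gb(challenge_gb=1, challenge_seconds=3,
                    interval_seconds=10 * 60, n_connections=200)
print(f"average RAM needed: {s:.2f} GB")
```

This is an average: it assumes the attacker staggers the challenges perfectly, so that only one connection's proof of work occupies memory at any moment.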

hm... perhaps we could use both: clients that require long-running connections for things like IRC must submit proofs of work (either CPU or memory), and iOS clients just have to live with occasionally re-establishing their connections if the relay is under DoS.

regardless, in the short to medium term, we should probably implement the bandwidth method. this latest surge in fake clients has noticeably increased the number of "users": https://metrics.torproject.org/userstats-relay-country.html?start=2017-12-01&end=2017-12-15&country=all&events=on. fortunately, performance has not yet been seriously affected, but it is plausible that they will get more servers online and the curve will continue going up.

in terms of implementation, is it correct that right now, only OR connections count recent traffic?

Mark a lot of assigned/needs_revision tickets as 0.3.4. If you think this should happen in 0.3.3 instead, just let me know?

Trac:
Milestone: Tor: 0.3.3.x-final to Tor: 0.3.4.x-final

Rename keyword "dos" to "tor-dos"

Trac:
Keywords: dos deleted, tor-dos added
