Opened 3 years ago

Last modified 14 months ago

#19984 accepted defect

Use a better set of comparison/evaluation functions for deciding which connections to kill when OOS

Reported by: nickm
Owned by: nickm
Priority: High
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: sockets, tor-dos, 034-triage-20180328, 034-removed-20180328
Cc: andrea
Actual Points:
Parent ID:
Points: 2
Reviewer:
Sponsor: SponsorV-can

Description

Our existing OOS code kills low-priority OR connections. But really, we need to look at all connections that an adversary might be able to create (especially dir and exit connections), or else an adversary will be able to open a bunch of those, and force us to kill as many OR connections as they want.

This problem is the reason that DisableOOSCheck is now on-by-default.

Child Tickets

Ticket | Status | Owner | Summary | Component
#925 | new | | Tor fails badly when accept(2) returns EMFILE or ENFILE | Core Tor/Tor

Change History (21)

comment:1 Changed 2 years ago by dgoulet

Keywords: triage-out-030-201612 added
Milestone: Tor: 0.3.0.x-final → Tor: 0.3.1.x-final

Triaged out in December 2016 from 030 to 031.

comment:2 Changed 2 years ago by nickm

Owner: set to nickm
Status: new → accepted

comment:3 Changed 2 years ago by nickm

Points: 1 → 2

comment:4 Changed 2 years ago by nickm

Priority: Medium → Low

Lower priority on some of my assigned tickets

comment:5 Changed 2 years ago by nickm

So, what's the best strategy here? We'd like to emphasize connections that are getting lots of usage, but only real usage. The existing code kills whatever OR connections have the fewest circuits, and leaves everything else alone. But if DirPort is open, or if we're an exit, that can be really bad.

My first thought was to treat directory server and exit connections as if they each had one circuit, and then to rank them by number of circuits along with the OR connections. But maybe that's vulnerable too? An attacker could just start a bunch of clients, open two circuits from each, and get an exit to kill off all its exit connections. Probably not so good.

Should we look at last-written time, or queue age, or something else? There may be cleverness needed.
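
A minimal sketch of the "count dir and exit connections as one circuit" ranking, and of the weakness described above. This is illustrative Python, not Tor's C code; the `Conn` class, its fields, and `pick_victims` are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    name: str            # hypothetical label, for the example only
    kind: str            # "or", "dir", or "exit"
    n_circuits: int = 0  # only meaningful for OR connections


def effective_circuits(conn):
    # Idea from the comment: dir and exit connections count as one circuit.
    return conn.n_circuits if conn.kind == "or" else 1


def pick_victims(conns, n_to_kill):
    # Kill the connections with the fewest (effective) circuits first.
    return sorted(conns, key=effective_circuits)[:n_to_kill]


conns = [
    Conn("or-busy", "or", 40),
    Conn("or-attacker-1", "or", 2),  # attacker opens two circuits per client
    Conn("or-attacker-2", "or", 2),
    Conn("exit-stream", "exit"),
    Conn("dir-fetch", "dir"),
    Conn("or-idle", "or", 0),
]

victims = [c.name for c in pick_victims(conns, 3)]
print(victims)
```

With these made-up inputs the exit and directory connections are chosen before either two-circuit attacker connection, which is the problem the comment points out.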

comment:6 Changed 2 years ago by nickm

Keywords: 031-reach added

comment:7 Changed 2 years ago by nickm

Milestone: Tor: 0.3.1.x-final → Tor: 0.3.2.x-final

I don't think these are happening in 0.3.1

comment:8 Changed 2 years ago by nickm

Keywords: triage-out-030-201612 removed

comment:9 Changed 2 years ago by nickm

Keywords: 031-reach removed

comment:10 Changed 23 months ago by nickm

Sponsor: SponsorV-can

comment:11 Changed 21 months ago by nickm

Milestone: Tor: 0.3.2.x-final → Tor: 0.3.3.x-final

comment:12 Changed 18 months ago by Hello71

normally, one would use IP reputation to deal with spamming attacks. however, for obvious reasons, I can see why that might be frowned upon in these circles.

therefore, some other unfalsifiable proof of work is required. one could implement a custom proof-of-work protocol, but it seems more useful to me to measure the bandwidth used. this incurs negligible overhead for legitimate users, but has the added benefit that attackers are forced to encrypt their data in order to increase their bandwidth usage. additionally, if attackers have vastly more bandwidth than you, they can simply mount a traditional DoS attack anyways.

tl;dr just sort connections by recently used valid data traffic.
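
One way to read "recently used valid data traffic" is a per-connection byte counter that decays over time. The sketch below assumes that reading; the 60-second half-life, the `TrafficConn` class, and `pick_victims_by_traffic` are hypothetical and not taken from Tor.

```python
from dataclasses import dataclass

HALF_LIFE = 60.0  # seconds; an assumed value, chosen only for illustration


@dataclass
class TrafficConn:
    name: str
    score: float = 0.0        # decayed count of valid bytes
    last_update: float = 0.0  # time of the last score update, in seconds

    def score_at(self, now):
        # Halve the stored score for every HALF_LIFE seconds that passed.
        return self.score * 0.5 ** ((now - self.last_update) / HALF_LIFE)

    def note_valid_bytes(self, n_bytes, now):
        # Decay the old score up to `now`, then add the new traffic.
        self.score = self.score_at(now) + n_bytes
        self.last_update = now


def pick_victims_by_traffic(conns, n_to_kill, now):
    # Connections with the least recent valid traffic are killed first.
    return sorted(conns, key=lambda c: c.score_at(now))[:n_to_kill]


busy = TrafficConn("busy")
stale = TrafficConn("stale")
idle = TrafficConn("idle")
stale.note_valid_bytes(50_000, now=0.0)   # lots of traffic, but long ago
busy.note_valid_bytes(20_000, now=110.0)  # less traffic, but recent

traffic_victims = [c.name for c in pick_victims_by_traffic([busy, stale, idle], 2, now=120.0)]
print(traffic_victims)
```

The decay means a connection cannot buy lasting protection with one early burst of traffic; it has to keep sending valid data.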

comment:13 Changed 18 months ago by Hello71

Priority: Low → High

directory connections fare poorly under this metric, but:

  1. if the connection is legitimate, there will be data flowing down it soon after fetching directory information anyways. works better with the new ORPort-only architecture, but for legacy clients I guess we could just sum together the bandwidth used by an IP address and use that somehow
  2. AIUI directory connections are only absolutely necessary during the very first startup. at any later time, if a directory connection cannot be made or is suddenly terminated, cached data can be temporarily used until a connection can be re-established. therefore, prematurely terminating directory connections is not a huge problem, and is much better than rejecting new connections which may require relay service.

raising priority as discussed on IRC.

comment:14 Changed 18 months ago by teor

I wonder if we should try for a load-balancing metric instead.

A recent tor-relays thread discovered that limiting the connections from each IPv4 /16 resolves this issue:
https://lists.torproject.org/pipermail/tor-relays/2017-December/013776.html

The equivalent IPv6 netblock would be a /32, the minimum regional internet registry allocation block size.

We could identify the /16s or /32s with the largest numbers of connections, and kill those first, using one of the other "usefulness" heuristics.
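
A rough sketch of that selection, assuming each connection is given as an `(address, usefulness)` pair in which a lower usefulness score means "kill first". The scoring itself is left abstract, and `netblock` and `pick_victims_by_netblock` are hypothetical names. Only Python's standard `ipaddress` module is used.

```python
import ipaddress
from collections import defaultdict


def netblock(addr):
    # Group IPv4 addresses by /16 and IPv6 addresses by /32.
    ip = ipaddress.ip_address(addr)
    prefix = 16 if ip.version == 4 else 32
    return ipaddress.ip_network(f"{addr}/{prefix}", strict=False)


def pick_victims_by_netblock(conns, n_to_kill):
    # conns: list of (addr, usefulness) pairs; usefulness is a placeholder
    # for whichever heuristic is used (circuits, recent traffic, ...).
    groups = defaultdict(list)
    for addr, usefulness in conns:
        groups[netblock(addr)].append((usefulness, addr))
    for members in groups.values():
        members.sort()  # least useful first within each netblock

    victims = []
    while len(victims) < n_to_kill and groups:
        # Always take from the netblock that currently has the most connections.
        biggest = max(groups, key=lambda block: len(groups[block]))
        _, addr = groups[biggest].pop(0)
        victims.append(addr)
        if not groups[biggest]:
            del groups[biggest]
    return victims


example = [
    ("198.51.100.1", 5),
    ("203.0.113.7", 0),
    ("198.51.100.2", 1),
    ("198.51.100.3", 9),
    ("2001:db8::1", 2),
]
netblock_victims = pick_victims_by_netblock(example, 2)
print(netblock_victims)
```

In this made-up example both victims come from the crowded /16, even though the lone connection from 203.0.113.7 has the lowest usefulness score.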

comment:15 in reply to:  14 Changed 18 months ago by Hello71

Replying to teor:

> I wonder if we should try for a load-balancing metric instead.
>
> A recent tor-relays thread discovered that limiting the connections from each IPv4 /16 resolves this issue:
> https://lists.torproject.org/pipermail/tor-relays/2017-December/013776.html
>
> The equivalent IPv6 netblock would be a /32, the minimum regional internet registry allocation block size.
>
> We could identify the /16s or /32s with the largest numbers of connections, and kill those first, using one of the other "usefulness" heuristics.

I think that might work, but I don't see why that would be any better than using only bandwidth consumed. In fact, I think that would have the same issue that I mentioned on the list of overkilling NATed clients, potentially the ones most in need of anonymity! IIRC, Cleanfeed is known to proxy all connections through a small number of IPs; I wouldn't be surprised if China, Iran and company did the same.

Filtering based on bandwidth used is reputation-neutral, has zero false positives, and has near-zero added cost.

Last edited 18 months ago by Hello71

comment:16 Changed 17 months ago by Hello71

in other words, it's impossible, using netblocks only, to distinguish between "real" clients behind some mobile network's carrier-grade NAT and a bunch of regular clients on a VPS somewhere.

hm... upon further consideration though, perhaps it would be possible to use a memory-hard proof of work algorithm here. even phones under $100 have at least 2 GB of RAM, so completing an occasional 1 GB POW should only momentarily slow the device. should be easy on battery life too, unlike a CPU POW. I did a quick calculation and an attacker would need s = cfn, where s is the required server RAM, c is the challenge difficulty, f is the frequency, and n is the number of connections to be held, and if c = 1 GB * 3 sec, f = 1/10 min, n = 200, then s = 1 GB, or around $5/month per 200 connections, which seems sufficiently expensive to deter this particular attack. however, there are a number of downsides to this plan. not only does it require additional protocol design (time which could be spent doing something else, like IPv6 support), I hear the iOS Tor people are limited to 15 MB, so even if the device has 10 GB of RAM that won't help. I figure "you must reopen the Tor app every ten minutes to maintain your connection" is not a good solution.
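
The s = cfn estimate above can be checked with a few lines of arithmetic. The function name and parameter names below are made up for the example; the numbers are the ones from the comment (a 1 GB challenge held for 3 seconds, once every 10 minutes, for 200 connections).

```python
def required_server_ram_gb(challenge_gb, solve_seconds, interval_seconds, n_connections):
    # s = c * f * n, with c = challenge_gb * solve_seconds (GB-seconds)
    # and f = 1 / interval_seconds (challenges per second per connection).
    c = challenge_gb * solve_seconds
    f = 1.0 / interval_seconds
    return c * f * n_connections


s = required_server_ram_gb(
    challenge_gb=1.0,
    solve_seconds=3.0,
    interval_seconds=10 * 60,
    n_connections=200,
)
print(f"attacker needs about {s:.2f} GB of RAM")
```

This only reproduces the arithmetic: each connection occupies 1 GB for 3 seconds out of every 600, so 200 connections average out to about 1 GB held at any moment. The $5/month figure is the commenter's own estimate and is not derived here.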

hm... perhaps we could use both: clients that require long-running connections for things like IRC must submit proofs of work (either CPU or memory), and iOS clients just have to live with occasionally re-establishing their connections if the relay is under DoS.

regardless, in the short to medium term, we should probably implement the bandwidth method. this latest surge in fake clients has noticeably increased the number of "users": https://metrics.torproject.org/userstats-relay-country.html?start=2017-12-01&end=2017-12-15&country=all&events=on. fortunately, performance has not yet been seriously affected, but it is plausible that they will get more servers online and the curve will continue going up.

in terms of implementation, is it correct that right now, only OR connections count recent traffic?

comment:17 Changed 16 months ago by nickm

Milestone: Tor: 0.3.3.x-final → Tor: 0.3.4.x-final

Mark a lot of assigned/needs_revision tickets as 0.3.4. If you think this should happen in 0.3.3 instead, just let me know?

comment:18 Changed 15 months ago by dgoulet

Keywords: tor-dos added; dos removed

Rename keyword "dos" to "tor-dos"

comment:19 Changed 14 months ago by nickm

Keywords: 034-triage-20180328 added

comment:20 Changed 14 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:21 Changed 14 months ago by nickm

Milestone: Tor: 0.3.4.x-final → Tor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.
