Opened 2 years ago

Last modified 3 months ago

#22422 assigned defect

Add noise to PaddingStatistics

Reported by: teor
Owned by:
Priority: High
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version: Tor: 0.3.1.1-alpha
Severity: Normal
Keywords: privcount, guard-discovery, 034-triage-20180328
Cc: karsten, mikeperry, robgjansen, amj703
Actual Points:
Parent ID:
Points: 0.5
Reviewer:
Sponsor: SponsorQ

Description

It's safer to publish statistics if they have noise added.

Even though we round the totals, that's not enough to ensure privacy for a certain amount of user activity without added noise.

We need to fix this before 0.3.1 becomes stable.

Child Tickets

Change History (19)

comment:1 Changed 2 years ago by nickm

Priority: Medium → High

comment:2 Changed 2 years ago by mikeperry

Cc: karsten added

Karsten and I discussed this about a year ago, and came to the conclusion that rounding to 10k cells was sufficient, especially since these counts are accumulated over a full 24 hour period. Relays are already reporting higher resolution for BW read and write history, and relays that opt in have higher resolution for cell statistics too.

Is there a specific thing we're worried about with the current numbers?

Can we quantify the additional privacy we'd get from noise vs just making the rounding larger? Should we do one, or the other, or both?

comment:3 in reply to:  2 Changed 2 years ago by teor

Replying to mikeperry:

Karsten and I discussed this about a year ago, and came to the conclusion that rounding to 10k cells was sufficient, especially since these counts are accumulated over a full 24 hour period. Relays are already reporting higher resolution for BW read and write history, and relays that opt in have higher resolution for cell statistics too.

Then we should (eventually) fix these higher resolution statistics by adding noise to them too.

Is there a specific thing we're worried about with the current numbers?

We are not adding noise, so we are relying on the other user activity being variable enough to hide an individual user's activity. There's no guarantee that will happen.

Here's one possible attack:

  1. I want to detect the padding being used by a particular client, to see if it is connecting to a particular guard. I know the likely padding amount for this client.
  2. I have some high-resolution non-noisy data figures available (for example, BW read and write history). I use these to estimate the final padding totals.
  3. I manipulate the final padding totals for the guard to be just below a rounding threshold.
  4. If the client connects, the guard reports a figure above the threshold. If the client does not, the guard reports a figure below the threshold.
  5. I repeat steps 2-4 until I know with enough certainty whether the client is connecting. (This takes time that depends on the variability in the system.)

If I want to enhance this attack, I can use multiple statistics, or reduce the amount of variability in the system.
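The threshold step of this attack can be sketched in a few lines. This is a toy illustration, not tor code: it assumes totals are rounded up to the next multiple of 10k cells, and `background` and `client_padding` are made-up numbers chosen only to show the effect.

```python
ROUND_TO = 10_000  # rounding granularity discussed in this ticket (10k cells)


def round_up(count, step=ROUND_TO):
    """Round count up to the next multiple of step (assumed rounding rule)."""
    return -(-count // step) * step


def reported_total(background, client_padding):
    """Figure the guard would publish for one reporting period."""
    return round_up(background + client_padding)


# Hypothetical values: the attacker has steered the guard's total to sit
# just below a rounding threshold.
background = 29_990
client_padding = 50  # hypothetical padding cells caused by the target client

without_client = reported_total(background, 0)
with_client = reported_total(background, client_padding)

print(without_client, with_client)
```

With these numbers the published figure moves by a full rounding step depending only on whether the client connected, which is the signal the attacker reads off.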

Can we quantify the additional privacy we'd get from noise vs just making the rounding larger? Should we do one, or the other, or both?

Rounding does not guarantee you any privacy. The larger the rounding amount, and the more variability in the system, the less likely any particular total will expose a user's activity, but there is always a chance that it will.

(But rounding is really good for grouping similar noisy figures, and helping people understand the precision of the data. That's why we should do it.)

You get guaranteed privacy from noise. The larger the noise, the larger the amount of user activity that is guaranteed to be hidden over a larger amount of time. You don't have to round to get this guarantee: adding noise is enough. You also don't have to rely on any other activity in the system to get this guarantee.
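One standard way to make this guarantee concrete is the Laplace mechanism from differential privacy. The sketch below is an illustration only, not what tor or PrivCount implements: `sensitivity` (the most cells a single client is assumed to contribute in a period) and `epsilon` (the privacy parameter) are hypothetical inputs that would have to be chosen from real measurements.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, sensitivity, epsilon, rng):
    """Add Laplace noise with scale sensitivity/epsilon to a count."""
    scale = sensitivity / epsilon
    return true_count + laplace_noise(scale, rng)


rng = random.Random(0)  # fixed seed only so the example is repeatable

# Hypothetical numbers: hide up to 2000 cells of one client's activity.
published = noisy_count(true_count=1_234_567, sensitivity=2_000,
                        epsilon=0.3, rng=rng)
print(round(published))
```

A larger `sensitivity` or a smaller `epsilon` gives a larger noise scale, which matches the point above: more noise hides more activity. Rounding can still be applied to the noisy value afterwards for readability.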

comment:4 Changed 2 years ago by teor

We should try to add enough noise to hide a single client's activity over 24 hours.

How many padding cells do you expect for a single idle client per day?

A recent measurement showed that ~20% of all entry connections are inactive, so we should try to hide at least an idle client's worth of activity:

"We found that Tor has about 700 thousand unique clients connecting to the network during an average 10-minute interval. Compared to Tor’s own estimate of about 1.75 million clients per day in May 2016 [4], this suggests that the client population turns over about 2.5 times a day. Somewhat surprisingly, we found that about 130 thousand clients have inactive circuits during an average 10 minutes"

Source:
Section 5.3, Page 11 in http://www.robgjansen.com/publications/privcount-ccs2016.pdf
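For reference, the ratios behind the "~20%" and "2.5 times a day" figures can be checked directly from the numbers quoted above. The variable names are only labels for this check.

```python
# Figures quoted from the PrivCount paper above.
clients_per_10min = 700_000      # unique clients in an average 10-minute interval
clients_per_day = 1_750_000      # Tor's own daily estimate, May 2016
inactive_per_10min = 130_000     # clients with inactive circuits per 10 minutes

turnover_per_day = clients_per_day / clients_per_10min
inactive_fraction = inactive_per_10min / clients_per_10min

print(f"turnover: {turnover_per_day:.1f}x per day")
print(f"inactive: {inactive_fraction:.1%} of clients")
```

The inactive share comes out a little under 19%, which is where the rounded "~20%" comes from.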

comment:5 Changed 2 years ago by nickm

Cc: mikeperry added

comment:6 Changed 2 years ago by mikeperry

To know how many padding cells to expect for a client, we need information on how long an average client's connection lasts, how often they make connections during a 24 hour interval, and what percentage of the time those connections are idle. Do we have this data?

Also, is there a good example of where we add noise in a way that successfully calculates how to hide a single client's activity? It would be helpful to have a reference to work off of.

comment:7 in reply to: 6 Changed 2 years ago by teor

Cc: robgjansen amj703 added
Keywords: privcount added

Replying to mikeperry:

To know how many padding cells to expect for a client, we need information on how long an average client's connection lasts,

Using closed connections to count clients (and also not quite the time interval you're after):
"Tor has about 700 thousand unique clients connecting to the network during an average 10-minute interval."

how often they make connections during a 24 hour interval,

"Compared to Tor’s own estimate of about 1.75 million clients per day in May 2016..., this suggests that the client population turns over about 2.5 times a day."

and what percentage of the time those connections are idle. Do we have this data?

"Somewhat surprisingly, we found that about 130 thousand clients have inactive circuits during an average 10 minutes."

(That is, closed connections with no circuits with more than 8 cells either sent or received.)

Also, is there a good example of where we add noise in a way that successfully calculates how to hide a single client's activity? It would be helpful to have a reference to work off of.

Pages 7-8 of the PrivCount paper give the theory behind differential noise.
I am not sure where to find anything similar in the tor code.
When we add noise, we've done it inconsistently and arbitrarily in the past.

Perhaps Rob or Aaron can help?

comment:8 Changed 2 years ago by teor

Also, feel free to ping me by email; for some reason the spam filter is eating your replies to this ticket.

comment:9 in reply to: 7 Changed 2 years ago by mikeperry

Replying to teor:

Replying to mikeperry:

To know how many padding cells to expect for a client, we need information on how long an average client's connection lasts,

Using closed connections to count clients (and also not quite the time interval you're after):
"Tor has about 700 thousand unique clients connecting to the network during an average 10-minute interval."

This doesn't tell us how long a *single* client remains connected on average.

how often they make connections during a 24 hour interval,

"Compared to Tor’s own estimate of about 1.75 million clients per day in May 2016..., this suggests that the client population turns over about 2.5 times a day."

This doesn't tell us how many connections a *single* client makes in a day.

and what percentage of the time those connections are idle. Do we have this data?

"Somewhat surprisingly, we found that about 130 thousand clients have inactive circuits during an average 10 minutes."

(That is, closed connections with no circuits with more than 8 cells either sent or received.)

This doesn't tell us how *long* a single connection remains idle, on average.

Also, is there a good example of where we add noise in a way that successfully calculates how to hide a single client's activity? It would be helpful to have a reference to work off of.

Pages 7-8 of the PrivCount paper give the theory behind differential noise.
I am not sure where to find anything similar in the tor code.
When we add noise, we've done it inconsistently and arbitrarily in the past.

Right now, it looks like this is where we're headed here, too.

Perhaps Rob or Aaron can help?

I'm hoping Karsten can as well.

comment:10 in reply to:  9 Changed 2 years ago by teor

Replying to mikeperry:

Replying to teor:

Replying to mikeperry:

...

Also, is there a good example of where we add noise in a way that successfully calculates how to hide a single client's activity? It would be helpful to have a reference to work off of.

Here's how it's done in practice:

  1. Collect the statistics on a relay without noise, and without publishing them
  2. Use the statistics to estimate individual client usage
  3. Erase the detailed outputs of the non-noisy statistics collection
  4. Add noise sufficient to hide a single client's activity (that is, make the average? amount of noise added at least as much as the individual client usage estimate)

That should work in this case, too, but we would also need to estimate client numbers for that relay. We could do that using unique connecting IP addresses and channel_is_client(), or we could use existing Tor client statistics and multiply them by the fraction of guard consensus weight assigned to the relay.
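The second estimate (network-wide client statistics scaled by guard consensus weight) and step 2 of the list above can be sketched as follows. This is a rough illustration under stated assumptions: the function names, the guard weights, and the padding total are all hypothetical, and only the 1.75 million clients per day figure comes from this ticket.

```python
def estimate_relay_clients(network_clients, relay_guard_weight,
                           total_guard_weight):
    """Scale network-wide client numbers by the relay's share of guard weight."""
    return network_clients * relay_guard_weight / total_guard_weight


def estimate_per_client_cells(total_padding_cells, relay_clients):
    """Average padding cells per client, from a non-published local total."""
    return total_padding_cells / relay_clients


# Hypothetical inputs, for illustration only.
network_clients_per_day = 1_750_000   # daily client estimate quoted in this ticket
relay_guard_weight = 1_000            # made-up consensus weight for this relay
total_guard_weight = 1_000_000        # made-up total guard consensus weight
total_padding_cells = 3_500_000       # made-up 24-hour padding total

relay_clients = estimate_relay_clients(
    network_clients_per_day, relay_guard_weight, total_guard_weight)
per_client_cells = estimate_per_client_cells(total_padding_cells, relay_clients)

# Step 4: the added noise should be at least this large on average.
print(relay_clients, per_client_cells)
```

An average like this is only a starting point: a single heavy client could contribute well above the mean, so the noise target might need to be based on a higher estimate than the average.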

Pages 7-8 of the PrivCount paper give the theory behind differential noise.
I am not sure where to find anything similar in the tor code.
When we add noise, we've done it inconsistently and arbitrarily in the past.

Right now, it looks like this is where we're headed here, too.

Perhaps Rob or Aaron can help?

Some of Aaron's upcoming research can measure individual client usage over long timescales, but PrivCount can't, because it's not safe to keep client IP addresses in memory for long periods of time.

I'm hoping Karsten can as well.

I'd like Karsten to check the steps I suggested.

comment:11 Changed 2 years ago by mikeperry

Keywords: guard-discovery added

I am tagging this as guard discovery so we can compare it to related attacks and prioritize appropriately. I am not convinced it is as severe as the other attacks we enumerated in Wilmington (which I am also working on filing and/or tagging).

comment:12 Changed 2 years ago by mikeperry

asn pointed out that this paper has some information on client connection lifetimes, at least: http://www.icir.org/johanna/papers/pam16tor.pdf

These statistics will have shorter-lived directory guard connections blended into their statistics, though (and we don't try to pad directory guard connections).

comment:13 Changed 2 years ago by nickm

Owner: set to mikeperry
Status: new → assigned

Assigning this one to mike, but I'd like teor and karsten to also take responsibility for continuing the discussion.

comment:14 Changed 2 years ago by teor

Status: assigned → needs_information

I don't think this is as high a priority as known guard discovery attacks that are happening in the wild.

I think we could wait to add noise until we do it systematically across our codebase.

I'll leave it to mike to make the final call.

comment:15 Changed 2 years ago by nickm

Sponsor: SponsorQ

comment:16 Changed 19 months ago by teor

Milestone: Tor: 0.3.1.x-final → Tor: 0.3.4.x-final

These feature and bugfix tickets have no patches. The earliest they will get done is 0.3.4.

comment:17 Changed 18 months ago by nickm

Keywords: 034-triage-20180328 added

comment:18 Changed 17 months ago by teor

Milestone: Tor: 0.3.4.x-final → Tor: unspecified

This ticket is not on our 6 month roadmap.

comment:19 Changed 3 months ago by gaba

Owner: mikeperry deleted
Status: needs_information → assigned