Karsten and I discussed this about a year ago, and came to the conclusion that rounding to 10k cells was sufficient, especially since these counts are accumulated over a full 24 hour period. Relays are already reporting higher resolution for BW read and write history, and relays that opt in have higher resolution for cell statistics too.
Is there a specific thing we're worried about with the current numbers?
Can we quantify the additional privacy we'd get from noise vs just making the rounding larger? Should we do one, or the other, or both?
Karsten and I discussed this about a year ago, and came to the conclusion that rounding to 10k cells was sufficient, especially since these counts are accumulated over a full 24 hour period. Relays are already reporting higher resolution for BW read and write history, and relays that opt in have higher resolution for cell statistics too.
Then we should (eventually) fix these higher resolution statistics by adding noise to them too.
Is there a specific thing we're worried about with the current numbers?
We are not adding noise, so we are relying on the other user activity being variable enough to hide an individual user's activity. There's no guarantee that will happen.
Here's one possible attack:
1. I want to detect the padding being used by a particular client, to see if it is connecting to a particular guard. I know the likely padding amount for this client.
2. I have some high-resolution non-noisy figures available (for example, BW read and write history). I use these to estimate the final padding totals.
3. I manipulate the final padding totals for the guard to be just below a rounding threshold.
4. If the client connects, the guard reports a figure above the threshold. If the client does not, the guard reports a figure below the threshold.
5. I repeat steps 2-4 until I know with enough certainty whether the client is connecting. (This takes time that depends on the variability in the system.)
If I want to enhance this attack, I can use multiple statistics, or reduce the amount of variability in the system.
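Here's a toy sketch of the threshold step in that attack. All the numbers (the rounding granularity, the client's daily padding volume, and the background total the attacker steers toward) are made up for illustration; the point is only that once the background sits just below a rounding boundary, a small per-client contribution flips the published figure to the next bin.

```python
ROUND_TO = 10_000      # rounding granularity of the published statistic
CLIENT_PADDING = 400   # hypothetical padding cells the target client adds per day

def reported_total(background, client_present):
    """Daily padding total as the relay would publish it, rounded to 10k cells."""
    total = background + (CLIENT_PADDING if client_present else 0)
    # round-half-up to the nearest multiple of ROUND_TO, in integer arithmetic
    return (total + ROUND_TO // 2) // ROUND_TO * ROUND_TO

# Attacker steers the background just below the 245,000 rounding boundary,
# so the client's 400 cells are enough to tip the published total over it.
background = 244_900

print(reported_total(background, client_present=False))  # 240000
print(reported_total(background, client_present=True))   # 250000
```

The published totals differ by a whole rounding bin depending only on whether the client connected, which is exactly the signal the attack extracts.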
Can we quantify the additional privacy we'd get from noise vs just making the rounding larger? Should we do one, or the other, or both?
Rounding does not guarantee you any privacy. The larger the rounding amount, and the more variability in the system, the less likely any particular total will expose a user's activity, but there is always a chance that it will.
(But rounding is really good for grouping similar noisy figures, and helping people understand the precision of the data. That's why we should do it.)
You get guaranteed privacy from noise. The larger the noise, the more user activity is guaranteed to be hidden, and for longer. You don't have to round to get this guarantee: adding noise is enough. You also don't have to rely on any other activity in the system to get this guarantee.
We should try to add enough noise to hide a single client's activity over 24 hours.
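For what that calibration might look like, here is a sketch in the PrivCount style (Laplace noise scaled to sensitivity/epsilon). Both numbers are assumptions for illustration, not measured values; picking the real sensitivity bound is exactly the per-client measurement question below.

```python
import random

# Differential-privacy sketch: to hide one client's contribution of at most
# `sensitivity` padding cells per day with privacy parameter epsilon, add
# Laplace noise with scale sensitivity / epsilon.
sensitivity = 500   # ASSUMED upper bound on one idle client's padding cells/day
epsilon = 0.3       # privacy parameter: smaller means stronger privacy

scale = sensitivity / epsilon

def noisy_count(true_count):
    """Return the padding-cell count with Laplace(0, scale) noise added.

    The difference of two i.i.d. exponentials with mean `scale` is
    Laplace-distributed with scale `scale`.
    """
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise
```

With these (assumed) numbers the noise has scale ~1667 cells, so a single client's 500 cells are well within one noise standard deviation and can't be distinguished from the noise, regardless of how stable the background traffic is.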
How many padding cells do you expect for a single idle client per day?
A recent measurement showed that ~20% of all entry connections are inactive, so we should try to hide at least an idle client's worth of activity:
We found that Tor has about 700 thousand unique clients connecting to the network during an average 10-minute interval. Compared to Tor's own estimate of about 1.75 million clients per day in May 2016 [4], this suggests that the client population turns over about 2.5 times a day. Somewhat surprisingly, we found that about 130 thousand clients have inactive circuits during an average 10 minutes.
To know how many padding cells to expect for a client, we need information on how long an average client's connection lasts, how often they make connections during a 24 hour interval, and what percentage of the time those connections are idle. Do we have this data?
Also, is there a good example of where we add noise in a way that successfully hides a single client's activity? It would be helpful to have a reference to work off of.
To know how many padding cells to expect for a client, we need information on how long an average client's connection lasts,
Using closed connections to count clients (and also not quite the time interval you're after):
"Tor has about 700 thousand unique clients connecting to the network during an average 10-minute interval."
how often they make connections during a 24 hour interval,
"Compared to Tor’s own estimate of about 1.75 million clients per day in May 2016..., this suggests that the client population turns over about 2.5 times a day."
and what percentage of the time those connections are idle. Do we have this data?
"Somewhat surprisingly, we found that about 130 thousand clients have inactive circuits during an average 10 minutes."
(That is, closed connections with no circuits with more than 8 cells either sent or received.)
Also, is there a good example of where we add noise in a way that successfully hides a single client's activity? It would be helpful to have a reference to work off of.
Pages 7-8 of the PrivCount paper give the theory behind differential noise.
I am not sure where to find anything similar in the tor code.
When we add noise, we've done it inconsistently and arbitrarily in the past.
To know how many padding cells to expect for a client, we need information on how long an average client's connection lasts,
Using closed connections to count clients (and also not quite the time interval you're after):
"Tor has about 700 thousand unique clients connecting to the network during an average 10-minute interval."
This doesn't tell us how long a single client remains connected on average.
how often they make connections during a 24 hour interval,
"Compared to Tor’s own estimate of about 1.75 million clients per day in May 2016..., this suggests that the client population turns over about 2.5 times a day."
This doesn't tell us how many connections a single client makes in a day.
and what percentage of the time those connections are idle. Do we have this data?
"Somewhat surprisingly, we found that about 130 thousand clients have inactive circuits during an average 10 minutes."
(That is, closed connections with no circuits with more than 8 cells either sent or received.)
This doesn't tell us how long a single connection remains idle, on average.
Also, is there a good example of where we add noise in a way that successfully hides a single client's activity? It would be helpful to have a reference to work off of.
Pages 7-8 of the PrivCount paper give the theory behind differential noise.
I am not sure where to find anything similar in the tor code.
When we add noise, we've done it inconsistently and arbitrarily in the past.
Right now, it looks like this is where we're headed here, too.
Also, is there a good example of where we add noise in a way that successfully hides a single client's activity? It would be helpful to have a reference to work off of.
Here's how it's done in practice:
1. Collect the statistics on a relay without noise, and without publishing them.
2. Use the statistics to estimate individual client usage.
3. Erase the detailed outputs of the non-noisy statistics collection.
4. Add noise sufficient to hide a single client's activity (that is, make the average? amount of noise added at least as much as the individual client usage estimate).
That should work in this case, too. But we would also need to estimate client numbers for that relay, which we could do using unique connecting IP addresses and channel_is_client(). Or we could use existing Tor client statistics and multiply them by the fraction of guard consensus weight assigned to the relay.
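The second option is simple arithmetic. In this sketch the network-wide client count comes from the paper quoted above, but the relay's guard consensus weight fraction is an assumed placeholder:

```python
# Rough per-relay client estimate from network-wide totals.
daily_clients_network = 1_750_000  # Tor's own estimate of clients/day (May 2016)
guard_weight_fraction = 0.002      # ASSUMED fraction of guard consensus weight
                                   # assigned to this relay

# Expected number of distinct clients choosing this relay as a guard per day,
# assuming clients pick guards in proportion to consensus weight.
clients_at_relay = round(daily_clients_network * guard_weight_fraction)
print(clients_at_relay)  # 3500
```

Multiplying this client estimate by the per-client padding estimate would then give the sensitivity bound that the noise needs to cover.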
Pages 7-8 of the PrivCount paper give the theory behind differential noise.
I am not sure where to find anything similar in the tor code.
When we add noise, we've done it inconsistently and arbitrarily in the past.
Right now, it looks like this is where we're headed here, too.
Perhaps Rob or Aaron can help?
Some of Aaron's upcoming research can measure individual client usage over long timescales, but PrivCount can't, because it's not safe to keep client IP addresses in memory for long periods of time.
I am tagging this as guard discovery so we can compare it to related attacks and prioritize appropriately. I am not convinced it is as severe as the other attacks we enumerated in Wilmington (which I am also working on filing and/or tagging).
These statistics will have shorter-lived directory guard connections blended in, though (and we don't try to pad directory guard connections).