Karsten and I discussed this about a year ago, and came to the conclusion that rounding to 10k cells was sufficient, especially since these counts are accumulated over a full 24 hour period. Relays are already reporting higher resolution for BW read and write history, and relays that opt in have higher resolution for cell statistics too.
Is there a specific thing we're worried about with the current numbers?
Can we quantify the additional privacy we'd get from noise vs just making the rounding larger? Should we do one, or the other, or both?
Karsten and I discussed this about a year ago, and came to the conclusion that rounding to 10k cells was sufficient, especially since these counts are accumulated over a full 24 hour period. Relays are already reporting higher resolution for BW read and write history, and relays that opt in have higher resolution for cell statistics too.
Then we should (eventually) fix these higher resolution statistics by adding noise to them too.
Is there a specific thing we're worried about with the current numbers?
We are not adding noise, so we are relying on the other user activity being variable enough to hide an individual user's activity. There's no guarantee that will happen.
Here's one possible attack:
1. I want to detect the padding being used by a particular client, to see if it is connecting to a particular guard. I know the likely padding amount for this client.
2. I have some high-resolution non-noisy figures available (for example, BW read and write history). I use these to estimate the final padding totals.
3. I manipulate the final padding totals for the guard to be just below a rounding threshold.
4. If the client connects, the guard reports a figure above the threshold. If the client does not, the guard reports a figure below the threshold.
5. I repeat steps 2-4 until I know with enough certainty whether the client is connecting. (This takes time that depends on the variability in the system.)
If I want to enhance this attack, I can use multiple statistics, or reduce the amount of variability in the system.
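Here's a toy sketch of the threshold step in that attack. All the numbers (the rounding granularity, the client's daily padding volume, and the background total the attacker steers toward) are made up for illustration; the point is only that once the background sits just below a rounding boundary, a small per-client contribution flips the published figure to the next bin.

```python
ROUND_TO = 10_000      # rounding granularity of the published statistic
CLIENT_PADDING = 400   # hypothetical padding cells the target client adds per day

def reported_total(background, client_present):
    """Daily padding total as the relay would publish it, rounded to 10k cells."""
    total = background + (CLIENT_PADDING if client_present else 0)
    # round-half-up to the nearest multiple of ROUND_TO, in integer arithmetic
    return (total + ROUND_TO // 2) // ROUND_TO * ROUND_TO

# Attacker steers the background just below the 245,000 rounding boundary,
# so the client's 400 cells are enough to tip the published total over it.
background = 244_900

print(reported_total(background, client_present=False))  # 240000
print(reported_total(background, client_present=True))   # 250000
```

The published totals differ by a whole rounding bin depending only on whether the client connected, which is exactly the signal the attack extracts.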
Can we quantify the additional privacy we'd get from noise vs just making the rounding larger? Should we do one, or the other, or both?
Rounding does not guarantee you any privacy. The larger the rounding amount, and the more variability in the system, the less likely any particular total will expose a user's activity, but there is always a chance that it will.
(But rounding is really good for grouping similar noisy figures, and helping people understand the precision of the data. That's why we should do it.)
You get guaranteed privacy from noise. The larger the noise, the more user activity is guaranteed to be hidden, and for longer. You don't have to round to get this guarantee: adding noise is enough. You also don't have to rely on any other activity in the system to get this guarantee.
We should try to add enough noise to hide a single client's activity over 24 hours.
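For what that calibration might look like, here is a sketch in the PrivCount style (Laplace noise scaled to sensitivity/epsilon). Both numbers are assumptions for illustration, not measured values; picking the real sensitivity bound is exactly the per-client measurement question below.

```python
import random

# Differential-privacy sketch: to hide one client's contribution of at most
# `sensitivity` padding cells per day with privacy parameter epsilon, add
# Laplace noise with scale sensitivity / epsilon.
sensitivity = 500   # ASSUMED upper bound on one idle client's padding cells/day
epsilon = 0.3       # privacy parameter: smaller means stronger privacy

scale = sensitivity / epsilon

def noisy_count(true_count):
    """Return the padding-cell count with Laplace(0, scale) noise added.

    The difference of two i.i.d. exponentials with mean `scale` is
    Laplace-distributed with scale `scale`.
    """
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise
```

With these (assumed) numbers the noise has scale ~1667 cells, so a single client's 500 cells are well within one noise standard deviation and can't be distinguished from the noise, regardless of how stable the background traffic is.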
How many padding cells do you expect for a single idle client per day?
A recent measurement showed that ~20% of all entry connections are inactive, so we should try to hide at least an idle client's worth of activity:
We found that Tor has about 700 thousand unique clients connecting to the network during an average 10-minute interval. Compared to Tor's own estimate of about 1.75 million clients per day in May 2016 [4], this suggests that the client population turns over about 2.5 times a day. Somewhat surprisingly, we found that about 130 thousand clients have inactive circuits during an average 10 minutes.
To know how many padding cells to expect for a client, we need information on how long an average client's connection lasts, how often they make connections during a 24 hour interval, and what percentage of the time those connections are idle. Do we have this data?
Also, is there a good example of where we add noise in a way that successfully hides a single client's activity? It would be helpful to have a reference to work off of.
To know how many padding cells to expect for a client, we need information on how long an average client's connection lasts,
Using closed connections to count clients (and also not quite the time interval you're after):
"Tor has about 700 thousand unique clients connecting to the network during an average 10-minute interval."
how often they make connections during a 24 hour interval,
"Compared to Tor’s own estimate of about 1.75 million clients per day in May 2016..., this suggests that the client population turns over about 2.5 times a day."
and what percentage of the time those connections are idle. Do we have this data?
"Somewhat surprisingly, we found that about 130 thousand clients have inactive circuits during an average 10 minutes."
(That is, closed connections with no circuits with more than 8 cells either sent or received.)
Also, is there a good example of where we add noise in a way that successfully hides a single client's activity? It would be helpful to have a reference to work off of.
Pages 7-8 of the PrivCount paper give the theory behind differential noise.
I am not sure where to find anything similar in the tor code.
When we add noise, we've done it inconsistently and arbitrarily in the past.
To know how many padding cells to expect for a client, we need information on how long an average client's connection lasts,
Using closed connections to count clients (and also not quite the time interval you're after):
"Tor has about 700 thousand unique clients connecting to the network during an average 10-minute interval."
This doesn't tell us how long a single client remains connected on average.
how often they make connections during a 24 hour interval,
"Compared to Tor’s own estimate of about 1.75 million clients per day in May 2016..., this suggests that the client population turns over about 2.5 times a day."
This doesn't tell us how many connections a single client makes in a day.
and what percentage of the time those connections are idle. Do we have this data?
"Somewhat surprisingly, we found that about 130 thousand clients have inactive circuits during an average 10 minutes."
(That is, closed connections with no circuits with more than 8 cells either sent or received.)
This doesn't tell us how long a single connection remains idle, on average.
Also, is there a good example of where we add noise in a way that successfully hides a single client's activity? It would be helpful to have a reference to work off of.
Pages 7-8 of the PrivCount paper give the theory behind differential noise.
I am not sure where to find anything similar in the tor code.
When we add noise, we've done it inconsistently and arbitrarily in the past.
Right now, it looks like this is where we're headed here, too.
Also, is there a good example of where we add noise in a way that successfully hides a single client's activity? It would be helpful to have a reference to work off of.
Here's how it's done in practice:
1. Collect the statistics on a relay without noise, and without publishing them.
2. Use the statistics to estimate individual client usage.
3. Erase the detailed outputs of the non-noisy statistics collection.
4. Add noise sufficient to hide a single client's activity (that is, make the average? amount of noise added at least as much as the individual client usage estimate).
That should work in this case, too. But we would also need to estimate client numbers for that relay, which we could do using unique connecting IP addresses and channel_is_client(). Or we could use existing Tor client statistics and multiply them by the fraction of guard consensus weight assigned to the relay.
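The second option is simple arithmetic. In this sketch the network-wide client count comes from the paper quoted above, but the relay's guard consensus weight fraction is an assumed placeholder:

```python
# Rough per-relay client estimate from network-wide totals.
daily_clients_network = 1_750_000  # Tor's own estimate of clients/day (May 2016)
guard_weight_fraction = 0.002      # ASSUMED fraction of guard consensus weight
                                   # assigned to this relay

# Expected number of distinct clients choosing this relay as a guard per day,
# assuming clients pick guards in proportion to consensus weight.
clients_at_relay = round(daily_clients_network * guard_weight_fraction)
print(clients_at_relay)  # 3500
```

Multiplying this client estimate by the per-client padding estimate would then give the sensitivity bound that the noise needs to cover.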
Pages 7-8 of the PrivCount paper give the theory behind differential noise.
I am not sure where to find anything similar in the tor code.
When we add noise, we've done it inconsistently and arbitrarily in the past.
Right now, it looks like this is where we're headed here, too.
Perhaps Rob or Aaron can help?
Some of Aaron's upcoming research can measure individual client usage over long timescales, but PrivCount can't, because it's not safe to keep client IP addresses in memory for long periods of time.
I am tagging this as guard discovery so we can compare it to related attacks and prioritize appropriately. I am not convinced it is as severe as the other attacks we enumerated in Wilmington (which I am also working on filing and/or tagging).
These statistics will have shorter-lived directory guard connections blended in, though (and we don't try to pad directory guard connections).