Keep in mind that this ticket is only about the two current statistics explained in the report linked above. Future experimental statistics could use a second torrc option to enable "experimental stats", separate from the current option, which is considered to enable only the stable, safe ones (see the child ticket for that).
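For context, here is a minimal torrc sketch of the option in question. HiddenServiceStatistics is the existing option covering both statistics; the second option name below is purely hypothetical, standing in for whatever the child ticket ends up proposing.

```
## Existing option: both the rendezvous-cell count and the unique .onion
## address count are reported when this is enabled.
HiddenServiceStatistics 1

## Hypothetical future option for experimental, less-vetted statistics.
## The name is illustrative only; see the child ticket.
#ExperimentalStatistics 0
```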
It may be useful to note in the tor manual entry which version this became available in. I just tried to use it without noticing that the Ubuntu version is 0.2.4.20.
George, David, and I think this is a great plan. The reasoning is that we are comfortable with the current two statistics, and they are valuable to have. We aren't making the decision here for all future stats, just for these two -- in the future when we have more, and we're not sure about them, we will turn to #15255 (moved).
Nick, can you remind us of your remaining objections to this ticket?
The first statistic is the number of cells on rendezvous circuits observed by a rendezvous point, and the second is the number of unique .onion addresses observed by a hidden-service directory.
As long as appropriate noise is added, I don't much object to having these two on-by-default. They seem quite harmless. Was there a counterargument I was supposed to remember? :)
As evidenced by the fact that very few relays have these enabled, the vast majority of relay operators are either unaware of these statistics or, having been made aware of them, have chosen not to enable them. Even software without a specific purpose of giving people more privacy asks users before gathering statistics on them. There is no benefit to the users on whom the statistics are being collected without notification, knowledge, or agreement, from either of these measurements, which are measured and published per-relay, not in aggregate. The existing bandwidth statistics are already dangerous in terms of guard discovery (#13988 (moved)) and combining them with the statistics in this ticket enables easier attacks. There is no kind of privacy policy to notify users, as is required by (at least) EU law.
Finally, it hasn't been publicly explained anywhere what benefit, if any, there is to anyone, even DARPA, of increasing the precision of these statistics. What purpose are the current measurements from a limited number of relays too noisy for, that less noisy ones won't be?
I'm assuming this is you, qwerty1. Please let me know if it isn't. :)
As evidenced by the fact that very few relays have these enabled, the vast majority of relay operators are either unaware of these statistics or, having been made aware of them, have chosen not to enable them. Even software without a specific purpose of giving people more privacy asks users before gathering statistics on them. There is no benefit to the users on whom the statistics are being collected without notification, knowledge, or agreement, from either of these measurements, which are measured and published per-relay, not in aggregate. The existing bandwidth statistics are already dangerous in terms of guard discovery (#13988 (moved)) and combining them with the statistics in this ticket enables easier attacks.
What attacks are you thinking of? Or are you saying that there may be unforeseen attacks in general, even if we can't think of them today?
While I share your concern about statistics gathering in general, these particular statistics aren't statistics about users as far as I can see, but about total bandwidth and descriptor counts. If I believed that they were about users, or that they posed a privacy threat, I would not support having them merged into Tor at all, let alone on-by-default.
There is no kind of privacy policy to notify users, as is required by (at least) EU law.
Are you really sure this is what EU law says? No information about users is collected; the only information stored here is the total number of rendezvous cells across all circuits, and the total number of .onion addresses seen.
I'd be happy to try to check with an EU lawyer here.
Finally, it hasn't been publicly explained anywhere what benefit, if any, there is to anyone, even DARPA, of increasing the precision of these statistics. What purpose are the current measurements from a limited number of relays too noisy for, that less noisy ones won't be?
David/Roger -- any info here? You're the ones who've been asking for this.
Replying to cypherpunks:
are you saying that there may be unforeseen attacks in general, even if we can't think of them today?
Yes. We don't know how adding these to every relay's extrainfo (this ticket proposes a change from opt-in to opt-out, behind relay operators' backs, without justification or explanation beyond "in order to have" the change) could be harmful.
Let me start by saying that IMO these stats are useful to us:
During the past months, I've used the HS bandwidth graph many times when talking to people to debunk the "darkweb iceberg" myth. I know that other outreach people have used it the same way as well.
I've also found the number of hidden services graph useful for figuring out the size of the onion space. For example, when we are discussing padding techniques to defeat traffic fingerprinting for hidden services, we can now calculate approximately what the total overhead for the whole network is going to be, since we know approximately the number of hidden services. This figure is also helpful for figuring out good values for the number of IPs, HSDirs, etc.
So, as I understand it, here are some reasons for turning this on by default.
Both of these stats have very low coverage right now. The HSDir stat, for example, is about to go below 1% coverage, which means that it will stop working (we require at least 1% coverage for these stats).
This is because not many relay operators are running the stats and some that did have disabled them because they didn't know if they work or if they are useful.
By turning this on by default, we stop hunting down relay operators to ask them to enable stats.
I'm also afraid that this low coverage might be causing our extrapolations to lie. For example, I could imagine that the bandwidth graph does not work very well with that low coverage since most clients that cause lots of traffic connect to relays that are not reporting statistics. This might be why the spikes are so big on that graph, since some days we might be lucky and get those busy clients in the stats. I think it will be very interesting to see how the graph develops when more of the network reports these stats (and we stop relying so much on our crazy extrapolation from 1%).
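To make the coverage argument concrete, here is a rough Python sketch of the extrapolation being described; it assumes we know each reporting relay's fraction of the relevant selection weight, which is a simplification of what the metrics pipeline actually does.

```python
# Extrapolate a network-wide total from the relays that report a statistic.
# Assumes each reporting relay's share of the relevant selection probability
# is known; the real metrics code is more involved than this sketch.

def extrapolate(reports, min_coverage=0.01):
    """reports: list of (observed_count, weight_fraction) per reporting relay."""
    coverage = sum(frac for _, frac in reports)
    if coverage < min_coverage:  # the 1% threshold mentioned above
        raise ValueError("coverage below 1%; extrapolation considered unreliable")
    observed = sum(count for count, _ in reports)
    return observed / coverage

# Example: three relays covering 1.2% of the network report ~300k cells total,
# so the extrapolated network-wide figure is ~25M cells.
print(extrapolate([(120000, 0.005), (100000, 0.004), (80000, 0.003)]))
```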
FWIW, I don't think DARPA actually cares about the precision of these statistics that much. And it's also unclear whether the precision will be much better, since now all 6000 relays will add Laplace noise, whereas before only the 40 relays that reported these stats did (let's just hope that the noise cancels itself out as it should).
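For readers unfamiliar with the obfuscation step, here is a rough sketch of the bin-then-add-Laplace-noise scheme each reporting relay applies before publishing a counter. The bin size, sensitivity, and epsilon values below are illustrative guesses rather than the exact constants tor uses.

```python
import math
import random

def obfuscate(count, bin_size=1024, sensitivity=2048, epsilon=0.3):
    """Round a counter up to a bin boundary, then add Laplace noise.

    Parameter values are illustrative only; tor's actual constants live in
    its source code, not here."""
    binned = math.ceil(count / bin_size) * bin_size
    b = sensitivity / epsilon
    # A Laplace(0, b) sample is the difference of two Exponential(mean b) samples.
    noise = random.expovariate(1.0 / b) - random.expovariate(1.0 / b)
    return binned + int(round(noise))

# Because the noise is zero-mean and independent across relays, summing many
# obfuscated reports should keep the network-wide total close to the true sum.
print(obfuscate(25504922))
```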
I think this is mainly something we do for ourselves; otherwise these stats will stop working at some point (when a few more HSDirs join the network), and we will have to start asking people again to turn them on.
Finally, we have thought about these stats more than we have any other stat, and we still can't find any nasty attacks that can come from them. That's why we are OK with enabling them by default.
In any case, I don't really care about this so strongly. I'm personally fine with disabling these stats again if the community thinks so. But it would be nice to hear a convincing argument from qwerty1 (or whoever) on why (he used to say that these stats can be used to do guard discovery attacks, which makes no sense!).
Similar to ExitPortStatistics, these involve examining and processing user-controlled data in ways that aren't necessary for normal operation. Similar to ExitPortStatistics (which isn't enabled by default), they should not be enabled by default.
Any reasonably imaginable use of these is satisfiable without having them on by default. In fact, for those purposes mentioned, a single approximate measurement is enough.
arma: ping? I think that George's post above is pretty persuasive to me, but I would like you to chime in before we ship with this. Otherwise I can revert it for 0.2.7.3-alpha and we can revisit it when you have more time.
As evidenced by the fact that very few relays have these enabled, the vast majority of relay operators are either unaware of these statistics or, having been made aware of them, have chosen not to enable them.
The vast majority of relay operators are unaware of the statistics. I could go to tor-relays and ask people to add the config lines to their torrc, and enough would, for a while. That's how we got the stats that we have so far. But that approach is not scalable, especially when we could just give them a reasonable default.
There is no benefit to the users on whom the statistics are being collected
I think asn clearly described benefits. Users have already benefited greatly, in that the IETF decision to reserve .onion took these two stats into account, and we've averted several further terrible (and inaccurate) news stories. Those are great arguments to me.
The existing bandwidth statistics are already dangerous in terms of guard discovery (#13988 (moved)) and combining them with the statistics in this ticket enables easier attacks.
Like what? I know there's always the risk of attacks that we haven't thought of, but here there are concrete upsides, and a bunch of smart people who have thought about the issue and think it'll be fine. We're in much better shape here than we are in a lot of other areas of Tor (since there's so much to cover, and it's so complex, and there aren't enough smart people to go around).
There is no kind of privacy policy to notify users, as is required by (at least) EU law.
I'm sorry, but this part is just nonsense. I'm a fan of doing the right thing for ethical and moral reasons, but inaccurately pointing to laws doesn't help things here.
Finally, it hasn't been publicly explained anywhere what benefit, if any, there is to anyone, even DARPA, of increasing the precision of these statistics. What purpose are the current measurements from a limited number of relays too noisy for, that less noisy ones won't be?
I think asn answered this one well too -- the 1% threshold is a lower bound below which we know the extrapolations are garbage. I think his intuition is spot-on that when we extrapolate from 1.1%, we're very close to the point where we shouldn't trust the results.
(Also, indeed, DARPA is not pushing us about the precision here, just as they didn't actively push us about the stats in the first place. We decided that these are two stats that would be useful, and we convinced some of the funders that they would be useful and not harmful, and we turned out to be right so far.)
Not qwerty1 here, but I could imagine scenarios where bandwidth anomalies could allow correlation (low-use relays have large bandwidth spikes).
Right -- this is why we only aggregate stats when the relay is used in the 'rendezvous point' position -- which does not know the onion service, does not know the client, and is picked transiently for just that one circuit. So there should be nothing to correlate it to, and even if there is, it shouldn't help anybody learn anything useful.
Is there anything to be gained by collecting stats from all nodes as opposed to collecting stats from the top 5-10% bandwidth nodes?
Robustness of the result. Right now we are only guessing about how the millions of clients out there choose their paths. We don't know what Tor versions people are using, or whether there's some other implementation out there that chooses paths with different weights. And if things change, we won't know that either. Measuring all those other things, in order to be more confident that we only need the top relays here, seems like an unwise tradeoff.
Similar to ExitPortStatistics, these involve examining and processing user-controlled data in ways that aren't necessary for normal operation. Similar to ExitPortStatistics (which isn't enabled by default), they should not be enabled by default.
But I can also point to other stats, which are on by default, like the per-country client counts, or the per-country directory fetch counts, or the deciles for how long consensus fetches take to finish. I think the stats under discussion here are squarely in the latter category -- and probably quite a bit safer than some of the stats that we collect, and have collected basically forever.
Any reasonably imaginable use of these is satisfiable without having them on by default. In fact, for those purposes mentioned, a single approximate measurement is enough.
Alas, I disagree with this part too. Having longitudinal answers is critical to knowing whether we're interpreting the data correctly, and to knowing whether external changes have affected these statistics.
As just one example, when the botnet showed up in August 2013, my first thought was that the client count graph was broken or was being fed inaccurate data by some jerk who was just messing with our stats. But when I saw a corresponding increase in directory bytes (https://metrics.torproject.org/dirbytes.html) it became much more convincing that indeed we had an extra 5 million clients that week. And it turns out that the 'directory bytes fetched' graph has continued to be useful for us to understand the trends of the botnet after that month.
Imprecise knowledge about users is part of being an anonymity network. Tor is for the privacy and security of its users. It is not for the miscellaneous benefit of sponsors or researchers. Or maybe you think it is, perhaps quite strongly, in which case you should be more public about that and Tor should be forked ASAP.
arma: ping? I think that George's post above is pretty persuasive to me, but I would like you to chime in before we ship with this. Otherwise I can revert it for 0.2.7.3-alpha and we can revisit it when you have more time.
Thanks for being thorough here, Nick -- we indeed do not want to do things that are on balance harmful to users. It's always a balance, and it's important to think these things through.
I encourage everybody to keep paying attention to these topics, and to contribute to the tech reports and other analysis that folks are doing. In particular, I know that many research groups are currently working on ways to deploy PrivEx safely on Tor exit relays:
http://www0.cs.ucl.ac.uk/staff/G.Danezis/papers/ccsfp099-elahiA.pdf
And there is a lot of work to be done there, both in understanding what is safe, and in developing community norms and expectations.
Both of these stats have very low coverage right now. The HSDir stat, for example, is about to go below 1% coverage, which means that it will stop working (we require at least 1% coverage for these stats).
That threshold could have been more or less than 1% depending on how many enabled them initially. The extrapolated values from 1% aren't that far away from the values extrapolated from a single relay.
This is because not many relay operators are running the stats and some that did have disabled them because they didn't know if they work or if they are useful.
Why do more relays enable one of the stats than the other?
By turning this on by default, we stop hunting down relay operators to ask them to enable stats.
IMO there isn't a problem with sending the same email as before.
I'm also afraid that this low coverage might be causing our extrapolations to lie. For example, I could imagine that the bandwidth graph does not work very well with that low coverage since most clients that cause lots of traffic connect to relays that are not reporting statistics. This might be why the spikes are so big on that graph, since some days we might be lucky and get those busy clients in the stats. I think it will be very interesting to see how the graph develops when more of the network reports these stats (and we stop relying so much on our crazy extrapolation from 1%).
Those relays handle more than a 1% share of bandwidth. The graphs show that those extrapolated values have varied relatively little over time. Spikes can have many causes. The values are unlikely to be far off the real ones (unless they're being counted wrong), and they're in rough agreement with other estimates obtained by unrelated methods.
Replying to cypherpunks:
We don't know what Tor versions people are using, or whether there's some other implementation out there that chooses paths with different weights.
Or some implementation that purposely chooses a RP without these stats on, whether that is the default or not.
The existing bandwidth statistics are already dangerous in terms of guard discovery (#13988 (moved)) and combining them with the statistics in this ticket enables easier attacks.
Like what? I know there's always the risk of attacks that we haven't thought of, but here there are concrete upsides, and a bunch of smart people who have thought about the issue and think it'll be fine. We're in much better shape here than we are in a lot of other areas of Tor (since there's so much to cover, and it's so complex, and there aren't enough smart people to go around).
If Eve knows Alice is going to download a large chunk of data from Bob soon, of either one large size or another, but not what RP they will use, or even what Bob's HS address is, Eve does get extra information from these statistics. If a group of people is going to start using Ricochet soon on a date Eve doesn't know, again these stats leak information, in an unexpected way.
This might be too radical... but what about, where the operator hasn't specified a preference yet in torrc, enabling it on a random selection of some of those relays, say 10%? That would give the benefits without raising so many concerns about privacy, safety, and so on. It could be increased later, if that is still not robust enough.
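A minimal sketch of how such a default could be made deterministic per relay (so the choice is stable across restarts) by hashing the relay fingerprint; this is purely illustrative, not a proposed tor patch, and the fingerprint below is made up.

```python
import hashlib

def stats_on_by_default(fingerprint_hex, fraction=0.10):
    """Enable stats by default on roughly `fraction` of relays.

    Hashes the relay fingerprint and compares the result to a cutoff, so the
    same relay always makes the same choice unless the operator overrides it
    in torrc. Illustrative only."""
    digest = hashlib.sha256(bytes.fromhex(fingerprint_hex)).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < fraction

print(stats_on_by_default("A" * 40))  # hypothetical fingerprint
```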
enabling it on a random selection of some of those relays, say 10%.
Erm, either a config option is safe or it is not - that said, you can't be pregnant by just 10%.
I prepared some data sets that might help us gauge how realistic the attacks described above are. I'm talking specifically about the Ricochet signal detection of comment:28, and the inactive-relay infoleak of comment:18.
Here is a list of measurements per date of September 2015:
https://people.torproject.org/~asn/statsattack/hs_stats_per_date.txt
For example, you can see that on the 20th we received 74 reports with HS stats. One of those relays reported 25504922 rendezvous cells passing through it. We could use this data set to evaluate how likely it is to distinguish signals of various sizes when we consider the whole Tor network reporting such stats.
Here is a list of measurements that each relay reported during September 2015:
https://people.torproject.org/~asn/statsattack/hs_stats_per_relay.txt
For example, the relay AlBasala reported 21 measurements during September. It might be of interest to see how its measurements changed over time. We could use this dataset to find weak relays and see what kind of data is produced.
Of course, this is just a restricted set and it does not represent the full reality, but it might help us understand this problem better.
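As one rough way to use these files, here is a sketch of a crude detectability check: would injecting a signal of a given size push the extrapolated daily total noticeably outside its normal day-to-day variation? The baseline numbers below are made up, and a real analysis would need the actual file format and noise parameters.

```python
import statistics

def detectable(daily_totals, signal_size, threshold_sigmas=3.0):
    """Crude check: is `signal_size` larger than a few standard deviations of
    the day-to-day variation in the baseline totals?"""
    sigma = statistics.stdev(daily_totals)
    return signal_size > threshold_sigmas * sigma

# Hypothetical daily totals of extrapolated rendezvous cells (made-up values).
baseline = [2.1e9, 1.9e9, 2.4e9, 2.0e9, 2.2e9, 1.8e9, 2.3e9]
print(detectable(baseline, signal_size=5e7))  # is a 50M-cell signal visible?
```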
It is impossible that we will fix all 277 currently open 028 tickets before 028 releases. Time to move some out. This is my first pass through the "new" and "reopened" tickets, looking for things to move to ???.
Trac: Milestone: Tor: 0.2.8.x-final to Tor: 0.2.???