
Obfuscation Simulation

We'd like to improve directory-request statistics by obfuscating values on relays before they are reported to the directory authorities. A possible obfuscation method is to add Laplace noise to request counts for all ~250 countries, so that it's unclear whether a request was actually made by a user or is just noise.
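
As a rough illustration of the obfuscation step described above, the following sketch adds Laplace noise to per-country request counts. It is not the code a relay would run. The function names, the `epsilon` parameter, and the example counts are assumptions made for illustration; only `delta_f` comes from this page. The noise scale is assumed to be `delta_f / epsilon`, the usual choice for the Laplace mechanism.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centered at 0.

    The difference of two independent exponential variables with the
    same mean `scale` follows Laplace(0, scale).
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def obfuscate_counts(counts, delta_f, epsilon, rng):
    """Return a copy of `counts` with independent Laplace noise per country.

    `epsilon` is a hypothetical privacy parameter, not taken from this page.
    """
    scale = delta_f / epsilon
    return {cc: n + laplace_noise(scale, rng) for cc, n in counts.items()}


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the example is repeatable
    reported = {"de": 120, "us": 96, "zz": 0}  # made-up request counts
    print(obfuscate_counts(reported, delta_f=1, epsilon=0.3, rng=rng))
```

Note that a country with zero real requests ("zz" in this example) also receives noise, which is what makes it unclear whether a reported request came from a user.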

But before we do this we need to find out whether obfuscated values would still be useful enough to estimate user numbers in the Tor network. Let's run a simulation using archived descriptors.

Summary of findings and results

An early result of our analysis is that we need to be very careful about how much noise we add to the statistics reported by single relays. If we use delta_f = 1 as the parameter, user number estimates for any given country and day will vary by a few hundred users. This won't matter for countries with thousands or tens of thousands of daily users, including the top-10 countries by relay users. But it might affect the top-10 countries by bridge users, and it will almost certainly affect the top-10 countries by possible censorship events.

It's also worth noting that delta_f = 8 would be easier to justify as the parameter, because it would hide roughly 1 user in the noise rather than just a single request. But with an absolute difference of a few thousand users per day and country, this amount of noise would render most per-country statistics useless.
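
The jump from a few hundred to a few thousand users is consistent with how Laplace noise aggregates. As a rough sketch, assume each of n relays adds independent noise X_i drawn from Laplace(0, b) with scale b = delta_f / epsilon. Here n and epsilon are symbols introduced for illustration, not values taken from the simulation:

```latex
\operatorname{Var}\!\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} 2b^2 = 2 n b^2,
\qquad
\sigma_{\text{total}} = b\sqrt{2n} = \frac{\Delta f}{\varepsilon}\sqrt{2n}
```

Under these assumptions the standard deviation of the summed noise grows linearly in delta_f, so moving from delta_f = 1 to delta_f = 8 would be expected to scale the noise by about a factor of 8, all else being equal. The user number estimation algorithm may add further effects that this sketch does not capture.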

Possible next steps:

  • Review the simulation code for any issues affecting simulation results.
    • Figure out why the absolute difference for relays and delta_f = 1 is not at x = 0 for y = 50%.
    • Figure out why a smaller bin_size of 1 rather than 8 increases variance for delta_f = 8 in case of bridges.
  • Suggest better estimation algorithms that can handle the added noise better.
  • Include noise in statistics reported by relays.

Method(s)

We're taking relay and bridge descriptors as input, rewriting them to include different amounts of noise, and running the user number estimation algorithm on the obfuscated descriptors. Then we compare results with noise to results where we did not add any noise. As always, the code has more details.
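
The comparison step can be sketched as follows. This is only an illustration and not the logic of compare-simulations.R: the function names, the (country, date) keys, and the example numbers are made up, and it assumes the difference is kept signed (obfuscated minus baseline) so that a median near 0 would indicate unbiased noise.

```python
def differences(baseline, obfuscated):
    """Sorted signed differences (obfuscated - baseline) per (country, date) key."""
    return sorted(obfuscated[k] - baseline[k] for k in baseline if k in obfuscated)


def quantile(sorted_values, q):
    """Simple rank-based quantile of a sorted, non-empty list, with 0 <= q <= 1."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]


if __name__ == "__main__":
    # Made-up user number estimates, keyed by (country, date).
    baseline = {
        ("de", "2017-05-01"): 1000.0,
        ("fr", "2017-05-01"): 500.0,
        ("us", "2017-05-01"): 2000.0,
    }
    obfuscated = {
        ("de", "2017-05-01"): 1010.0,
        ("fr", "2017-05-01"): 480.0,
        ("us", "2017-05-01"): 2005.0,
    }
    diffs = differences(baseline, obfuscated)
    print("median difference:", quantile(diffs, 0.5))
```

Plotting such quantiles against the differences gives a curve of the kind shown in the graph under "Detailed results".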

Detailed results

Graph: Absolute difference to user number estimates per country and day when obfuscating directory-request statistics

CSV files

Code for simulation

git clone -b dirreqstats https://git.torproject.org/karsten/metrics-web.git
cd metrics-web
wget https://people.torproject.org/~karsten/volatile/libs-for-metrics-web.tar
tar xf libs-for-metrics-web.tar
mv lib shared/
cd modules/clients/
./run-simulation.sh     # this takes a while, and it produces quite some noise on the console!
R --slave -f compare-simulations.R

Related work

  • Karin Herm. Privacy analysis of Tor's in-memory statistics. Technical Report 2017-04-001, The Tor Project, April 2017.
  • David Goulet, Aaron Johnson, George Kadianakis, and Karsten Loesing. Hidden-service statistics reported by relays. Technical Report 2015-04-001, The Tor Project, April 2015.
  • George Kadianakis and Karsten Loesing. Extrapolating network totals from hidden-service statistics. Technical Report 2015-01-001, The Tor Project, January 2015.

Last modified on May 19, 2017, 9:42:31 AM
