Changes between Version 4 and Version 5 of org/teams/MetricsTeam/ObfuscationSimulationAnalysis


Timestamp:
May 19, 2017, 9:42:31 AM (2 years ago)
Author:
karsten
Comment:

Describe results a bit more.

  • org/teams/MetricsTeam/ObfuscationSimulationAnalysis

    v4 v5  
    77== Summary of findings and results ==
    88
     9An early result of our analysis is that we need to be very careful about how much noise we add to the statistics reported by single relays.  If we use `delta_f = 1` as the parameter, user number estimates for any given country and day will vary by a few hundred users.  This won't matter for countries with thousands or tens of thousands of daily users, including the [https://metrics.torproject.org/userstats-relay-table.html top-10 countries by relay users].  But it might affect the [https://metrics.torproject.org/userstats-bridge-table.html top-10 countries by bridge users], and it will almost certainly affect the [https://metrics.torproject.org/userstats-censorship-events.html top-10 countries by possible censorship events].
     10
     11It's also worth noting that `delta_f = 8` would be easier to justify as a parameter, because it would hide roughly 1 user in the noise rather than just a single request.  But with an absolute difference of a few thousand users per day and country, this amount of noise would render most per-country statistics useless.
     12
     13Possible next steps:
     14 - Review the simulation code for any issues affecting simulation results.
     15   - Figure out why the absolute difference for relays and `delta_f = 1` is not at `x = 0` for `y = 50%`.
     16   - Figure out why a smaller `bin_size` of 1 rather than 8 increases variance for `delta_f = 8` in the case of bridges.
     17 - Suggest better estimation algorithms that can handle the added noise better.
     18 - Include noise in statistics reported by relays.
     19
    920== Method(s) ==
     21
     22We're taking relay and bridge descriptors as input, rewriting them to include different amounts of noise, and running the user number estimation algorithm on the obfuscated descriptors.  Then we compare the results with noise to the results where we did not add any noise.  As always, the code has more details.
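
The comparison described above can be illustrated with a minimal sketch. It is not the simulation code itself and rests on several assumptions: that the noise is Laplace-distributed with scale `delta_f / epsilon`, that `epsilon = 0.3`, that counts are rounded up to multiples of `bin_size` before noise is added, and that users are estimated by dividing total requests by a hypothetical `REQUESTS_PER_USER = 10`. All function names and the relay counts are invented for illustration.

```python
import math
import random
import statistics

REQUESTS_PER_USER = 10  # assumption: average directory requests per user per day


def bin_count(count, bin_size):
    """Round a request count up to the next multiple of bin_size."""
    return math.ceil(count / bin_size) * bin_size


def obfuscate(count, delta_f, epsilon, bin_size, rng):
    """Bin one relay's count, then add zero-centred Laplace noise.

    The difference of two exponential samples with mean `scale` is
    Laplace-distributed with that scale (assumed noise model).
    """
    scale = delta_f / epsilon
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return bin_count(count, bin_size) + noise


def estimate_users(counts):
    """Toy estimator: total reported requests divided by requests per user."""
    return sum(counts) / REQUESTS_PER_USER


def median_abs_difference(counts, delta_f, epsilon=0.3, bin_size=1,
                          runs=200, seed=0):
    """Median absolute difference between noisy and noise-free estimates."""
    rng = random.Random(seed)
    baseline = estimate_users(counts)  # estimate without any added noise
    diffs = []
    for _ in range(runs):
        noisy = [obfuscate(c, delta_f, epsilon, bin_size, rng) for c in counts]
        diffs.append(abs(estimate_users(noisy) - baseline))
    return statistics.median(diffs)


if __name__ == "__main__":
    counts = [40] * 1000  # hypothetical: 1000 relays, 40 requests each
    for delta_f in (1, 8):
        print(delta_f, round(median_abs_difference(counts, delta_f), 1))
```

Under these assumptions the noise scale grows linearly with `delta_f`, so the spread of the estimate for `delta_f = 8` is about eight times that for `delta_f = 1`. The absolute numbers depend entirely on the made-up inputs and should not be read as simulation results.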
    1023
    1124== Detailed results ==