Make entropy-over-time graph

added component::metrics/analysis parent::6460 priority::medium resolution::wontfix status::closed type::enhancement labels

For further motivation/background, see https://blog.torproject.org/blog/research-problem-measuring-safety-tor-network

Sounds like a fun task. Damian, if you give me some numbers in a .csv file, I'm happy to write the graphing code in R. All code for this task could live in task-6232/ in metrics-tasks.git.

I'm moving this ticket to the Analysis component and changing the summary. The first step should be to extract data and manually make graphs to see if they tell us what we want to know. Automating everything and extending the metrics website is more complex than it may seem and should be step two.

Trac:
Summary: Add entropy-over-time graph to network page to Make entropy-over-time graph
Component: Metrics Website to Analysis

Trac:
Cc: atagar to atagar, gsathya

https://github.com/gsathya/tor-measurements/blob/master/pyentropy.py

atagar - Can you please check that I'm not doing anything that would lead to memory errors? arma - Is the math correct?

Trac:
Status: new to needs_review

There seems to be a problem with Python 2.6.6 which cannot convert float to Decimal directly. lunar and asn rewrote that part:

-             entropy += -(router.probability*Decimal(math.log(router.probability, 2)))
+             entropy += -(router.probability*router.probability.ln()/Decimal('2').ln())

Can you confirm that this code is still correct?
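For anyone checking the rewritten line: it relies on the identity log2(p) = ln(p) / ln(2), using Decimal.ln() so that no float ever has to be converted to Decimal. Below is a minimal sketch of that equivalence. The helper names log2_decimal and entropy_term are made up for illustration and are not taken from pyentropy.py.

```python
import math
from decimal import Decimal


def log2_decimal(p):
    """Base-2 logarithm of a Decimal, computed as ln(p) / ln(2).

    Hypothetical helper: stays entirely in Decimal arithmetic, so it avoids
    the float-to-Decimal conversion that Python 2.6 rejects.
    """
    return p.ln() / Decimal(2).ln()


def entropy_term(p):
    """Contribution of one probability p to the entropy: -p * log2(p)."""
    return -(p * log2_decimal(p))


if __name__ == "__main__":
    p = Decimal("0.25")
    # Compare the Decimal-only result with the float-based one.
    print(log2_decimal(p), math.log(0.25, 2))
    print(entropy_term(p))
```

The two forms should agree up to floating-point precision, so the change should not alter the computed entropy beyond rounding.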

Do you want to add a new task-6232 directory in metrics-tasks.git and add your code there? You could send me a patch generated with git format-patch that I'd merge into the metrics-tasks repo.

Trac:
paste_177731

write to file immediately instead of accumulating results into the global var

Trac:
pyentropy.hacked.py

Fixes problems with entropy calculation.

Trac:
Cc: atagar, gsathya to atagar, gsathya, identity.function@gmail.com

Trac:

See the attached graph for entropy values in the first half of 2012. These are calculated using pyentropy.hacked.py.

I looked up the two drops in February and April. The consensus weights there are based on self-reported descriptor bandwidths, because only 2 votes contained measured bandwidth values.

Updated the script to -

add a usage function (as atagar wanted)
take exit and guard nodes into account

https://github.com/gsathya/metrics-tasks/commit/98859a1e367c4b5728fb0d7bc2c5acf8e99d2208

A clarification for the math which changed between the initial code and my update:

Assume the early days of Tor, where we have only a few relays in the consensus. In fact, it's 7 relays with the following bandwidths: [1, 1, 2, 2, 2, 3, 4]

The old code determined the total bandwidth (15, the sum of all bandwidths in the list) and calculated the negated sum: -(1/15 * log2(1/15) + 1/15 * log2(1/15) + 2/15 * log2(2/15) + 2/15 * log2(2/15) + 2/15 * log2(2/15) + 3/15 * log2(3/15) + 4/15 * log2(4/15))

The problem is the probabilities. E.g., for the value '1', we expect a probability of 2/7 (there are two instances of '1' among all 7 values) and not 1/15.

The uploaded version fixes that. It builds a hash table of the form { bandwidth_value => observed occurrences }. Then it iterates over the hash table, takes p = occurrences / number of relays for each entry, and sums -p * log2(p) to get the entropy.
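To make the difference between the two calculations concrete, here is a rough sketch of both on the 7-relay example above. It assumes Python 3 (for math.log2). The function names are invented for illustration and this is not the code from pyentropy.hacked.py.

```python
import math
from collections import Counter


def entropy(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)), skipping zero terms."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def bandwidth_share_entropy(bandwidths):
    """Old calculation: each relay's probability is bandwidth / total bandwidth."""
    total = sum(bandwidths)
    return entropy(bw / total for bw in bandwidths)


def value_frequency_entropy(bandwidths):
    """Updated calculation: build { bandwidth_value => occurrences } and use
    occurrences / number of relays as the probability of each value."""
    counts = Counter(bandwidths)
    n = len(bandwidths)
    return entropy(count / n for count in counts.values())


if __name__ == "__main__":
    bandwidths = [1, 1, 2, 2, 2, 3, 4]
    print("per-relay bandwidth shares:", bandwidth_share_entropy(bandwidths))
    print("bandwidth value frequencies:", value_frequency_entropy(bandwidths))
```

Note that the two functions measure different distributions: the first is over which relay gets picked when selection is weighted by bandwidth, the second is over which bandwidth value a relay has.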

Trac:

See the graph of all relays/Exit relays/Guard relays.

Trac:
Cc: atagar, gsathya, identity.function@gmail.com to atagar, gsathya, identity.function@gmail.com, robgjansen

Diversity has 2 dimensions: bandwidth and location. We should capture both.

Bandwidth diversity means how likely each relay is to be chosen under Tor's current bandwidth-weighting scheme. Highest bandwidth diversity means each relay is chosen with the same probability.

Location diversity means how likely a relay belongs to a specific geographic authoritative entity. Highest security means that each geographic authoritative entity controls the same number of relays.

When analyzing the actual diversity of a given Tor network (i.e. consensus), we should include both bandwidth and location. One way to do this is to use entropy of bandwidth per authoritative location. For example, we can split the Internet into locations (e.g. countries or ASes) and add up all the bandwidth for the relays in each location. Then, we can compute the entropy for each location.

Now since entropy alone probably lacks meaning in terms of diversity, we would also like to know the maximum diversity of a given Tor network (i.e. consensus) we could ever hope to obtain (under this analysis approach). This can be computed by taking the total bandwidth in the consensus or some other interval and equally distributing it to all locations. Then compute the entropy of each location, and sum them to find the maximum diversity of the network during that consensus or interval.

Finally, we quantify the degree of diversity of the network during an interval as the current diversity divided by the maximum diversity during that interval. This will allow us to know how close to optimal we currently are. The current diversity and maximum diversity entropy graphs are probably also useful.
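One possible reading of this proposal, as a rough sketch in Python 3: sum the bandwidth per location, treat each location's share of total bandwidth as its probability, and compare the resulting entropy with the entropy of an even split (which works out to log2 of the number of locations). The (location, bandwidth) tuple format and all function names here are assumptions for illustration; nothing below comes from the ticket's scripts.

```python
import math
from collections import defaultdict


def location_entropy(relays):
    """Entropy (bits) of bandwidth shares per location.

    relays: list of (location, bandwidth) tuples; the format is an assumption.
    """
    per_location = defaultdict(float)
    for location, bandwidth in relays:
        per_location[location] += bandwidth
    total = sum(per_location.values())
    if total <= 0:
        return 0.0
    shares = (bw / total for bw in per_location.values())
    return -sum(p * math.log2(p) for p in shares if p > 0)


def max_location_entropy(relays):
    """Entropy if total bandwidth were split equally across all locations."""
    locations = {location for location, _ in relays}
    return math.log2(len(locations)) if locations else 0.0


def diversity_degree(relays):
    """Current diversity divided by maximum diversity, in [0, 1].

    Returns 0.0 when there is at most one location (maximum entropy is 0);
    that convention is a choice made for this sketch.
    """
    maximum = max_location_entropy(relays)
    return location_entropy(relays) / maximum if maximum > 0 else 0.0


if __name__ == "__main__":
    relays = [("DE", 3.0), ("US", 1.0)]
    print(location_entropy(relays), max_location_entropy(relays),
          diversity_degree(relays))
```

A degree of 1.0 would mean bandwidth is spread evenly across the observed locations; values near 0 mean one location dominates.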

Replying to robgjansen:

Location diversity means how likely a relay belongs to a specific geographic authoritative entity. Highest security means that each geographic authoritative entity controls the same number of relays.

This should be "percentage of bandwidth", not "number of relays".
