Opened 7 years ago

Closed 2 years ago

# Make entropy-over-time graph

Reported by: arma | Priority: Medium | Component: Metrics/Analysis | Cc: atagar, gsathya, identity.function@…, robgjansen, amj703 | Parent: #6460

### Description

We have number-of-relays-over-time, and amount-of-bandwidth-over-time. We don't yet have entropy-of-consensus-weights-over-time.

The way to get it is to sum up all the weights in the consensus, then calculate, for each relay in the consensus, the probability it will be chosen (its weight divided by the sum of weights) times -log_2 of that probability. Then sum up all those terms, and you have the Shannon entropy of that consensus. As an intuitive check: if the Shannon entropy is b bits, that's equivalent to a Tor network with 2^b nodes that all have the same weight.
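As a sketch of that calculation (a hypothetical helper, not the script this ticket later produces):

```python
import math

def consensus_entropy(weights):
    """Shannon entropy, in bits, of a list of consensus weights."""
    total = sum(weights)
    entropy = 0.0
    for w in weights:
        if w > 0:
            p = w / total                  # probability this relay is chosen
            entropy += -p * math.log(p, 2)
    return entropy
```

A uniform network of 8 equally weighted relays gives log2(8) = 3 bits, matching the intuition that b bits of entropy correspond to 2^b equally weighted relays.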

http://en.wikipedia.org/wiki/Entropy_%28information_theory%29 for plenty more background.

Damian said he'd be happy to use stem to help extract the numbers, so from Karsten's perspective hopefully this is just a graphing exercise.

Once we have this one going, I will likely ask for the same sort of graph as if we used descriptor bandwidths for weights (that is, as if we didn't have bwauths running), so we can compare how much entropy we're giving up in exchange for the performance improvements we get from bwauths.

And after that, I'll want to know about the subset of nodes that have the Guard flag, and the subset that have the Exit flag. One step at a time, I know, but I figure I should give a sense of the scope of the future plans too. Thanks!

## Child Tickets


### comment:2 Changed 7 years ago by karsten

Component: Metrics Website → Analysis
Summary: Add entropy-over-time graph to network page → Make entropy-over-time graph

Sounds like a fun task. Damian, if you give me some numbers in a .csv file, I'm happy to write the graphing code in R. All code for this task could live in task-6232/ in metrics-tasks.git.

I'm moving this ticket to the Analysis component and changing the summary. The first step should be to extract data and manually make graphs to see if they tell us what we want to know. Automating everything and extending the metrics website is more complex than it may seem and should be step two.

### comment:4 Changed 7 years ago by gsathya

Status: new → needs_review

atagar - Can you please check if I'm not doing something that would lead to memory errors?
arma - Is the math correct?

### comment:5 Changed 7 years ago by karsten

There seems to be a problem with Python 2.6.6, which cannot convert a float to a Decimal directly. lunar and asn rewrote that part:

```
-entropy += -(router.probability*Decimal(math.log(router.probability, 2)))
+entropy += -(router.probability*router.probability.ln()/Decimal('2').ln())
```
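The rewritten line relies on the identity log2(x) = ln(x)/ln(2), which only needs `Decimal.ln()` and so avoids the float-to-Decimal conversion. A quick standalone check that both forms agree (not the ticket's script itself):

```python
import math
from decimal import Decimal

p = Decimal('0.125')
via_float = -(float(p) * math.log(float(p), 2))   # original form, needs float
via_decimal = -(p * p.ln() / Decimal('2').ln())   # rewritten form, Decimal only

# Both evaluate to 0.375 for p = 1/8.
assert abs(via_float - float(via_decimal)) < 1e-12
```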

Can you confirm that this code is still correct?

Do you want to add a new task-6232 directory in metrics-tasks.git and add your code there? You could send me a patch generated with `git format-patch` that I'd merge into the metrics-tasks repo.

### Changed 7 years ago by asn

write to file immediately instead of accumulating results into the global var

### Changed 7 years ago by phw

Fixes problems with entropy calculation.

### comment:7 Changed 7 years ago by karsten

See the attached graph for entropy values in the first half of 2012. These are calculated using `pyentropy.hacked.py`.

I looked up the two drops in February and April. The consensus weights there are based on self-reported descriptor bandwidths, because only 2 votes contained measured bandwidth values.

### comment:8 Changed 7 years ago by gsathya

Updated the script to:
1) add a usage function (as atagar wanted)
2) take exit and guard nodes into account

### comment:9 follow-up:  14 Changed 7 years ago by phw

A clarification for the math which changed between the initial code and my update:

Assume the early days of Tor, where we have only a few relays in the consensus. In fact, it's 7 relays with the following bandwidth:
`[1, 1, 2, 2, 2, 3, 4]`

The old code determined the total bandwidth (15, the sum of all bandwidths in the list) and calculated:
`1/15 * log2(1/15) + 1/15 * log2(1/15) + 2/15 * log2(2/15) + 2/15 * log2(2/15) + 2/15 * log2(2/15) + 3/15 * log2(3/15) + 4/15 * log2(4/15)`

The problem is the probabilities. E.g., for the value '1', we expect a probability of 2/7 (there are two instances of '1' among all 7 values), not 1/15.

The uploaded version fixes that. It builds a hash table of the form: `{ bandwidth_value => observed occurrences }`. Then, it iterates over the hash table, adds up the result and we have the entropy.

### comment:10 Changed 7 years ago by karsten

See the graph of all relays/Exit relays/Guard relays.

### comment:12 follow-up:  13 Changed 7 years ago by robgjansen

Diversity has 2 dimensions: bandwidth and location. We should capture both.

Bandwidth diversity captures how likely each relay is to be chosen under Tor's current bandwidth-weighting scheme. Highest bandwidth diversity means each relay is chosen with the same probability.

Location diversity means how likely a relay belongs to a specific geographic authoritative entity. Highest security means that each geographic authoritative entity controls the same number of relays.

When analyzing the actual diversity of a given Tor network (i.e. consensus), we should include both bandwidth and location. One way to do this is to use entropy of bandwidth per authoritative location. For example, we can split the Internet into location (i.e. countries or ASes) and add up all the bandwidth for the relays in that location. Then, we can compute the entropy for each location.

Now since entropy alone probably lacks meaning in terms of diversity, we would also like to know the maximum diversity of a given Tor network (i.e. consensus) we could ever hope to obtain (under this analysis approach). This can be computed by taking the total bandwidth in the consensus or some other interval and equally distributing it to all locations. Then compute the entropy of each location, and sum them to find the maximum diversity of the network during that consensus or interval.

Finally, we quantify the degree of diversity of the network during an interval as the current diversity divided by the maximum diversity during that interval. This will allow us to know how close to optimal we currently are. The current-diversity and maximum-diversity entropy graphs are probably also useful.
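Under the assumptions above, the degree-of-diversity quotient could be sketched like this (function names are made up for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

def degree_of_diversity(location_bandwidths):
    """Current entropy over locations divided by the maximum achievable
    entropy, reached when bandwidth is spread equally over all locations."""
    total = sum(location_bandwidths)
    current = entropy(b / total for b in location_bandwidths)
    maximum = math.log(len(location_bandwidths), 2)  # entropy of the uniform case
    return current / maximum
```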

### comment:13 in reply to:  12 Changed 7 years ago by robgjansen

Location diversity means how likely a relay belongs to a specific geographic authoritative entity. Highest security means that each geographic authoritative entity controls the same number of relays.

This should be "percentage of bandwidth", not "number of relays".

### comment:14 in reply to:  9 Changed 7 years ago by phw

A clarification for the math which changed between the initial code and my update:

Assume the early days of Tor, where we have only a few relays in the consensus. In fact, it's 7 relays with the following bandwidth:
`[1, 1, 2, 2, 2, 3, 4]`

The old code determined the total bandwidth (15, the sum of all bandwidths in the list) and calculated:
`1/15 * log2(1/15) + 1/15 * log2(1/15) + 2/15 * log2(2/15) + 2/15 * log2(2/15) + 2/15 * log2(2/15) + 3/15 * log2(3/15) + 4/15 * log2(4/15)`

The problem is the probabilities. E.g., for the value '1', we expect a probability of 2/7 (there are two instances of '1' among all 7 values), not 1/15.

The uploaded version fixes that. It builds a hash table of the form: `{ bandwidth_value => observed occurrences }`. Then, it iterates over the hash table, adds up the result and we have the entropy.

Note that the above is probably not what we want. It calculates the entropy based on the probability of a bandwidth instead of a relay being selected. So gsathya was right all along.

### comment:15 Changed 7 years ago by gsathya

Revert back to original algorithm - Patch by phw

### comment:16 follow-up:  17 Changed 7 years ago by asn

Status: needs_review → new

Did anyone run the newest version of this with the 2011/2012 consensus set?

### comment:17 in reply to:  16 ; follow-up:  18 Changed 7 years ago by karsten

Did anyone run the newest version of this with the 2011/2012 consensus set?

I just did, but with the consensuses since 2007. See the newly attached graph.

gsathya, I get a lot of these errors when parsing older consensuses: `"Error: amount of routers must be > 0."` I uploaded one such consensus here. Mind having a look?

### comment:18 in reply to:  17 ; follow-up:  19 Changed 7 years ago by gsathya

Status: new → needs_information

gsathya, I get a lot of these errors when parsing older consensuses: `"Error: amount of routers must be > 0."` I uploaded one such consensus here. Mind having a look?

I also have Rob's entropy calculation based on the probability that a country is chosen (bw of all relays in a particular country / total bw of all relays in all countries), but I have no idea what to call it. Also, should this be one value per consensus or one value per country per consensus? I also have another entropy calculation based on the probability that a relay in a particular country is chosen (bw of a relay in a particular country / total bw of all relays in that country); I have no idea what to call this either.

### comment:19 in reply to:  18 ; follow-up:  20 Changed 7 years ago by robgjansen

I also have Rob's entropy calculation based on the probability that a country is chosen (bw of all relays in a particular country / total bw of all relays in all countries), but I have no idea what to call it. Also, should this be one value per consensus or one value per country per consensus?

Let p_i be the bw of all relays in country i
Compute this for all countries i
Let s be the total bandwidth of all relays

The entropy of this particular consensus is then computed as:

-( sum_(all i) p_i log_2 (p_i) )

[note the negative sign in front of the sum]

You could call this the diversity of a consensus in the face of a national level adversary. Does this make sense?

### comment:20 in reply to:  19 ; follow-up:  21 Changed 7 years ago by robgjansen

I also have Rob's entropy calculation based on the probability that a country is chosen (bw of all relays in a particular country / total bw of all relays in all countries), but I have no idea what to call it. Also, should this be one value per consensus or one value per country per consensus?

Sorry, I need to start doing preview before submitting my comments. I've corrected my comment below.

Let b_i be the bw of all relays in country i
Let s be the total bandwidth of all relays

Let p_i be b_i / s (i.e. the probability that a selected relay belongs to country i)

Compute this for all countries i

The entropy of this particular consensus is then computed as:

-( sum_(all i) p_i log_2 (p_i) )

[note the negative sign in front of the sum]

You could call this the diversity of a consensus in the face of a national level adversary. Does this make sense?
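With the corrected definitions (b_i, s, and p_i = b_i / s), the per-consensus country entropy could be computed roughly as follows (a sketch, not the ticket's code):

```python
import math
from collections import defaultdict

def country_entropy(relays):
    """Entropy over countries; relays is a list of (country_code, bandwidth)."""
    by_country = defaultdict(int)
    for country, bw in relays:
        by_country[country] += bw          # b_i: total bandwidth in country i
    s = sum(by_country.values())           # s: total bandwidth of all relays
    # H = -( sum_(all i) p_i * log_2(p_i) ), with p_i = b_i / s
    return -sum((b / s) * math.log(b / s, 2)
                for b in by_country.values() if b > 0)
```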

### comment:21 in reply to:  20 ; follow-up:  22 Changed 7 years ago by gsathya

Status: needs_information → needs_review

Sorry, I need to start doing preview before submitting my comments. I've corrected my comment below.

Let b_i be the bw of all relays in country i
Let s be the total bandwidth of all relays

Let p_i be b_i / s (i.e. the probability that a selected relay belongs to country i)

Compute this for all countries i

The entropy of this particular consensus is then computed as:

-( sum_(all i) p_i log_2 (p_i) )

[note the negative sign in front of the sum]

You could call this the diversity of a consensus in the face of a national level adversary. Does this make sense?

### comment:22 in reply to:  21 ; follow-up:  23 Changed 7 years ago by robgjansen

Any graphs to share?

### comment:23 in reply to:  22 ; follow-up:  24 Changed 7 years ago by gsathya

Any graphs to share?

Karsten

### comment:24 in reply to:  23 Changed 7 years ago by gsathya

Any graphs to share?

Karsten

Karsten?

### comment:25 Changed 7 years ago by karsten

Merged the changes, thanks! The analysis is still running. I'll post results once I have them.

### comment:26 Changed 7 years ago by karsten

There, I attached a new graph.

### comment:27 Changed 7 years ago by karsten

It seems that #6329 is quite related here. It even comes with code.

### comment:28 follow-up:  29 Changed 7 years ago by robgjansen

We are still missing the graph where each of the plots are represented as a percentage, computed as: (entropy / max_entropy). See my first post above.

### comment:29 in reply to:  28 ; follow-up:  30 Changed 7 years ago by asn

We are still missing the graph where each of the plots are represented as a percentage, computed as: (entropy / max_entropy). See my first post above.

Say, if the entropy of the current consensus is 8 bits, and the maximum possible entropy is 11 bits, do you think that 70% would be the correct figure in this case? It seems that the logarithmic scale gives us an inflated percentage.

Maybe we should transform the values to the linear scale (`2^8/2^11`) to get a more intuitive percentage. Or is there a better way to represent this comparison?
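For the 8-vs-11-bit example, the two scales give quite different impressions (a quick illustration):

```python
# Entropy and maximum entropy from the example above.
entropy, max_entropy = 8, 11               # bits

log_ratio = entropy / max_entropy          # degree of anonymity, about 0.73
linear_ratio = 2**entropy / 2**max_entropy # 256 / 2048 = 0.125

# On the log scale the network looks ~73% "ideal"; on the linear scale the
# equivalent uniform network has only 12.5% as many nodes as the ideal one.
```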

### comment:30 in reply to:  29 ; follow-up:  31 Changed 7 years ago by robgjansen

We are still missing the graph where each of the plots are represented as a percentage, computed as: (entropy / max_entropy). See my first post above.

Say, if the entropy of the current consensus is 8 bits, and the maximum possible entropy is 11 bits, do you think that 70% would be the correct figure in this case? It seems that the logarithmic scale gives us an inflated percentage.

Maybe we should transform the values to the linear scale (`2^8/2^11`) to get a more intuitive percentage. Or is there a better way to represent this comparison?

I don't understand your objection, but it shows that the meaning of entropy for describing anonymity is not well understood. (I've been asking 'WTF does entropy mean?' all along.)

Read this paper, particularly Section 3.

### comment:31 in reply to:  30 Changed 7 years ago by asn

We are still missing the graph where each of the plots are represented as a percentage, computed as: (entropy / max_entropy). See my first post above.

Say, if the entropy of the current consensus is 8 bits, and the maximum possible entropy is 11 bits, do you think that 70% would be the correct figure in this case? It seems that the logarithmic scale gives us an inflated percentage.

Maybe we should transform the values to the linear scale (`2^8/2^11`) to get a more intuitive percentage. Or is there a better way to represent this comparison?

I don't understand your objection, but it shows that the meaning of entropy for describing anonymity is not well understood. (I've been asking 'WTF does entropy mean?' all along.)

Read this paper, particularly Section 3.

Oh, I did not oppose your idea; I also agree that entropy and bits are not intuitive ways of measuring diversity or whatever we are trying to measure here.

What I tried to say is that maybe degree of anonymity is not the most intuitive way of graphing "how far away are we from the ideal network (wrt bandwidth weights) of n nodes". I'm saying this because entropy is measured in logarithmic units, and the logarithmic scale is harder to understand than the linear scale (IMO).

For example, what does a `0.76` degree of anonymity mean? It does not mean that the current Tor network is 76% of a perfect Tor network, because of the logarithmic nature of entropy. That's the reason I asked whether it would be more helpful to convert the <entropy>/<max entropy> figure to the linear scale.

In any case, degree of anonymity should be more intuitive than plain entropy: I coded the change in branch `bug6232` of `git://gitorious.org/metrics-tasks/metrics-tasks.git` (https://gitorious.org/metrics-tasks/metrics-tasks/commit/e54b4278361184bf6ff90f189898668160474f2f).

Also, thanks for the link to the paper.

### comment:32 follow-ups:  33  34 Changed 7 years ago by karsten

Status: needs_review → needs_revision

A few comments after re-reading the whole ticket:

• I merged George's patch (thanks!) that outputs degree of anonymity instead of plain entropy. I'll run it shortly and will post the resulting graph once I have it.
• Should we add a second graph that plots entropy and maximum entropy as two lines, as Rob suggested above? That graph should probably consist of 2 x 2 sub graphs for the four cases we distinguish. Should be trivial to extend the script to output entropy and max_entropy along with their quotient. I'll look into that and write the graphing code in a bit.
• I wonder if entropies based on subsets of Exit and Guard flagged relays are correct. I spent yesterday afternoon on trying to learn how path selection really works (#5755). I think we'll have to take bandwidth weights as reported in the footer section of a consensus into account, too. Those bandwidth weights influence, for example, how to weight the consensus weight of a relay with the Exit flag and a relay with Exit and Guard flag for the exit position. In a consensus published yesterday, the former was weighted with Wee=1.0, whereas the latter was weighted with Wed=0.4272. Similarly, bandwidth weights for the guard position were Wgd=0.2864 and Wgg=0.6446, so quite different. If we only look at the Exit or Guard flag of a relay, we might be quite off. But before we change anything here, I want to hear back from Mike or Roger if my understanding of path selection is correct.
• The GeoIP database is part of the sources in metrics-tasks.git, right? Can we change that and have users provide their own geoip file? I'm worried that the current "a1" madness influences the results, and I'd like to swap the current database with the one from February which didn't have "a1" relays all over.
• Can we add AS-based entropy values, too? There's an AS database from Maxmind that we could use here. Again, users could provide that database file, so there's no need to commit it to the Git repo.
• In the longer term, do we want to include family diversity? That metric would consider all relays in the same relay family as one entity, similar to how we consider all relays in the same country as one entity in the country diversity metric. I admit that it's hard to extract families using the current code, because we'd have to parse server descriptors for that, too. I'm also not certain that the results will be meaningful. So, longer-term.
• A shorter-term goal could be to compute bandwidth diversity based on the relays' advertised bandwidths, not based on their consensus weights. Relays report their advertised bandwidth in their server descriptor; it's the minimum of bandwidth rate, burst, and observed bandwidth. We'll want to compute bandwidth diversity for all relays and for exit/guard subsets as well as location diversity. This is what Roger was referring to in the last but one paragraph of the ticket description. Again, I admit that it's non-trivial to extract advertised bandwidths, because we'll have to parse server descriptors. But it's easier to compute than relay families.
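For reference, the advertised bandwidth described above is just the minimum of the three reported values (a trivial sketch; the argument names mirror stem's server-descriptor attributes, but the helper itself is hypothetical):

```python
def advertised_bandwidth(average_bandwidth, burst_bandwidth, observed_bandwidth):
    """Advertised bandwidth: the minimum of rate, burst and observed."""
    return min(average_bandwidth, burst_bandwidth, observed_bandwidth)

# A relay reporting rate 5 MB/s, burst 10 MB/s and observed 3.2 MB/s
# therefore advertises 3.2 MB/s.
```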

gsathya, are you up for more coding fun? Didn't you worry that this task might be too trivial for a thesis? Hah! :)

### comment:33 in reply to:  32 Changed 7 years ago by karsten

• I merged George's patch (thanks!) that outputs degree of anonymity instead of plain entropy. I'll run it shortly and will post the resulting graph once I have it.

Here's the graph showing degree of anonymity. For reference, it took my machine 8:30 hours to compute the entropy values.

• Should we add a second graph that plots entropy and maximum entropy as two lines, as Rob suggested above? That graph should probably consist of 2 x 2 sub graphs for the four cases we distinguish. Should be trivial to extend the script to output entropy and max_entropy along with their quotient. I'll look into that and write the graphing code in a bit.

Here's another graph showing entropy and maximum entropy.

The remaining comments from above are still valid, so we're not done yet.

### comment:34 in reply to:  32 ; follow-up:  35 Changed 7 years ago by gsathya

Excellent, more coding!

A few comments after re-reading the whole ticket:

• I wonder if entropies based on subsets of Exit and Guard flagged relays are correct. I spent yesterday afternoon on trying to learn how path selection really works (#5755). I think we'll have to take bandwidth weights as reported in the footer section of a consensus into account, too. Those bandwidth weights influence, for example, how to weight the consensus weight of a relay with the Exit flag and a relay with Exit and Guard flag for the exit position. In a consensus published yesterday, the former was weighted with Wee=1.0, whereas the latter was weighted with Wed=0.4272. Similarly, bandwidth weights for the guard position were Wgd=0.2864 and Wgg=0.6446, so quite different. If we only look at the Exit or Guard flag of a relay, we might be quite off. But before we change anything here, I want to hear back from Mike or Roger if my understanding of path selection is correct.
• The GeoIP database is part of the sources in metrics-tasks.git, right? Can we change that and have users provide their own geoip file? I'm worried that the current "a1" madness influences the results, and I'd like to swap the current database with the one from February which didn't have "a1" relays all over.
• Can we add AS-based entropy values, too? There's an AS database from Maxmind that we could use here. Again, users could provide that database file, so there's no need to commit it to the Git repo.

Yep, all three comments can be addressed pretty easily.

• In the longer term, do we want to include family diversity? That metric would consider all relays in the same relay family as one entity, similar to how we consider all relays in the same country as one entity in the country diversity metric. I admit that it's hard to extract families using the current code, because we'd have to parse server descriptors for that, too. I'm also not certain that the results will be meaningful. So, longer-term.
• A shorter-term goal could be to compute bandwidth diversity based on the relays' advertised bandwidths, not based on their consensus weights. Relays report their advertised bandwidth in their server descriptor; it's the minimum of bandwidth rate, burst, and observed bandwidth. We'll want to compute bandwidth diversity for all relays and for exit/guard subsets as well as location diversity. This is what Roger was referring to in the last but one paragraph of the ticket description. Again, I admit that it's non-trivial to extract advertised bandwidths, because we'll have to parse server descriptors. But it's easier to compute than relay families.

Actually stem can parse the server descriptors now. So this wouldn't be hard at all. I can teach the script to use stem for both families and advertised bandwidths.

gsathya, are you up for more coding fun? Didn't you worry that this task might be too trivial for a thesis? Hah! :)

Heh indeed! Fun :)

### comment:35 in reply to:  34 ; follow-up:  36 Changed 7 years ago by karsten

• I wonder if entropies based on subsets of Exit and Guard flagged relays are correct. [...] I want to hear back from Mike or Roger if my understanding of path selection is correct.

Mike says in #5755 that the calculation of P_exit and P_guard in that ticket looks correct.

That means, instead of just looking at the subsets of Exit or Guard flagged relays and using their unmodified consensus weights, we'll want to do the following:

• To calculate entropy of exits,
• consider only those relays with the Exit flag,
• multiply the consensus weight of a relay with Exit and Guard flag with Wed,
• multiply the consensus weight of a relay with only Exit flag with Wee,
• use "weighted" consensus weights wherever unweighted consensus weights were used before.
• To calculate entropy of guards,
• consider only those relays with the Guard flag,
• multiply the consensus weight of a relay with Exit and Guard flag with Wgd,
• multiply the consensus weight of a relay with only Guard flag with Wgg,
• use "weighted" consensus weights wherever unweighted consensus weights were used before.
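A sketch of those weighting steps for the exit position (the helper is hypothetical; the weight names are the ones from the consensus footer, assumed here to be pre-scaled to 1.0):

```python
def weighted_exit_weight(consensus_weight, flags, bw_weights):
    """Apply the consensus bandwidth-weights for the exit position.

    bw_weights maps names like 'Wed'/'Wee' (scaled to 1.0) to floats."""
    if "Exit" not in flags:
        return None                                   # not considered for exit
    if "Guard" in flags:
        return consensus_weight * bw_weights["Wed"]   # Exit+Guard relay
    return consensus_weight * bw_weights["Wee"]       # Exit-only relay

# With Wee=1.0 and Wed=0.4272 (the example values quoted above), an
# Exit+Guard relay with weight 1000 counts as 427.2 for the exit position.
```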

Yep, all three comments can be addressed pretty easily.

Cool!

Actually stem can parse the server descriptors now. So this wouldn't be hard at all. I can teach the script to use stem for both families and advertised bandwidths.

Okay. I'd think that advertised bandwidths are higher priority than families though, and I'm not entirely sure how complex it is to extract families from server descriptors. I think we only consider N relays to be in the same family if they all reference the N-1 others as family members. If a relay is not contained in a family, it constitutes a family of size 1. Doesn't sound impossible to implement, but who knows.
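The mutual-reference rule could be sketched like this (hypothetical helper; `family_refs` maps a fingerprint to the set of fingerprints it declares as family):

```python
def mutual_families(family_refs):
    """Group relays into families where every member lists all the others.

    Relays without a fully mutual family become families of size 1."""
    families = []
    assigned = set()
    for relay, declared in family_refs.items():
        if relay in assigned:
            continue
        # candidate family: the relay plus everyone it declares
        candidate = {relay} | declared
        mutual = all(
            member in family_refs
            and (candidate - {member}) <= family_refs[member]
            for member in candidate
        )
        group = candidate if mutual else {relay}
        families.append(group)
        assigned |= group
    return families
```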

### comment:36 in reply to:  35 Changed 7 years ago by gsathya

Status: needs_revision → needs_review

Done, except for the families - https://github.com/gsathya/metrics-tasks/commits/master

### comment:37 follow-up:  38 Changed 7 years ago by karsten

Status: needs_review → needs_revision

Neat! I took a quick look at your branch and plan to take a closer look tomorrow. But for now, I have two comments:

• The stem sources shouldn't be checked into the metrics-tasks repository. Can you remove the commit adding stem and replace it with instructions for including stem's sources?
• The parsing code for Wed, Wee, etc. is a rather dirty hack, because it depends on the position of weights in the line. I'm not even sure if weights have been added in the past years, so this is very likely to break.

As I said, I'm going to do a more detailed review tomorrow, and then I'll run your code on a few gigs of data. Looking forward to that! :) Thanks!

### comment:38 in reply to:  37 Changed 7 years ago by gsathya

• The stem sources shouldn't be checked into the metrics-tasks repository. Can you remove the commit adding stem and replace it with instructions for including stem's sources?

Right now it's an ugly hack to include stem: since we need stem/stem and not stem/, we can't use git submodules -

```
#!/bin/bash
# Fetch stem and keep only its Python package directory (stem/stem).
git clone git://git.torproject.org/stem.git
cp -r stem/stem stem1
rm -rf stem/
mv stem1 stem
```

Do we want to do this or just include stem in metrics-lib?

• The parsing code for Wed, Wee, etc. is a rather dirty hack, because it depends on the position of weights in the line. I'm not even sure if weights have been added in the past years, so this is very likely to break.

I've made it better now. Let me know what you think of this - https://github.com/gsathya/metrics-tasks/commit/c87685dd83da7e8abc469d9306b5dfe4d2dfd05d
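For reference, a position-independent way to parse the footer line is to key on the weight names rather than their order (a sketch; I haven't checked that this is what the linked commit does; the integer values are scaled by 10000 in the consensus):

```python
def parse_bandwidth_weights(line):
    """Parse a consensus 'bandwidth-weights' footer line into a dict,
    keyed by weight name so that weights added or removed in other
    consensus versions don't break the parser."""
    if not line.startswith("bandwidth-weights "):
        raise ValueError("not a bandwidth-weights line")
    weights = {}
    for entry in line.split()[1:]:
        key, _, value = entry.partition("=")
        weights[key] = int(value) / 10000.0   # values are scaled by 10000
    return weights

line = "bandwidth-weights Wed=4272 Wee=10000 Wgd=2864 Wgg=6446"
# parse_bandwidth_weights(line)["Wee"] is 1.0
```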

As I said, I'm going to do a more detailed review tomorrow, and then I'll run your code on a few gigs of data. Looking forward to that! :) Thanks!

Great =)

### comment:39 Changed 7 years ago by atagar

Right now, it's a ugly hack to include stem, since we need stem/stem and not stem/, we can't use git submodules

A quick search suggests that you indeed can't do this with git submodules. The proper solution is for us to add a setup.py to stem and install it as a library. Barring that, if you just want a local copy then we can do something like what I did in arm to fetch its torctl dependency...

### comment:40 follow-up:  41 Changed 7 years ago by karsten

I'm using gsathya's hack to include stem for now. I'm fine to use whatever solution you come up with as long as it doesn't involve duplicating code. Well, if we can avoid installing things, that'd be good, too. Is there an equivalent to .jar files in Python? Anyway, that's a more general discussion for deploying stem, so this is probably the wrong place for that discussion.

Thanks for the fix in c87685d. Looks good.

I rebased your branch and threw out the commit that added stem. I also made a few more fixes, mostly because we don't have a good code review tool and because I wanted to run the script today, rather than do another review round. I pushed the fixes to my public repo, branch gsathya_master. Can you check if you like them before I push to the official repo? Thanks!

### comment:41 in reply to:  40 ; follow-ups:  42  43 Changed 7 years ago by gsathya

Status: needs_revision → needs_information

I rebased your branch and threw out the commit that added stem. I also made a few more fixes, mostly because we don't have a good code review tool and because I wanted to run the script today, rather than do another review round. I pushed the fixes to my public repo, branch gsathya_master. Can you check if you like them before I push to the official repo? Thanks!

Yeah, this looks good. Thanks for the fixes!

Is there a next step to this ticket? From my notes from the Tor dev meeting, we still need to:

1. Make this runnable every 3 hours
2. Make it read files in subdirs
3. Find different metrics:
   1. Legislative diversity - based on countries
   2. Organizational diversity - based on family
   3. Financial diversity - ??
   4. Physical location diversity - ??
   5. Network diversity - subnets
4. Give weights to every metric

Looks like #6443 is done(?), are there other things that arma might find interesting?

### comment:42 in reply to:  41 ; follow-up:  47 Changed 7 years ago by karsten

Yeah, this looks good. Thanks for the fixes!

Great! Merging.

Is there a next step to this ticket? From my notes from the Tor dev meeting, we still need to -

1. Make this runnable every 3 hours

Right. The goal here would be to add graphs to the metrics website. The script would first rsync consensuses and server descriptors from metrics and then process only the new consensuses to update the local .csv file. metrics-web would then use the .csv file to generate graphs based on user input.

2. Make it read files in subdirs

That's for consensus files, right? We don't really need that feature anymore.

But I'm not sure if my file system will like me anymore if I move a few years of server descriptors into a single directory without sub directories. I'll try to find out.

3. Find different metrics:
   1. Legislative diversity - based on countries
   2. Organizational diversity - based on family
   3. Financial diversity - ??
   4. Physical location diversity - ??
   5. Network diversity - subnets

Right. We already have legislative and network diversity, and organizational diversity is doable. I don't see how we would come up with financial or physical diversity metrics. I'd say it's safe to ignore those.

But let's step back: all our metrics consist of three components:

• (1a) based on consensus weights vs. (1b) based on advertised bandwidth
• weighting bandwidth values by (2a) guard or (2b) exit position, or (2c) considering unweighted bandwidth
• aggregating nodes by (3a) country, (3b) AS, (3c) family, or (3d) comparing single nodes

That's 2 * 3 * 4 = 24 possible combinations. We have implemented five of them. For example, in arma's first comment on #6443 he's asking for advertised bandwidths in the exit position for single relays. We don't have those numbers yet. Want to add the remaining 19 combinations, each of them with entropy and max entropy?

(Note that we don't have a single metric based on advertised bandwidth yet.)
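The 24 combinations can be enumerated mechanically, e.g.:

```python
from itertools import product

bandwidth_sources = ["consensus weight", "advertised bandwidth"]         # 1a/1b
position_weightings = ["guard-weighted", "exit-weighted", "unweighted"]  # 2a-2c
aggregations = ["country", "AS", "family", "single relay"]               # 3a-3d

combinations = list(product(bandwidth_sources, position_weightings, aggregations))
# len(combinations) == 2 * 3 * 4 == 24
```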

4. Give weights to every metric

What does that mean?

Looks like #6443 is done(?), are there other things that arma might find interesting?

#6443 is not done yet, because we should implement the metric in comment 1.

### comment:43 in reply to:  41 Changed 7 years ago by asn

Parent ID: → #6460

I rebased your branch and threw out the commit that added stem. I also made a few more fixes, mostly because we don't have a good code review tool and because I wanted to run the script today, rather than do another review round. I pushed the fixes to my public repo, branch gsathya_master. Can you check if you like them before I push to the official repo? Thanks!

Yeah, this looks good. Thanks for the fixes!

Is there a next step to this ticket? From my notes from the Tor dev meeting, we still need to -

1. Make this runnable every 3 hours
2. Make it read files in subdirs
3. Find different metrics:
    1. Legislative diversity - based on countries
    2. Organizational diversity - based on Family
    3. Financial diversity - ??
    4. Physical location diversity - ??
    5. Network diversity - Subnets
4. Give weights to every metric

Looks like #6443 is done(?), are there other things that arma might find interesting?

I made #6460 for metrics research on measuring the safety of the Tor network.

### comment:44 Changed 7 years ago by atagar

Here's a bit of an after-the-fact code review (I'm just looking at the stem usage since I have a strong interest in making that better on my end).

```
27 from stem.descriptor.server_descriptor import RelayDescriptor, BridgeDescriptor
```

Import is currently unused.

```
66 desc_iter = stem.descriptor.server_descriptor.parse_file(StringIO.StringIO(data))
67 desc_entries = list(desc_iter)
68 desc = desc_entries[0]
```

Does data include multiple descriptors or annotations (such as from cached-descriptors)? If not then there's no reason to do this. The parse_file() function is to make parsing descriptor _files_ easier. If you just have a string representing a single relay then this should be equivalent to...

```
desc = RelayDescriptor(data)
```

### comment:45 follow-up:  48 Changed 7 years ago by karsten

asn, thanks for creating #6460. Will soon add more child tickets to it.

atagar, gsathya, once you have a patch for using stem correctly, I'll apply it. Please let me know.

gsathya, arma, I realized yesterday that our usage of a single GeoIP and a single AS database is flawed. We totally ignore the fact that an IP address might have belonged to a different country or AS a year ago. What we should do is use different databases depending on the valid-after time of the consensus we're evaluating. After all, there's a reason why these databases are updated. Note that this is a general problem of metrics.tpo, not one that is specific to this analysis. Changing GeoIP databases were the reason why we gave up including country codes in sanitized bridge descriptors, and they're the reason why the relays-by-country graph still works with a static, year-old GeoIP database. Just mentioning the problem here. Maybe it's time to solve it. Hmmmmm.

### comment:46 Changed 7 years ago by karsten

Status: needs_information → needs_revision

New graphs available. For reference, these graphs kept my local (tiny) number cruncher busy for 11 hours and 30 minutes.

Setting to needs_revision for the improvement ideas stated above.

### comment:47 in reply to:  42 ; follow-up:  52 Changed 7 years ago by gsathya

Is there a next step to this ticket? From my notes from the Tor dev meeting, we still need to -

1. Make this runnable every 3 hours

Right. The goal here would be to add graphs to the metrics website. The script would first rsync consensuses and server descriptors from metrics and then process only the new consensuses to update the local .csv file. metrics-web would then use the .csv file to generate graphs based on user input.

Cool. I think atagar mentioned that stem can keep track of read consensus files, I'll take a look at this now.

1. Make it read files in subdirs

That's for consensus files, right? We don't really need that feature anymore.

Yeah.

But I'm not sure if my file system will like me anymore if I move a few years of server descriptors into a single directory without sub directories. I'll try to find out.

Ok.

1. Find different metrics:
    1. Legislative diversity - based on countries
    2. Organizational diversity - based on Family
    3. Financial diversity - ??
    4. Physical location diversity - ??
    5. Network diversity - Subnets

Right. We already have legislative and network diversity, and organizational diversity is doable. I don't see how we would come up with financial or physical diversity metrics. I'd say it's safe to ignore those.

Ok.

But let's step back: all our metrics consist of three components:

• (1a) based on consensus weights vs. (1b) based on advertised bandwidth
• weighting bandwidth values by (2a) guard or (2b) exit position, or (2c) considering unweighted bandwidth
• aggregating nodes by (3a) country, (3b) AS, (3c) family, or (3d) comparing single nodes

That's 2 * 3 * 4 = 24 possible combinations. We have implemented five of them. For example, in arma's first comment on #6443 he's asking for advertised bandwidths in the exit position for single relays. We don't have those numbers yet. Want to add the remaining 19 combinations, each of them with entropy and max entropy?

Looks like #6443 is using code from #5755, which is all Java. That would mean I'd have to rewrite it in py or just continue in Java which would essentially mean you'd have to rewrite all my Java code again to make it work ;). Should I pick some other metrics ticket?

(Note that we don't have a single metric based on advertised bandwidth yet.)

Yes, I was wondering about this. Why are we using the bandwidth and not the advertised bandwidth?

1. Give weights to every metric

What does that mean?

It means giving different weights to these metrics while calculating the total entropy -

1. Legislative diversity - based on countries
2. Organizational diversity - based on Family
3. Financial diversity - ??
4. Physical location diversity - ??
5. Network diversity - Subnets

### comment:48 in reply to:  45 Changed 7 years ago by gsathya

atagar, gsathya, once you have a patch for using stem correctly, I'll apply it. Please let me know.

We can't use

```
desc = RelayDescriptor(data)
```

because data contains annotations. I've already mentioned this to atagar on IRC and he said he'd add a convenience method to make it easier/nicer.

gsathya, arma, I realized yesterday that our usage of a single GeoIP and a single AS database is flawed. We totally ignore the fact that an IP address might have belonged to a different country or AS a year ago. What we should do is use different databases depending on the valid-after time of the consensus we're evaluating.

Yikes, keeping multiple GeoIP dbs sounds like a PITA.

### comment:49 Changed 7 years ago by atagar

I've already mentioned this to atagar on IRC and he said he'd add a convenience method to make it easier/nicer.

Correction, I said there were multiple options...

• Make the ServerDescriptor class allow for annotations rather than processing them via the parse_file() function. It was originally designed this way because only cached-descriptors had annotations (so metrics descriptors could have their raw string passed like I proposed). However, that has since changed.
• Add a convenience method to stem.descriptor.reader to parse a single descriptor from a single file. This would be trivial to do and is probably the way to go.

I made suggestions, but I didn't say that *I'd* do it, though I might later if no one else does. Patches welcome. :)

/me goes poof now to head to defcon

### comment:50 Changed 7 years ago by asn

Karsten, next time you decide to burn your computer for 12 hours can you also merge my `bug6232_fair_net` branch (https://gitorious.org/metrics-tasks/metrics-tasks/commit/53b2c25186db9fd64cb0b2b210c042adb5838ae9)?

It generates yet another graph, which compares the number of relays in the current Tor network with the number of relays in a fair Tor network with the same bandwidth-weight entropy (by fair I mean a Tor network where all relays have the same bandwidth weights).
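If I read the branch description right, the comparison rests on the equivalence noted in the ticket description: a consensus with entropy b bits is as diverse as a fair network of 2^b equally weighted relays. A rough sketch of that conversion (plain Python with made-up weights, not asn's actual code):

```python
import math

def fair_network_size(weights):
    """Number of equally weighted relays with the same entropy as `weights`.

    Shannon entropy of b bits corresponds to a fair network of 2**b relays.
    """
    total = sum(weights)
    b = -sum((w / total) * math.log2(w / total) for w in weights if w > 0)
    return 2 ** b

# A heavily skewed 3-relay network is "worth" far fewer than 3 fair relays.
print(fair_network_size([800, 100, 100]))  # roughly 1.9
```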

BTW, I made #6466 to analyse the results of this ticket, so that we don't bloat this ticket with analysis.

### Changed 7 years ago by karsten

Number of relays and number of fair relays

### Changed 7 years ago by karsten

Raw data required for making graphs

### comment:51 Changed 7 years ago by karsten

asn, I merged your branch (after rebasing it) and tweaked your graph a bit. Also, it doesn't take 12 hours to plot graphs, but to extract the data for them. I attached the entropy.csv file that you need for making graphs. If you come up with more graphs, let me know, and I'll merge your code.

### comment:52 in reply to:  47 ; follow-ups:  53  54 Changed 7 years ago by karsten

Cool. I think atagar mentioned that stem can keep track of read consensus files, I'll take a look at this now.

Neat!

That's 2 * 3 * 4 = 24 possible combinations. We have implemented five of them. For example, in arma's first comment on #6443 he's asking for advertised bandwidths in the exit position for single relays. We don't have those numbers yet. Want to add the remaining 19 combinations, each of them with entropy and max entropy?

Looks like #6443 is using code from #5755, which is all Java. That would mean I'd have to rewrite it in py or just continue in Java which would essentially mean you'd have to rewrite all my Java code again to make it work ;).

Ah, I meant implementing the 24 possible combinations in your #6232 code, not in #6443. But you're right, we should merge the code of both tickets at some point, ideally rewriting #6443 in Python. The two scripts are doing the same calculations, but outputting different values (entropy values vs. top bandwidths). I'm not sure yet which values #6443 should provide, particularly with respect to the 24 combinations; that's a huge amount of data. I'd say let's wait a bit until we're clearer about #6443, and then we integrate the code into #6232.

Should I pick some other metrics ticket?

Want to look into #6471?

Yes, I was wondering about this. Why are we using the bandwidth and not the advertised bandwidth?

Because clients use consensus weights to make path-selection decisions. These consensus weights are measured and voted on by bandwidth authorities, unlike the self-reported advertised bandwidths. That's why we're interested in advertised bandwidths: what would all these fine metrics look like if we had no bandwidth authorities measuring the network?

It means give different weights to these metrics while calculating the total entropy -

1. Legislative diversity - based on countries
2. Organizational diversity - based on Family
3. Financial diversity - ??
4. Physical location diversity - ??
5. Network diversity - Subnets

I still don't understand. Would such a metric consist of 30% legislative diversity, 20% organizational diversity, etc.? I think we should compare these metrics separately, not combine them into a single number that nobody can interpret.

### comment:53 in reply to:  52 Changed 7 years ago by gsathya

I'd say let's wait a bit until we're clearer about #6443, and then we integrate the code into #6232.

Ok

Should I pick some other metrics ticket?

Want to look into #6471?

Aha. Okay

I still don't understand. Would such a metric consist of 30% legislative diversity, 20% organizational diversity, etc.? I think we should compare these metrics separately, not combine them into a single number that nobody can interpret.

Well, this was supposed to give an appropriate answer when deciding between sponsoring relays in various countries (e.g., a relay in America vs. a relay in India). Legislatively it would be better to have it in India, but financially not. So, depending on how we assign the weights, the answer would change.

### comment:54 in reply to:  52 Changed 7 years ago by arma

Would such a metric consist of 30% legislative diversity, 20% organizational diversity, etc.? I think we should compare these metrics separately, not combine them into a single number that nobody can interpret.

Agree. If we try to mush them together, we'll constantly be wondering which component caused the graph to change like that. Better to see n graphs, one for each component.

### comment:55 Changed 7 years ago by asn

Attached some new graphs. Exit nodes seem more "diverse" now, which is probably a result of the fast-exit campaign.

### comment:57 Changed 7 years ago by karsten

Status: needs_revision → needs_information

I wanted to work on this ticket today, but I'm unclear what the goal is. I'd want to avoid spending hours or days on coding something that ends up as yet another Trac attachment. Is the goal to answer one or more research questions that we haven't answered yet (if so, which ones?), or is the goal to find graphs that we should automate and make available on the metrics website (if so, which ones?)? Or did we already accomplish the ticket goal "Make entropy-over-time graph", so that we can close this ticket?

### comment:58 Changed 2 years ago by karsten

Resolution: → wontfix
Status: needs_information → closed

Closing tickets in Metrics/Analysis that have been created 5+ years ago and not seen progress recently, except for the ones that "nickm-cares" about.
