Opened 9 years ago

Closed 8 years ago

#2680 closed task (implemented)

present bridge usage data so researchers can focus on the math

Reported by: arma Owned by: karsten
Priority: Medium
Component: Metrics/CollecTor

Description

Right now the process of learning how to parse bridge consensus files and bridge descriptor files, match up which descriptors go with which consensus line, figure out which bridges were Running when, and so on is too burdensome -- researchers who want to analyze bridge reachability are giving up before they even get to the part they signed up for.

The questions they will ask are "what bridges are up today? what data are they reporting?" and "what data did this bridge report each day?"

Karsten speculates:
I guess it would be a text file with each line containing: sanitized relay fingerprint, seen as relay before, bridge stats end, bridge stats seconds, connecting users by country, pool assignments. 1 line per published bridge descriptor.
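A single line in that speculated format might look like this (every value here is invented for illustration, including the fingerprint):

```
0011223344556677,FALSE,2011-01-15 12:34:56,86400,us=24 de=16 ir=8,email
```

That is: sanitized fingerprint, whether the fingerprint was seen as a relay before, bridge-stats end time, bridge-stats interval in seconds, connecting users by country, and pool assignment.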

Once we have this data available, it would also be interesting to graph a few histories of individual bridges, to see if there's any intuition to be gained from the more popular ones, or to learn whether only a few bridges contribute most of our usage stats.

Child Tickets

Change History (6)

comment:1 Changed 9 years ago by karsten

Status: new → assigned

Here's my first attempt for presenting bridge usage data in a way that is more useful to researchers:

We have (at least) four data sources that are relevant for analyzing bridge usage:

  1. Bridge descriptors: Bridges publish bridge descriptors to the bridge authority at least once every 18 hours.
  2. Bridge network statuses: The bridge authority forms an opinion on all bridges that published a descriptor recently, decides whether it considers them as running, and writes these opinions to a bridge network status document every 30 minutes.
  3. BridgeDB pool assignments: BridgeDB learns about currently running bridges from the bridge authority and allocates these bridges to distributors like email or https or keeps them unallocated for manual distribution.
  4. Relay consensuses: The directory authorities vote on running relays (not bridges) every hour and publish a network status consensus. If a bridge uses the same identity key that it also used as a relay, it might observe more users than it would observe as a pure bridge. Therefore, bridges that have been running as relays before should be excluded from bridge statistics.

When Roger and I talked about this idea on IRC, I thought that we could merge data from these 4 sources into a single file. Let's step back. We should start with 4 data formats that are easier to parse than the current data sources and let researchers assemble the files themselves. We can discuss merging these 4 data formats into 1 at a later time.

I wrote two Java programs to parse the data on the metrics website and generate 3 of these 4 data formats. (We're still in the process of patching BridgeDB to dump its pool assignments to a file for the 3rd data source in the list above. Once we're done with that, I'll write another Java program to provide the 4th data format.) I can integrate these programs into metrics-db and provide these formats on a daily basis, but before doing so, I'd like to know whether the formats are useful to people at all.

I uploaded a tarball of the three new data formats for January 2011 (39M). The source code to transform our standard tarballs into the new data formats plus a more detailed description of the data formats is in the metrics-tasks repository.

I'm going to make the 3rd data format (BridgeDB pool allocations) for January 2011 available as soon as I have it (hopefully in a week from now).

Also, I'm going to ignore the research questions listed in the ticket description above and let others answer them.

comment:2 Changed 9 years ago by karsten

Now that we have bridge pool assignments available, I wrote another Java program to parse them and make them easier to process. The new source code is in the metrics-tasks repository. I also updated the tarball with the data formats for January 2011 (now: 43M).

Does this mean this ticket is solved?

comment:3 Changed 9 years ago by arma

The "fingerprint" and "descriptor" in statuses.csv are always the same. I think you're printing "fingerprint" for both of them?

I think the next step is to write a short overview of how to reconstruct these files to answer some research question. For example, say I want to get a list of all the countries that a given bridge has seen over time. I guess I want to iterate over all bridge fingerprints -- should I use the list of all fingerprints I find in statuses.csv or in descriptors.csv -- should they be the same?

So step zero, given a fingerprint, is to look it up in relays.csv and make sure it's not there. If it is, either ignore it or if we want to get fancier, ignore data from it close to the time it's in the relay list.

Step one is to look it up in statuses.csv, get a set of descriptor hashes, discard all the ones whose third-to-last value is not TRUE, and skip duplicate hashes.

Then step two is to take those remaining descriptor hashes and look them up in descriptors.csv, at which point I can learn which countries they saw unless the countries are all NA in which case we don't have data?

And the optional step three is to take the timestamp from the status file and look up the fingerprint in assignments.csv to decide if it's http, email, or unassigned?
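The four steps above can be sketched in Python. Everything here is an assumption for illustration: the column names (`fingerprint`, `descriptor`, `running`, `countries`, `distributor`) and the inline sample data are invented (arma's "third-to-last value" is modeled as a named `running` column); the real layouts are the ones described in the README in metrics-tasks.

```python
import csv
import io

# Tiny invented samples standing in for the real CSV files.
relays_csv = "fingerprint\nAAAA\n"
statuses_csv = (
    "status,fingerprint,descriptor,running\n"
    "2011-01-01-00-00,BBBB,d1,TRUE\n"
    "2011-01-01-00-30,BBBB,d1,TRUE\n"   # duplicate descriptor hash
    "2011-01-01-01-00,BBBB,d2,FALSE\n"  # not Running, so discarded
)
descriptors_csv = (
    "descriptor,fingerprint,countries\n"
    "d1,BBBB,us=8 de=4\n"
    "d2,BBBB,NA\n"
)
assignments_csv = "fingerprint,distributor\nBBBB,email\n"

def rows(text):
    return list(csv.DictReader(io.StringIO(text)))

def countries_seen(fingerprint):
    # Step zero: skip bridges that have been seen as relays.
    if any(r["fingerprint"] == fingerprint for r in rows(relays_csv)):
        return None
    # Step one: descriptor hashes from status entries where the bridge was
    # Running; using a set skips duplicate hashes.
    descs = {r["descriptor"] for r in rows(statuses_csv)
             if r["fingerprint"] == fingerprint and r["running"] == "TRUE"}
    # Step two: look up those descriptors; NA means no usage data.
    seen = set()
    for r in rows(descriptors_csv):
        if r["descriptor"] in descs and r["countries"] != "NA":
            seen.update(c.split("=")[0] for c in r["countries"].split())
    # Step three (optional): which distributor pool the bridge sits in.
    pool = next((r["distributor"] for r in rows(assignments_csv)
                 if r["fingerprint"] == fingerprint), "unassigned")
    return seen, pool
```

Note that karsten's reply below argues that step one (filtering on statuses.csv) is not meaningful for usage-by-country questions.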

comment:4 in reply to:  3 Changed 9 years ago by karsten

Replying to arma:

The "fingerprint" and "descriptor" in statuses.csv are always the same. I think you're printing "fingerprint" for both of them?

Oops, fixed.

> I think the next step is to write a short overview of how to reconstruct these files to answer some research question.

See the new Section 3 of the README and the new R file analysis.R in task-2680.

> For example, say I want to get a list of all the countries that a given bridge has seen over time. I guess I want to iterate over all bridge fingerprints -- should I use the list of all fingerprints I find in statuses.csv or in descriptors.csv -- should they be the same?

If you want to learn about usage by country, you should only look at descriptors.csv, not at statuses.csv. The data in bridge network statuses and the data in extra-info descriptors are not tightly connected (even though one can link them via the bridge's descriptor identifier). A bridge is free to write anything in its extra-info descriptor, including bridge statistics that are a few days old; that has nothing to do with the bridge authority considering the bridge running at some later time.

I added a note to the README.

> So step zero, given a fingerprint, is to look it up in relays.csv and make sure it's not there. If it is, either ignore it or if we want to get fancier, ignore data from it close to the time it's in the relay list.

Correct. For the metrics graphs we remove all bridges that have ever been seen as relays, because even with a time distance of one week we saw unrealistic usage numbers that I couldn't explain otherwise. If someone wants to investigate this further, I'd be happy to learn whether we can do something smarter.

> Step one is to look it up in statuses.csv, get a set of descriptor hashes, discard all the ones whose third-to-last value is not TRUE, and skip duplicate hashes.

See above. Removing descriptors of non-running bridges is not meaningful here.

> Then step two is to take those remaining descriptor hashes and look them up in descriptors.csv, at which point I can learn which countries they saw unless the countries are all NA in which case we don't have data?

NA means no data, right.

> And the optional step three is to take the timestamp from the status file and look up the fingerprint in assignments.csv to decide if it's http, email, or unassigned?

The timestamps of the assignments and the timestamps of the bridge network statuses do not necessarily match precisely. But BridgeDB does not reassign bridges between distributors (yet), so there's no need to compare timestamps here.

I think that the example in analysis.R helps clarify things a bit.
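Putting these corrections together, a per-bridge usage lookup reduces to reading descriptors.csv directly, skipping bridges seen as relays, and looking up the pool without timestamp matching. As before, this is only a sketch: the column names and sample data are invented, not the actual file layout.

```python
import csv
import io

# Invented samples standing in for the real CSV files.
relays_csv = "fingerprint\nAAAA\n"
descriptors_csv = (
    "descriptor,fingerprint,stats_end,countries\n"
    "d1,BBBB,2011-01-01 12:00:00,us=8 ir=4\n"
    "d2,BBBB,2011-01-02 12:00:00,NA\n"  # NA: this descriptor carries no usage data
)
assignments_csv = "fingerprint,distributor\nBBBB,https\n"

def rows(text):
    return list(csv.DictReader(io.StringIO(text)))

def country_history(fingerprint):
    # Bridges that have been seen as relays are excluded entirely.
    if any(r["fingerprint"] == fingerprint for r in rows(relays_csv)):
        return None
    # Read usage straight from descriptors.csv; no statuses.csv filtering.
    history = [(r["stats_end"], r["countries"])
               for r in rows(descriptors_csv)
               if r["fingerprint"] == fingerprint and r["countries"] != "NA"]
    # No timestamp matching needed for pools: BridgeDB does not (yet)
    # reassign bridges between distributors.
    pool = next((r["distributor"] for r in rows(assignments_csv)
                 if r["fingerprint"] == fingerprint), "unassigned")
    return history, pool
```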

comment:5 Changed 9 years ago by karsten

Does silence mean everyone's happy and we can close the ticket?

comment:6 Changed 8 years ago by karsten

Resolution: implemented
Status: assigned → closed

Closing as implemented. Please re-open or create a new ticket if required.
