Opened 5 years ago

Last modified 16 months ago

#10680 assigned enhancement

Provide more statistics on current public bridges

Reported by: sysrqb Owned by: metrics-team
Priority: Medium Milestone:
Component: Metrics/Statistics Version:
Severity: Normal Keywords: bridgedb-parsers
Cc: asn, lunar, karsten, isis, yawning Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

It would be very useful to understand the attributes of the current public bridges, most importantly the platforms they run on, the available pluggable transports, and the Tor versions in use. We should use the sanitized descriptors to obtain this information. Do we want anything else?

Child Tickets

Attachments (3)

bridges.csv (360 bytes) - added by sysrqb 5 years ago.
Version 1 bridges.csv
bridges.2.csv (343 bytes) - added by sysrqb 5 years ago.
Counts for all bridges and ec2bridges
bridges.csv_all (48.3 KB) - added by sysrqb 5 years ago.
Counts from the last three days


Change History (23)

comment:1 Changed 5 years ago by karsten

Cc: karsten added

I wonder, should we use the output of your script to complement the servers.csv file provided by metrics-web? There are a few requirements for that, though:

  • Allow the script to be run on a periodically updated local directory containing recent descriptors, for example, by running rsync -arz --delete --exclude 'relay-descriptors/votes' metrics.torproject.org::metrics-recent in.
  • Remove all bridges that didn't have the Running flag in a bridge network status. Only include server descriptors referenced from bridge network statuses, and only include extra-info descriptors referenced from server descriptors.
  • Add the number of bridges in the EC2 cloud, that is, bridges whose nickname starts with "ec2bridger".
  • Produce a bridges.csv output file similar to servers.csv that can be merged with a new relays.csv (produced by current metrics-web) into the new servers.csv.
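
Put together, the requirements above can be sketched in Python. The record layout, nicknames, and column names below are hypothetical stand-ins for data parsed from the sanitized descriptors, not the actual sanitized format:

```python
import csv
import io

# Hypothetical stand-in for records parsed from sanitized bridge
# network statuses: (nickname, flags, platform).
statuses = [
    ("ec2bridger42", {"Running"}, "Linux"),
    ("noname", {"Running"}, "Windows"),
    ("staleone", set(), "Linux"),  # no Running flag, so excluded
]

# Keep only bridges that had the Running flag in a status.
running = [b for b in statuses if "Running" in b[1]]

# Count EC2 cloud bridges by nickname prefix.
ec2_count = sum(1 for nick, _, _ in running if nick.startswith("ec2bridger"))

# Tally platforms and write a bridges.csv-style table.
platforms = {}
for _, _, platform in running:
    platforms[platform] = platforms.get(platform, 0) + 1

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["platform", "bridges"])
for platform, count in sorted(platforms.items()):
    writer.writerow([platform, count])
writer.writerow(["ec2", ec2_count])
```

The real script would of course fill `statuses` from descriptors in the rsync'ed directory rather than from a literal list.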

If you're interested in writing such a script, I'd want to run it on yatei and write the glue code to include your results on the metrics website including making new graphs visualizing the new data.

comment:2 Changed 5 years ago by asn

Might not be the correct ticket for this, but some BridgeDB stats might also be interesting. Like the distribution of bridges into pools, the number of requests per day, etc.

comment:3 in reply to: 2 Changed 5 years ago by karsten

Replying to asn:

Might not be the correct ticket for this, but some BridgeDB stats might also be interesting. Like the distribution of bridges into pools, the number of requests per day, etc.

Please create a ticket for this, and I'll reply there.

comment:4 in reply to: 3 Changed 5 years ago by karsten

Replying to karsten:

Replying to asn:

Might not be the correct ticket for this, but some BridgeDB stats might also be interesting. Like the distribution of bridges into pools, the number of requests per day, etc.

Please create a ticket for this, and I'll reply there.

Or wait, let me take that back. I'll reply here, because maybe the two things are related.

Distribution of bridges into pools could be made a part of the statistics produced here. The script would have to parse sanitized bridge pool assignments in addition to sanitized bridge descriptors.

Number of requests per day isn't something that BridgeDB exports right now. If we want that data, we'll have to specify a file format.
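
Counting pool assignments could be sketched like this. The sample lines and their field layout are assumptions about the sanitized bridge pool assignment format, used only for illustration:

```python
# Hypothetical excerpt of a sanitized bridge-pool-assignment file:
# a timestamp header line, then one line per hashed fingerprint with
# the pool name as the second field. The exact format is an assumption.
sample = """bridge-pool-assignment 2014-01-28 12:00:00
0011AA https ring=3
0022BB email transport=obfs3
0033CC unallocated
"""

pools = {}
for line in sample.splitlines():
    if line.startswith("bridge-pool-assignment"):
        continue  # header line only carries the assignment timestamp
    parts = line.split()
    if len(parts) >= 2:
        pool = parts[1]  # assumed: second field names the pool
        pools[pool] = pools.get(pool, 0) + 1
```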

comment:5 in reply to: 1 Changed 5 years ago by sysrqb

Replying to karsten:

I wonder, should we use the output of your script to complement the servers.csv file provided by metrics-web? There are a few requirements for that, though:

I think this would be great! I think the only data point we can't supply is the country of the bridge, but that's not a huge loss. I can definitely adapt the script to produce the necessary info and output to csv. We might actually want more information, so I might make this the default output (or produce the csv by providing a command line option). Another option is to create a script specifically for this and create a second script that produces a superset of metrics/bridge attributes.

  • Allow the script to be run on a periodically updated local directory containing recent descriptors, for example, by running rsync -arz --delete --exclude 'relay-descriptors/votes' metrics.torproject.org::metrics-recent in.
  • Remove all bridges that didn't have the Running flag in a bridge network status. Only include server descriptors referenced from bridge network statuses, and only include extra-info descriptors referenced from server descriptors.
  • Add the number of bridges in the EC2 cloud, that is, bridges whose nickname starts with "ec2bridger".
  • Produce a bridges.csv output file similar to servers.csv that can be merged with a new relays.csv (produced by current metrics-web) into the new servers.csv.

None of these should be a problem.

If you're interested in writing such a script, I'd want to run it on yatei and write the glue code to include your results on the metrics website including making new graphs visualizing the new data.

You'll accept a Python script? :) I can write it in Java, if you prefer, though.

comment:6 in reply to:  4 Changed 5 years ago by sysrqb

Replying to karsten:

Replying to karsten:

Replying to asn:

Might not be the correct ticket for this, but some BridgeDB stats might also be interesting. Like the distribution of bridges into pools, the number of requests per day, etc.

Please create a ticket for this, and I'll reply there.

Or wait, let me take that back. I'll reply here, because maybe the two things are related.

Distribution of bridges into pools could be made a part of the statistics produced here. The script would have to parse sanitized bridge pool assignments in addition to sanitized bridge descriptors.

I don't think this would be too difficult, but I don't think it will be a very interesting graph. Maybe some people will find it useful.

Number of requests per day isn't something that BridgeDB exports right now. If we want that data, we'll have to specify a file format.

Right, we'll need to specify a format, and we'll also need to decide what is contained in a sanitized BridgeDB log. Once we decide on that, we should be able to provide metrics such as the total number of requests, requests for a specific transport, requests in a specific language, etc., which may help us help the users.

comment:7 in reply to: 5 Changed 5 years ago by karsten

Replying to sysrqb:

Replying to karsten:

I wonder, should we use the output of your script to complement the servers.csv file provided by metrics-web? There are a few requirements for that, though:

I think this would be great! I think the only data point we can't supply is the country of the bridge, but that's not a huge loss.

Right, we'd have to include country codes in sanitized bridge descriptors for that. But we also don't have country codes of relays these days, so that's fine. Future work.

I can definitely adapt the script to produce the necessary info and output to csv. We might actually want more information, so I might make this the default output (or produce the csv by providing a command line option). Another option is to create a script specifically for this and create a second script that produces a superset of metrics/bridge attributes.

What additional information would you want to include? Maybe we can extend the CSV file format? In theory, the format should allow pretty much everything you'd want to include in a graph.

  • Allow the script to be run on a periodically updated local directory containing recent descriptors, for example, by running rsync -arz --delete --exclude 'relay-descriptors/votes' metrics.torproject.org::metrics-recent in.
  • Remove all bridges that didn't have the Running flag in a bridge network status. Only include server descriptors referenced from bridge network statuses, and only include extra-info descriptors referenced from server descriptors.
  • Add the number of bridges in the EC2 cloud, that is, bridges whose nickname starts with "ec2bridger".
  • Produce a bridges.csv output file similar to servers.csv that can be merged with a new relays.csv (produced by current metrics-web) into the new servers.csv.

None of these should be a problem.

If you're interested in writing such a script, I'd want to run it on yatei and write the glue code to include your results on the metrics website including making new graphs visualizing the new data.

You'll accept a Python script? :) I can write it in Java, if you prefer, though.

Python is fine! If you stick to the requirements above with all input data coming from the rsync'ed directory and all output data being one or more .csv files, then that's all I need to integrate your script into metrics-web. Feel free to start hacking on this in a metrics-tasks.git branch, and we'll move over the result to metrics-web when it's ready.

comment:8 in reply to:  7 Changed 5 years ago by sysrqb

Replying to karsten:

Replying to sysrqb:

Replying to karsten:

I wonder, should we use the output of your script to complement the servers.csv file provided by metrics-web? There are a few requirements for that, though:

I think this would be great! I think the only data point we can't supply is the country of the bridge, but that's not a huge loss.

Right, we'd have to include country codes in sanitized bridge descriptors for that. But we also don't have country codes of relays these days, so that's fine. Future work.

I can definitely adapt the script to produce the necessary info and output to csv. We might actually want more information, so I might make this the default output (or produce the csv by providing a command line option). Another option is to create a script specifically for this and create a second script that produces a superset of metrics/bridge attributes.

What additional information would you want to include? Maybe we can extend the CSV file format? In theory, the format should allow pretty much everything you'd want to include in a graph.

The only metric we (I?) are specifically looking at right now is the number of bridges that correctly configure their ExtOR port. Metrics currently tells us very useful information about PT usage, but we don't actually know how many bridges are providing this information. We may also want other metrics so we can answer more interesting questions after we see these.

You'll accept a Python script? :) I can write it in Java, if you prefer, though.

Python is fine! If you stick to the requirements above with all input data coming from the rsync'ed directory and all output data being one or more .csv files, then that's all I need to integrate your script into metrics-web. Feel free to start hacking on this in a metrics-tasks.git branch, and we'll move over the result to metrics-web when it's ready.

Excellent. I'll start integrating what we've discussed and start committing it to a metrics-tasks repo.

comment:9 Changed 5 years ago by isis

Cc: isis added
Keywords: bridgedb-parsers added

I'm adding the keyword bridgedb-parsers not because I think that this code should live in BridgeDB, but instead because I still think it would be really nice if there was one canonical place to put Python descriptor parsers.

Not that you should feel pressured to do this, since it's a lot of extra work. :)

comment:10 in reply to: 9 Changed 5 years ago by sysrqb

Replying to isis:

I'm adding the keyword bridgedb-parsers not because I think that this code should live in BridgeDB, but instead because I still think it would be really nice if there was one canonical place to put Python descriptor parsers.

No argument from me :) but 1) I don't think we'll have a complete module before this ticket is closed and 2) I don't know how Karsten feels about dependencies in the metrics-tasks repo. I created #10725 for your pleasure and thoughts. :) (now hopefully this ticket won't get derailed too much :)

Not that you should feel pressured to do this, since it's a lot of extra work. :)

comment:11 in reply to: 10 Changed 5 years ago by karsten

Replying to sysrqb:

2) I don't know how Karsten feels about dependencies in the metrics-tasks repo.

Well, dependencies in metrics-web is what I care more about, and that's where this new code is supposed to live eventually, right? For metrics-web, a dependency on stem as Git submodule would be fine (and should be sufficient for parsing bridge descriptors). Also, more generally, dependencies on packages in Debian wheezy or wheezy-backports are fine, too. Oh, and you might add libraries you depend on to metrics-web itself, though that's rather ugly.

comment:12 in reply to:  11 Changed 5 years ago by sysrqb

Replying to karsten:

Replying to sysrqb:

2) I don't know how Karsten feels about dependencies in the metrics-tasks repo.

Well, dependencies in metrics-web is what I care more about, and that's where this new code is supposed to live eventually, right?

True.

For metrics-web, a dependency on stem as Git submodule would be fine (and should be sufficient for parsing bridge descriptors). Also, more generally, dependencies on packages in Debian wheezy or wheezy-backports are fine, too. Oh, and you might add libraries you depend on to metrics-web itself, though that's rather ugly.

Yeah, I wasn't sure how you felt about this. Originally I wrote a custom parser for this script, but I rewrote it and now I'm using Stem.

I'm attaching the first version of the csv. Let me know what you think. The lines are a bit out of order; it shouldn't matter when the csv is parsed, but I can fix it if you want. I'll add support for the EC2 bridges later tonight, hopefully.

Changed 5 years ago by sysrqb

Attachment: bridges.csv added

Version 1 bridges.csv

comment:13 in reply to:  1 Changed 5 years ago by sysrqb

Replying to karsten:

I wonder, should we use the output of your script to complement the servers.csv file provided by metrics-web? There are a few requirements for that, though:

  • Allow the script to be run on a periodically updated local directory containing recent descriptors,

Just to be clear, this script is not expected to handle calculating the average values over the day, correct? This will be handled by glue code before the values are merged into servers.csv?

comment:14 Changed 5 years ago by karsten

Sure, I can write the code to calculate averages when merging your output with relay statistics.

comment:15 Changed 5 years ago by sysrqb

Great! I'm attaching two files: one contains the number of bridges based on a single NS (the last one available when I synced last night); the second contains the bridge counts from every NS from the last three days. I'm not sure which will be easiest for you. I can also add an option to only parse the networkstatuses that were published on a certain day. Right now, with the options I implemented, the script can be run over the entire three-day set, and you can parse the resulting csv and choose the days you want, or you can run the script every hour and create a csv from the most recently published networkstatus. I'm completely open to adding functionality if it will make the process easier.

Note: the lines in the csv are not sorted. I can order the output by date and also sort the platform and version lines so they're not interleaved, if that will help.

Current functionality:

usage: bridge_attributes.py [-h] [-d DESC] [-e EI] [-n NS] [-s NSFILE]
                            [-o OUTPUT] [-O OUTPUT_NAME] [-a]

Obtain bridge metrics

optional arguments:
  -h, --help            show this help message and exit
  -d DESC, --desc DESC  The directory that contains bridge descriptors
  -e EI, --ei EI        The directory that contains bridge extra-info
                        documents
  -n NS, --ns NS        The directory that contains bridge networkstatus
                        documents
  -s NSFILE, --nsfile NSFILE
                        The file path to a specific bridge networkstatus
                        document
  -o OUTPUT, --output OUTPUT
                        The directory where the output is saved (default: cwd)
  -O OUTPUT_NAME, --output-name OUTPUT_NAME
                        The filename to where the output is saved
                        (default: bridges.csv)
  -a, --parse-all       Parse all documents, not only the most recent

Changed 5 years ago by sysrqb

Attachment: bridges.2.csv added

Counts for all bridges and ec2bridges

Changed 5 years ago by sysrqb

Attachment: bridges.csv_all added

Counts from the last three days

comment:16 Changed 5 years ago by sysrqb

I still need to do a little more clean up and create a readme, but you can find the current version here

comment:17 Changed 5 years ago by karsten

Looks like a fine start! I'll comment on the output csv files first:

  • The "ec2bridge" column in the current servers.csv is actually a boolean type, not a number type. Whenever there's a "t" in that column, the "bridges" column contains the number of bridges that are in the EC2 cloud. What you're doing is combining two dimensions, version and ec2bridge, by reporting how many of the EC2 bridges are running Linux. The current servers.csv does not combine dimensions, so there's just one line for the number of Linux bridges and one line for the number of EC2 bridges. That's sufficient for most use cases, so I'd say let's not combine dimensions for now.
  • The column headers should not be repeated for every bridge status. You could check if the output csv file exists and only write the header line if it doesn't.
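
The write-the-header-only-once behavior in the second point reduces to an existence check before appending. A minimal sketch, with illustrative column names:

```python
import csv
import os
import tempfile

def append_rows(path, header, rows):
    """Append rows to a csv file, writing the header line only once."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerows(rows)

# Two runs against the same file yield a single header line.
path = os.path.join(tempfile.mkdtemp(), "bridges.csv")
append_rows(path, ["date", "platform", "bridges"],
            [["2014-01-28", "Linux", 1000]])
append_rows(path, ["date", "platform", "bridges"],
            [["2014-01-29", "Linux", 1010]])

with open(path) as f:
    lines = f.read().splitlines()
```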

Regarding options to run your script: I'd appreciate a default mode of operation that processes only those bridge statuses that it did not process in an earlier run. I think stem has an option to keep a parse history of some kind that you might be able to use here. Note that you'll have to re-read server descriptors and extra-info descriptors in any case, because they might be referenced from many statuses.
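
The "process only new statuses" default amounts to persisting a list of already-handled files between runs. stem's reader can track something like this, but the idea is plain; a sketch with made-up status file names:

```python
import os
import tempfile

def unprocessed(paths, state_file):
    """Return paths not seen in earlier runs, then record them."""
    seen = set()
    if os.path.exists(state_file):
        with open(state_file) as f:
            seen = set(f.read().splitlines())
    new = [p for p in paths if p not in seen]
    with open(state_file, "a") as f:
        for p in new:
            f.write(p + "\n")
    return new

# First run processes both statuses; a second run sees nothing new.
state = os.path.join(tempfile.mkdtemp(), "parse_history")
first = unprocessed(["status-A", "status-B"], state)
second = unprocessed(["status-A", "status-B"], state)
```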

And finally, here are some quick comments on the code, though I can do another, more thorough code review later:

  • Bridge has quite a few attributes that we won't need. For example, os_version isn't something we include in the output. And we wouldn't include versions of other Tor-speaking programs like nTor anytime soon (but rather count them as "other" versions). Oh, and there are no usable contact lines in bridge descriptors, so we don't need the contact attribute. I guess what I'm saying is that this is dead code that shouldn't be there. YAGNI.
  • I didn't see where you store the bridge status publication time in Bridge.
  • Both init and set_descriptor_details could accept stem objects rather than several single parameters.
  • unpadded_base64_to_base_16 looks like something that stem should do for you. If it doesn't, you should ask atagar to implement it in stem.

I didn't make it further through the code yet, but I'm happy to do another review soon. Let me know!

Thanks!

comment:18 Changed 4 years ago by yawning

Cc: yawning added

comment:19 Changed 16 months ago by karsten

Component: Metrics/Analysis → Metrics/Statistics
Severity: Normal
Summary: Obtain attributes of current public bridges → Provide more statistics on current public bridges
Type: task → enhancement

Some parts are outdated, but some may still be relevant. In any case, Metrics/Statistics is likely a better place for this ticket.

comment:20 Changed 16 months ago by karsten

Owner: set to metrics-team
Status: new → assigned