It will be very useful to understand the attributes of the current public bridges, most importantly the platforms on which they run, available pluggable transports, and which tor versions are being used. We should use the sanitized descriptors to obtain this information. Do we want anything else?
I wonder, should we use the output of your script to complement the servers.csv file provided by metrics-web? There are a few requirements for that, though:
Allow the script to be run on a periodically updated local directory containing recent descriptors, for example one kept up to date by running "rsync -arz --delete --exclude 'relay-descriptors/votes' metrics.torproject.org::metrics-recent in" (where "in" is the local target directory).
Remove all bridges that didn't have the Running flag in a bridge network status. Only include server descriptors referenced from bridge network statuses, and only include extra-info descriptors referenced from server descriptors.
Add the number of bridges in the EC2 cloud, that is, bridges whose nickname starts with "ec2bridger".
Produce a bridges.csv output file similar to servers.csv that can be merged with a new relays.csv (produced by current metrics-web) into the new servers.csv.
If you're interested in writing such a script, I'd want to run it on yatei and write the glue code to include your results on the metrics website, including making new graphs that visualize the new data.
Might not be the correct ticket for this, but some BridgeDB stats might also be interesting. Like the distribution of bridges into pools, the number of requests per day, etc.
Might not be the correct ticket for this, but some BridgeDB stats might also be interesting. Like the distribution of bridges into pools, the number of requests per day, etc.
Please create a ticket for this, and I'll reply there.
Might not be the correct ticket for this, but some BridgeDB stats might also be interesting. Like the distribution of bridges into pools, the number of requests per day, etc.
Please create a ticket for this, and I'll reply there.
Or wait, let me take that back. I'll reply here, because maybe the two things are related.
Distribution of bridges into pools could be made a part of the statistics produced here. The script would have to parse sanitized bridge pool assignments in addition to sanitized bridge descriptors.
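A sketch of what that parsing could look like (untested, and assuming the CollecTor format where each data line is "FINGERPRINT pool-name [key=value ...]"):

```python
import collections

def pool_counts(path):
    # Count bridges per distribution pool in one sanitized bridge
    # pool assignment file; annotation and timestamp header lines
    # are skipped.
    counts = collections.Counter()
    with open(path) as f:
        for line in f:
            if line.startswith('@') or line.startswith('bridge-pool-assignment'):
                continue
            parts = line.split()
            if len(parts) >= 2:
                counts[parts[1]] += 1
    return counts
```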
Number of requests per day isn't something that BridgeDB exports right now. If we want that data, we'll have to specify a file format.
I wonder, should we use the output of your script to complement the servers.csv file provided by metrics-web? There are a few requirements for that, though:
I think this would be great! I think the only data point we can't supply is the country of the bridge, but that's not a huge loss. I can definitely adapt the script to produce the necessary info and output to csv. We might actually want more information, so I might make this the default output (or produce the csv by providing a command line option). Another option is to create a script specifically for this and create a second script that produces a superset of metrics/bridge attributes.
Allow the script to be run on a periodically updated local directory containing recent descriptors, for example one kept up to date by running "rsync -arz --delete --exclude 'relay-descriptors/votes' metrics.torproject.org::metrics-recent in" (where "in" is the local target directory).
Remove all bridges that didn't have the Running flag in a bridge network status. Only include server descriptors referenced from bridge network statuses, and only include extra-info descriptors referenced from server descriptors.
Add the number of bridges in the EC2 cloud, that is, bridges whose nickname starts with "ec2bridger".
Produce a bridges.csv output file similar to servers.csv that can be merged with a new relays.csv (produced by current metrics-web) into the new servers.csv.
None of these should be a problem.
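For the Running-flag filtering in particular, a minimal stem-based sketch (untested; relying on the @type annotations in the sanitized status files) could be:

```python
from stem.descriptor import parse_file

def running_fingerprints(status_path):
    # Fingerprints of bridges that had the Running flag in one
    # sanitized bridge network status; descriptors not referenced
    # from these entries can then be skipped entirely.
    return set(entry.fingerprint
               for entry in parse_file(status_path)
               if 'Running' in entry.flags)
```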
If you're interested in writing such a script, I'd want to run it on yatei and write the glue code to include your results on the metrics website, including making new graphs that visualize the new data.
You'll accept a Python script? :) I can write it in Java if you prefer, though.
Might not be the correct ticket for this, but some BridgeDB stats might also be interesting. Like the distribution of bridges into pools, the number of requests per day, etc.
Please create a ticket for this, and I'll reply there.
Or wait, let me take that back. I'll reply here, because maybe the two things are related.
Distribution of bridges into pools could be made a part of the statistics produced here. The script would have to parse sanitized bridge pool assignments in addition to sanitized bridge descriptors.
I don't think this would be too difficult, but I don't think it will be a very interesting graph. Maybe some people will find it useful.
Number of requests per day isn't something that BridgeDB exports right now. If we want that data, we'll have to specify a file format.
Right, we'll need to specify a format, and we'll also need to decide what is contained in a sanitized BridgeDB log. Once we decide on that, we should be able to provide metrics such as the total number of requests, requests for a specific transport, requests using a specific language, etc., which may help us help the users.
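Purely as a hypothetical starting point (nothing here is specified anywhere yet, so all column names are made up), a sanitized per-day export could be as simple as:

```
date,distributor,transport,language,requests
2014-01-20,https,obfs3,en,1234
2014-01-20,email,vanilla,fa,56
```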
I wonder, should we use the output of your script to complement the servers.csv file provided by metrics-web? There are a few requirements for that, though:
I think this would be great! I think the only data point we can't supply is the country of the bridge, but that's not a huge loss.
Right, we'd have to include country codes in sanitized bridge descriptors for that. But we also don't have country codes of relays these days, so that's fine. Future work.
I can definitely adapt the script to produce the necessary info and output to csv. We might actually want more information, so I might make this the default output (or produce the csv by providing a command line option). Another option is to create a script specifically for this and create a second script that produces a superset of metrics/bridge attributes.
What additional information would you want to include? Maybe we can extend the CSV file format? In theory, the format should allow pretty much everything you'd want to include in a graph.
Allow the script to be run on a periodically updated local directory containing recent descriptors, for example one kept up to date by running "rsync -arz --delete --exclude 'relay-descriptors/votes' metrics.torproject.org::metrics-recent in" (where "in" is the local target directory).
Remove all bridges that didn't have the Running flag in a bridge network status. Only include server descriptors referenced from bridge network statuses, and only include extra-info descriptors referenced from server descriptors.
Add the number of bridges in the EC2 cloud, that is, bridges whose nickname starts with "ec2bridger".
Produce a bridges.csv output file similar to servers.csv that can be merged with a new relays.csv (produced by current metrics-web) into the new servers.csv.
None of these should be a problem.
If you're interested in writing such a script, I'd want to run it on yatei and write the glue code to include your results on the metrics website, including making new graphs that visualize the new data.
You'll accept a Python script? :) I can write it in Java if you prefer, though.
Python is fine! If you stick to the requirements above with all input data coming from the rsync'ed directory and all output data being one or more .csv files, then that's all I need to integrate your script into metrics-web. Feel free to start hacking on this in a metrics-tasks.git branch, and we'll move over the result to metrics-web when it's ready.
I wonder, should we use the output of your script to complement the servers.csv file provided by metrics-web? There are a few requirements for that, though:
I think this would be great! I think the only data point we can't supply is the country of the bridge, but that's not a huge loss.
Right, we'd have to include country codes in sanitized bridge descriptors for that. But we also don't have country codes of relays these days, so that's fine. Future work.
I can definitely adapt the script to produce the necessary info and output to csv. We might actually want more information, so I might make this the default output (or produce the csv by providing a command line option). Another option is to create a script specifically for this and create a second script that produces a superset of metrics/bridge attributes.
What additional information would you want to include? Maybe we can extend the CSV file format? In theory, the format should allow pretty much everything you'd want to include in a graph.
The only metric we (I?) are specifically looking at right now is the number of bridges that correctly configure their ExtOR port. Metrics currently tells us very useful information about PT usage, but we don't actually know how many bridges are providing this information. We may also want other metrics so we can answer more interesting questions after we see these.
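One rough way to approximate that with stem (an assumption on my part: treating extra-info descriptors that report at least one transport line as bridges with a working ExtORPort, since per-transport lines only show up when pluggable transports are configured):

```python
from stem.descriptor import parse_file

def transport_counts(extra_info_path):
    # Count sanitized extra-info descriptors that report at least one
    # "transport" line; parse_file picks the right parser from the
    # @type annotation in CollecTor files.
    total = with_transport = 0
    for desc in parse_file(extra_info_path):
        total += 1
        if desc.transport:
            with_transport += 1
    return with_transport, total
```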
You'll accept a Python script? :) I can write it in Java if you prefer, though.
Python is fine! If you stick to the requirements above with all input data coming from the rsync'ed directory and all output data being one or more .csv files, then that's all I need to integrate your script into metrics-web. Feel free to start hacking on this in a metrics-tasks.git branch, and we'll move over the result to metrics-web when it's ready.
Excellent. I'll start integrating what we've discussed and committing it to a metrics-tasks repo.
I'm adding the keyword bridgedb-parsers, not because I think that this code should live in BridgeDB, but because I still think it would be really nice if there was one canonical place to put Python descriptor parsers.
Not that you should feel pressured to do this, since it's a lot of extra work. :)
I'm adding the keyword bridgedb-parsers, not because I think that this code should live in BridgeDB, but because I still think it would be really nice if there was one canonical place to put Python descriptor parsers.
No argument from me :) but 1) I don't think we'll have a complete module before this ticket is closed and 2) I don't know how Karsten feels about dependencies in the metrics-tasks repo. I created #10725 (moved) for your pleasure and thoughts. :) (now hopefully this ticket won't get derailed too much :)
Not that you should feel pressured to do this, since it's a lot of extra work. :)
I don't know how Karsten feels about dependencies in the metrics-tasks repo.
Well, dependencies in metrics-web is what I care more about, and that's where this new code is supposed to live eventually, right? For metrics-web, a dependency on stem as Git submodule would be fine (and should be sufficient for parsing bridge descriptors). Also, more generally, dependencies on packages in Debian wheezy or wheezy-backports are fine, too. Oh, and you might add libraries you depend on to metrics-web itself, though that's rather ugly.
I don't know how Karsten feels about dependencies in the metrics-tasks repo.
Well, dependencies in metrics-web is what I care more about, and that's where this new code is supposed to live eventually, right?
True.
For metrics-web, a dependency on stem as Git submodule would be fine (and should be sufficient for parsing bridge descriptors). Also, more generally, dependencies on packages in Debian wheezy or wheezy-backports are fine, too. Oh, and you might add libraries you depend on to metrics-web itself, though that's rather ugly.
Yeah, I wasn't sure how you felt about this. Originally I wrote a custom parser for this script, but I rewrote it and now I'm using Stem.
I'm attaching the first version of the CSV. Let me know what you think. The lines are a bit out of order; it shouldn't matter when the CSV is parsed, but I can fix it if you want. I'll add support for the EC2 bridges later tonight, hopefully.
I wonder, should we use the output of your script to complement the servers.csv file provided by metrics-web? There are a few requirements for that, though:
Allow the script to be run on a periodically updated local directory containing recent descriptors,
Just to be clear, this script is not expected to handle calculating the average values over the day, correct? This will be handled by glue code before the values are merged into servers.csv?
Great! I'm attaching two files: one contains the number of bridges based on a single NS (the last one available when I synced last night); the second contains the bridge counts from every NS from the last three days. I'm not sure which will be easiest for you. I can also add an option to only parse the network statuses that were published on a certain day. Right now, with the options I implemented, the script can be run over the entire three-day set so you can parse the resulting CSV and choose the days you want, or it can be run every hour to create a CSV from the most recently published network status. I'm completely open to adding functionality if it will make the process easier.
Note that the lines in the CSV are not sorted; I can order the output by date and also sort the platform and version lines so they're not interleaved, if that will help.
Current functionality:
usage: bridge_attributes.py [-h] [-d DESC] [-e EI] [-n NS] [-s NSFILE]
                            [-o OUTPUT] [-O OUTPUT_NAME] [-a]

Obtain bridge metrics

optional arguments:
  -h, --help            show this help message and exit
  -d DESC, --desc DESC  The directory that contains bridge descriptors
  -e EI, --ei EI        The directory that contains bridge extra-info documents
  -n NS, --ns NS        The directory that contains bridge networkstatus documents
  -s NSFILE, --nsfile NSFILE
                        The file path to a specific bridge networkstatus document
  -o OUTPUT, --output OUTPUT
                        The directory where the output is saved (default: cwd)
  -O OUTPUT_NAME, --output-name OUTPUT_NAME
                        The filename to where the output is saved (default: bridges.csv)
  -a, --parse-all       Parse all documents, not only the most recent
The "ec2bridge" column in current servers.csv is actually a boolean type, not a number type. It means that whenever there's a "t" in that column, the "bridges" column contains the number of bridges that in the EC2 cloud. What you're doing is you're combining two dimensions, version and ec2bridge, by reporting how many of the EC2 bridges are running Linux. The current server.csv does not combine dimensions, so there's just one line for the number of Linux bridges and one line for the number of EC2 bridges. That's sufficient for most use cases, so I'd say let's not combine dimensions for now.
The column headers should not be repeated for every bridge status. You could check if the output csv file exists and only write the header line if it doesn't.
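For example (plain stdlib, untested sketch):

```python
import csv
import os

def append_rows(path, header, rows):
    # Write the header only when the file is first created, so later
    # runs append new rows without repeating the column names.
    is_new = not os.path.exists(path)
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)
```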
Regarding options to run your script: I'd appreciate a default mode of operation that processes only those bridge statuses that it did not process in an earlier run. I think stem has an option to keep a parse history of some kind that you might be able to use here. Note that you'll have to re-read server descriptors and extra-info descriptors in any case, because they might be referenced from many statuses.
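If stem's DescriptorReader persistence is the feature I'm thinking of, usage would be roughly as follows (the input path is assumed from the rsync layout above, and handle_entry is a hypothetical callback):

```python
from stem.descriptor.reader import DescriptorReader

def process_new_statuses(handle_entry):
    # With persistence_path set, the reader skips files recorded in an
    # earlier run and saves the updated history when it finishes.
    with DescriptorReader(['in/bridge-descriptors/statuses'],
                          persistence_path='processed-files') as reader:
        for entry in reader:
            handle_entry(entry)
```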
And finally, here are some quick comments on the code, though I can do another, more thorough code review later:
Bridge has quite a few attributes that we won't need. For example, os_version isn't something we include in the output. And we wouldn't include versions of other Tor-speaking programs like nTor anytime soon (but rather count them as "other" versions). Oh, and there are no usable contact lines in bridge descriptors, so we don't need the contact attribute. I guess what I'm saying is that this is dead code that shouldn't be there. YAGNI.
I didn't see where you store the bridge status publication time in Bridge.
Both __init__ and set_descriptor_details could accept stem objects rather than several single parameters.
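Sketching what that could look like, with attribute names following the comments above:

```python
class Bridge(object):
    def __init__(self, status_entry):
        # Accept the stem router status entry directly instead of
        # unpacking its fields at the call site.
        self.fingerprint = status_entry.fingerprint
        self.flags = status_entry.flags

    def set_descriptor_details(self, descriptor):
        # Same idea for the stem server descriptor object.
        self.platform = descriptor.platform
        self.tor_version = descriptor.tor_version
```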
unpadded_base64_to_base_16 looks like something that stem should do for you. If it doesn't, you should ask atagar to implement it in stem.
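If it doesn't exist in stem yet, the conversion itself is only a couple of lines:

```python
import base64
import binascii

def unpadded_base64_to_base_16(digest):
    # Restore the '=' padding that statuses strip, then decode and
    # render as uppercase hex.
    padded = digest + '=' * (-len(digest) % 4)
    return binascii.hexlify(base64.b64decode(padded)).decode('ascii').upper()
```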
I didn't make it further through the code yet, but I'm happy to do another review soon. Let me know!
Some parts are outdated, but some may still be relevant. In any case, Metrics/Statistics is likely a better place for this ticket.
Trac changes:
- Sponsor: N/A → N/A
- Summary: "Obtain attributes of current public bridges" → "Provide more statistics on current public bridges"
- Type: task → enhancement
- Component: Metrics/Analysis → Metrics/Statistics
- Reviewer: N/A → N/A
- Severity: N/A → Normal