Opened 2 years ago

Closed 17 months ago

#9889 closed enhancement (implemented)

Write a script to either confirm or confute Weather's "Tshirt: yes/no"

Reported by: arma Owned by: hellais
Priority: normal Milestone:
Component: Atlas Version:
Keywords: Cc: karsten, lunar, sreenatha.dev@…
Actual Points: Parent ID:
Points:

Description

Right now we have a service called Tor Weather that is supposed to monitor the consensus and look for relays that qualify for a tshirt:
https://www.torproject.org/getinvolved/tshirt

The qualification criteria include this pretty simple one:

  • Operate a fast Tor relay that's been running for the past two months: you are eligible if you allow exits to port 80 and you average 100 KBytes/s traffic, or if you're not an exit but you average 500 KBytes/s traffic.

The trouble is that we not infrequently get mail from relay operators who appear to qualify but Tor weather hasn't sent them a mail. The latest one is:
https://atlas.torproject.org/#details/DA24B9CD2AA8C02B9068A168C420DC302D969937

This behavior is probably a Tor weather bug. But is it "Tor weather doesn't think you deserve one yet" or "something else"? We could help narrow down the issue, as well as help *operators* narrow down the issue, if Atlas et al formed their own opinion and said it on the relay's page.

For clarification, I think we should interpret the requirement as "take the average actually-used bandwidth over the past 60 days, counting a 0 for all the time your relay isn't up".

And heck, if this starts working well we could simplify Tor weather to just check one of these sites to see if you qualify yet.

Speaking of which, should this be an Onionoo ticket?

Child Tickets

Change History (14)

comment:1 Changed 2 years ago by arma

  • Cc karsten lunar added

Adding some more people in hopes they can help decide the right trac component.

comment:2 Changed 2 years ago by karsten

Onionoo has the data you're looking for. Here's how you'd ask for it:

So, should there be a "Tshirt: yes/no" in Atlas and Globe? Maybe not. An unexpected "no" or "yes" would only leave people wondering which of the requirements they failed to meet.

How about Somebody (TM) writes a simple script to run the steps above and provide debugging information why the relay operator should have a t-shirt or not? There can be many reasons, like the relay not being around long enough, too much downtime, not permitting port 80, etc. This script could be a shell script or a simple website with a JavaScript part. This could be a fine volunteer task.

The better long-term fix would be to make Weather use Onionoo's data for all its operation. In fact, it's one of the example use cases I had in mind when designing Onionoo (search for "Weather" on https://www.torproject.org/projects/onionoo.html.en). That rewritten Weather service would use Onionoo to:

  • search for relays, similar to how Atlas and Globe perform searches, which would allow us to stop linking to metrics' relay search for searching for a relay fingerprint (of course, it could also link to Atlas for that search);
  • check the last_seen field of a subscribed relay to detect if a relay goes offline; and
  • implement the check above to decide whether a relay operator should have a t-shirt.

I bet the codebase of that rewritten Weather would be much smaller than of the current Weather. All the hard work is already done by Onionoo. This rewrite would also be a fine volunteer task.

But let's do the short-term things before the long-term things: write a script (Python, JavaScript, $your-favorite-scripting-language-that-can-parse-JSON, etc.) to perform the steps sketched out above. Anybody?

comment:3 Changed 18 months ago by sreenatha

  • Cc sreenatha.dev@… added

Hi. I've written a python script to address this issue.

https://github.com/lucyd/weather-rewrite/blob/master/tshirt.py

Some minor points regarding the script :

  • In an attempt to make the script modular, I've designed it to make atmost 3 onionoo requests in total(2 for details document and 1 for bandwidth document). This can be brought down to 2(1 for details document and 1 for bandwidth document).
  • Considered 2 months as 2 * 30 * 86400 seconds i.e, 60 days.
  • Used Atlas's logic of considering 'last_restarted' field in relay's details document to calculate the uptime.
  • Null values have been assumed to be represented as 'null' in the onionoo documents.

comment:4 Changed 18 months ago by karsten

  • Status changed from new to needs_revision

Hi Sreenatha! Thanks for working on this script. A few comments:

  • Want to clone the metrics-tasks.git repository and add your script under a new task-9889/ directory? We can still use your script for the Weather rewrite, but having it in metrics-tasks.git means it won't get lost, regardless of how the Weather rewrite goes.
  • Making 2 requests for details document is something we can tweak later. But did you consider using Onionoo's fields parameter to reduce response size of details documents? (Note that this doesn't work for bandwidth documents.)
  • Considering 2 months as 60 days is perfectly fine as simplification.
  • Using last_restarted to calculate uptime is not a good idea. It means people are encouraged not to restart their relay if they want to have a t-shirt. But restarts are necessary to upgrade in case of security fixes. How about you use the recently added uptime documents? You could say that a relay needs to be around at least 95% in the past two months in order to be considered.
  • Not sure what you mean above about handling null values. See the protocol specification for when Onionoo puts in null.
  • If I provide a non-existing fingerprint, I get an IndexError. You should probably check that the relays object contains at least one object.
  • Rather than only accepting fingerprints as input, how about you use Onionoo's search parameter and accept anything as search term that Atlas/Globe accept? Of course, that would mean processing possibly more than 1 relay at the same time.
  • There seems to be some duplicate code for fetching things from Onionoo and processing the result. Maybe you could move that to a function?
  • Not sure how Pythonic this is, but usually, when you say "if x: return True; else return False;", the better way of saying it is "return x == True;" or simply "return x;".
  • I didn't review the math behind check_tshirt() yet, but it would help if you could print details like computed uptime, computed average bandwidth, and computed result whether the relay allows exiting to port 80. That would be in addition to your end result why you think the relay qualifies for a t-shirt and why or why not.

Again, thanks for working on this! Looking forward to doing a second code review!

comment:5 Changed 18 months ago by sreenatha

Hi Karsten! Thank you for the feedback. I've cloned the metrics-tasks repository and added a task-9889 directory as you suggested. After some time hacking around, I've come up with two versions of the tshirt code(with and without multi-threading), both of which can be found at the following URL.

https://github.com/lucyd/metrics-tasks/commits/mybranch/task-9889/tshirt.py

  • In the normal approach, I've incorporated the changes suggested along with some other that were missing previously like checking in the port ranges of exit_policy_summary.
  • In the multi-threaded approach, each of the threads invoked handles a single relay's data.
  • The check_tshirt method fetches required onionoo documents and stores them in global variables for the threads to access individual entries.
  • Synchronization across threads which share the global variables has been ensured using a binary lock.
  • On execution, each of the threads calculates the relay uptime, bandwidth and exit allowance details. After calculation, the thread lock is acquired and the information is sent to print_debug_info before releasing the lock for other threads.
  • There is no strict order of execution of the threads. I don't see how that can be a problem though.

Should I split the script into two files? One part could fetch the bandwidth, uptime, details documents and store them locally while the second part reads these documents and runs the threads. These two scripts could then be run sequentially with a shell script.

comment:6 Changed 18 months ago by karsten

Hi Sreenatha! Looks good! A few more comments below:

  • I pushed your changes to the metrics-tasks.git. I also fixed a trivial issue with checking exit ports.
  • What's the reason for using threads? In theory, you have to make three downloads with search=$searchterm in any case, regardless of the search query, and you cannot process results before having all three responses. Why add the complexity of using threads?
  • Some ideas for making calculate_sum prettier:
    • Calculating the sum in this function and dividing it outside of it doesn't seem like what you want to do. You're really interested in the average, both for bandwidth and uptime. Why not turn this function into calculate_2mo_avg? In fact, callers shouldn't have to mess with the graph history object at all, but that function should do everything from removing data more than 2 months ago, summing up values, dividing by count, and returning the average.
    • The cleaner approach to skipping data that is more than 2 months old is to start at start and move forward in interval steps until you're at a datetime that's within the past 2 months. The reason is that Onionoo bandwidth and uptime files don't contain exactly the past, say, 3 months from the time you do the fetch, but they may return, for example, 3 months 1 week ago to 1 week ago. If you look at start and interval, you're on the safe side.
    • The calculate_sum function shouldn't divide by 1000.0, because that's specific to bandwidth and not to uptime.
    • You need to multiply all array elements in values with factor, even in uptime documents.
    • You're currently not adding null values to the sum, but you're counting them when computing the average. For example, for 1, null, 2 you calculate an average of 3 / 3 = 1, but really the average is 3 / 2 = 1.5.

Want to do another revision?

comment:7 Changed 18 months ago by sreenatha

Yeah, definitely.

Come to think of it, what you said regarding the threading overhead makes sense. Having a single method to handle all the relays will avoid unnecessary complexity of synchronizing the threads.

comment:8 Changed 18 months ago by karsten

arma says: "Here's a hint: I wonder if Tor weather is looking at consensus weights over time, since nobody ever updated it when we added the bwauths." Something to look into when we have a working script.

comment:9 Changed 18 months ago by sreenatha

https://github.com/lucyd/metrics-tasks/commit/2f8696e1492c23a2d08562b2dead973ef0d82d2e

  1. Are the eligibility criteria in print_debug_info evaluated in the right order? Should the script print all the reasons for failing the criteria?
  2. Is it better if we store the documents fetched from onionoo locally? In that case, we can add the If-Modified-Since field in the request header to increase the speed and reduce load on onionoo.

comment:10 Changed 18 months ago by karsten

Regarding your questions, the evaluation in print_debug_info looks fine, and storing fetched documents locally doesn't seem necessary at this point.

Your updated branch looks good! I made a few more fixes here: https://gitweb.torproject.org/karsten/metrics-tasks.git/shortlog/refs/heads/task-9889. If you like them, I'll push this branch to the official metrics-tasks.git.

Thanks!

comment:11 Changed 17 months ago by karsten

Pushed to metrics-tasks.git. arma, does this resolve this ticket?

comment:12 Changed 17 months ago by meejah

hi sreenatha,

Karsten asked me to take a look at this branch, too. Overall, it looks really good! I just have a couple minor suggestions:

  1. For dicts, the recommended way to check for a key is just "if key in dict" not "if key in dict.keys()", for example on line 82 of https://gitweb.torproject.org/karsten/metrics-tasks.git/blob/2f8696e1492c23a2d08562b2dead973ef0d82d2e:/task-9889/tshirt.py it would just be: if '3_months' not in responseuptime?
  1. I would recommend using the tool "pep8" to check your syntax. PEP8 recommends 4-spacing indenting, so you'll have to use "pep8 --ignore=E111" to turn that off since this is using 2-space. I think it's more important to follow whatever is in the rest of the repository, but in general 4-space indenting is what most Python code does. I see some warnings for mixed tabs and spaces, so you might want to check your editor configuration.
  1. It would be nice to accept 1 or more command-line options for relay hashes to check, and only ask in case there are none.
  1. The code is well-documented and in general easy to follow.
  1. Besides useful doc-strings, you've included some appropriate inline comments, which is good as well!

If you've never heard of the "requests" library, it is worth checking out for things like this -- it's a lot easier to use than urllib2 and checks SSL certs by default (which urllib2 does not). I'm not suggesting you switch this out in the t-shirt script, this is just for future reference :)

comment:13 Changed 17 months ago by sreenatha

Hi meejah,

I modified the script to reflect some of the changes you suggested.

https://github.com/lucyd/metrics-tasks/commit/0dabd9903c8005d344a3860063fe5c44f5cf2e8f

However, there seems to be no specific indentation scheme that's followed throughout the metrics-tasks repository. So, I stuck to the previous 2-space indent as some of the print statements are pretty lengthy and having a 4-space indent would split almost every print statement into 2 lines. If it is preferred, anyway, please let me know and I'll change it accordingly.

comment:14 Changed 17 months ago by karsten

  • Resolution set to implemented
  • Status changed from needs_revision to closed
  • Summary changed from Add "Tshirt: yes/no" to Atlas and Globe to Write a script to either confirm or confute Weather's "Tshirt: yes/no"

Thanks, sreenatha. I pushed your commit to the official metrics-tasks repo.

Regarding 2-space vs. 4-space, there's indeed no common scheme in the metrics-tasks repository. I'd say leave it as is, and switch to 4-space for new tasks and when re-using this code in other projects that already do 4-space.

Changing the ticket summary to what we actually did to resolve it, and resolving it. Thanks!

Note: See TracTickets for help on using tickets.