Opened 8 years ago

Closed 8 years ago

#4425 closed defect (fixed)

Monitor and tweak PID feedback control

Reported by: mikeperry
Owned by: mikeperry
Priority: Medium
Milestone:
Component: Core Tor/Torflow
Version:
Severity:
Keywords: MikePerryIteration20111120
Cc:
Actual Points: 20
Parent ID:
Points: 6
Reviewer:
Sponsor:

Description

I need to watch my #1976 code for a bit, and then merge my pid_control branch down, and then generate some graphs and watch the results closely.

Attachments (2)

bandwidth-comparison-relays-votes.png (137.2 KB) - added by mikeperry 8 years ago.
Relay fraction graph for Nov 11
bandwidth-comparison-measured-votes.png (131.8 KB) - added by mikeperry 8 years ago.
Measured fraction graph from Nov 11


Change History (15)

comment:1 Changed 8 years ago by karsten

As for generating some graphs, does the #2394 code work for you?

comment:2 in reply to:  1 Changed 8 years ago by mikeperry

Replying to karsten:

As for generating some graphs, does the #2394 code work for you?

I am getting 400 errors from metrics.tp.o for the example urls in the README:
https://gitweb.torproject.org/metrics-tasks.git/blob/HEAD:/task-2394/README

Also, is there a magic parameter to say "most-recent-consensus" for those three urls? I'd like to have this scripted and logging graphs for every consensus, so I can watch the changes as I try tweaking various things.

comment:3 Changed 8 years ago by mikeperry

Ok, I think I just mispasted. I got the urls working now. Still wondering about a "latest-consensus" shortcut for scripting. If it doesn't exist, I guess I could just hack date to somehow give me the last hour UTC? I guess that's what I need?

comment:4 Changed 8 years ago by mikeperry

Ok, with the example consensus in the README, I get 2447 lines of output like:
"We're missing descriptor 5514060a9697a4bd52c206a08c262abb9bf1b66d. Please make sure that all referenced server descriptors are available. Continuing anyway."

I assume it is not finding the descriptors file? I have them here:

metrics-tasks.git/task-2394# ls descriptors/
2011-07-13-05-00-00-consensus  2011-07-13-05-00-00-descriptors  2011-07-13-05-00-00-votes

That filename matches the one you have in the README.

comment:5 in reply to:  3 Changed 8 years ago by karsten

Replying to mikeperry:

Ok, I think I just mispasted. I got the urls working now. Still wondering about a "latest-consensus" shortcut for scripting. If it doesn't exist, I guess I could just hack date to somehow give me the last hour UTC? I guess that's what I need?

No shortcuts. Something like this should work: date -u +%Y-%m-%d-%H-00-00.
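
A minimal sketch of how that date format could be wrapped for scripting. It assumes GNU date (for the `-d @EPOCH` form); `consensus_stamp` is a hypothetical helper name, not something from the task-2394 code.

```shell
#!/bin/sh
# Format an epoch time as the timestamp of the UTC hour it falls in,
# matching names like 2011-07-13-05-00-00.
# Assumption: GNU date is available (-d @EPOCH is not portable to BSD date).
consensus_stamp() {
    date -u -d "@$1" +%Y-%m-%d-%H-00-00
}

# Timestamp for the current UTC hour.
consensus_stamp "$(date +%s)"
```

Minutes and seconds are hard-coded to 00 in the format string, so any time within an hour maps to the same timestamp.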

comment:6 in reply to:  4 Changed 8 years ago by karsten

Replying to mikeperry:

Ok, with the example consensus in the README, I get 2447 lines of output like:
"We're missing descriptor 5514060a9697a4bd52c206a08c262abb9bf1b66d. Please make sure that all referenced server descriptors are available. Continuing anyway."

Can you do a git pull? That warning isn't in the code anymore since July.

Also, I tweaked the README a little bit by renaming 2011-07-13-05-00-00-descriptors to 2011-07-13-05-00-00-serverdesc so that it looks more like the URL. Nothing to worry about, but maybe it removes one potential error source.

Let me know if you run into any other problems.

comment:7 Changed 8 years ago by karsten

Please git pull again. I fixed a few things. You'll also like the new plot.sh.

comment:8 Changed 8 years ago by mikeperry

I do like plot.sh. Thanks for that. Though on line 5 you don't use the $COMMONS variable in your file check. My commons-codec-1.4 is a system package.

I also am getting 500 server errors for the urls. Is there an issue with metrics.tp.o having a delay for current consensuses? I tried some from yesterday though and it still gave me the same 500 error using plot.sh.

comment:9 Changed 8 years ago by karsten

Please run git pull again. I fixed line 5. I also subtracted 30 minutes from the current system time and rounded that time to the last full UTC hour. That should give the metrics host enough time to fetch the consensus we're requesting.
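
The offset-and-round step described above might look roughly like the following. This is a sketch under assumptions, not the actual plot.sh code: it assumes GNU date, and `lagged_consensus_stamp` and its optional second argument are illustrative names.

```shell
#!/bin/sh
# Subtract a lag from a given epoch time, then round down to the full UTC hour.
# Assumption: GNU date; the 1800-second default mirrors the 30 minutes
# mentioned in the comment above.
lagged_consensus_stamp() {
    now="$1"
    lag="${2:-1800}"
    date -u -d "@$(( now - lag ))" +%Y-%m-%d-%H-00-00
}

# Most recent consensus hour that should already be available.
lagged_consensus_stamp "$(date +%s)"
```

With a 30-minute lag, a run at 14:10 UTC asks for the 13:00 consensus, while a run at 14:45 UTC asks for the 14:00 one.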

The 500 server errors come from the metrics database being overloaded. I just kicked Tomcat, and it works again. If the problem happens again, ask me or weasel to restart Tomcat on yatei. It's always safe to do that, and in this case it even helps. (I'm working on an improved metrics database that doesn't run into these problems, but don't hold your breath.)

Changed 8 years ago by mikeperry

Relay fraction graph for Nov 11

Changed 8 years ago by mikeperry

Measured fraction graph from Nov 11

comment:10 Changed 8 years ago by mikeperry

Karsten: The votes for urras in these two graphs look odd to me, so I want to just confirm I'm reading them right.

On the relay-votes graph, it looks like 60% of the Guard nodes have a ratio of < 0.1 (I assume 0)? Similarly, 40-50% of the Exits also have a ratio of ~0? That seems to be a lot of nodes with no capacity.. Yet urras's measurements look fine when not in PID mode. Definitely a bug somewhere, it seems..

On the measured-votes graph, it looks like urras hates Guards and Exits so much that their total measured bandwidth is way way below what the consensus has, and so their CDF only goes up to like 10-20% of the consensus total? In other words, you did not scale these graphs (which is fine). Is that also right?

comment:11 Changed 8 years ago by mikeperry

2011-11-11-14-00-00 is the exact valid-after time for those two attached files, btw.

comment:12 Changed 8 years ago by mikeperry

Ok, it looks like that is what is happening. A lot of Guards are getting 0 as a bandwidth vote.. Still not sure why yet.

Also not sure why it also hates Exits.. That might be a different bug, as I don't actually see any Exits with 0 bw in the vote file... It does look like it is hating them less as of consensuses today, though.
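
One way to check a vote file for zero-bandwidth entries per flag is sketched below. It assumes the file uses the directory vote layout, where each relay's "s" (flags) line comes before its "w" line and a measured value appears as `Measured=N`; `count_zero_measured` is a hypothetical helper, and a bandwidth authority's own output file would need a different pattern.

```shell
#!/bin/sh
# Count relays carrying a given flag whose "w" line contains Measured=0.
# Assumption: per-relay entries list "r", then "s", then "w" lines.
count_zero_measured() {
    flag="$1"
    file="$2"
    awk -v flag="$flag" '
        /^r / { has_flag = 0 }                      # new relay entry
        /^s / {
            has_flag = 0
            for (i = 2; i <= NF; i++) if ($i == flag) has_flag = 1
        }
        /^w / && has_flag {
            for (i = 2; i <= NF; i++) if ($i == "Measured=0") zero++
        }
        END { print zero + 0 }
    ' "$file"
}
```

Usage would be along the lines of `count_zero_measured Guard path/to/vote` and `count_zero_measured Exit path/to/vote`, which makes it easy to compare the two flags across successive votes.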

comment:13 Changed 8 years ago by mikeperry

Actual Points: 20
Resolution: fixed
Status: new → closed

I solved the Guard 0 bw bug as well as the exit issues and some other problems, and PID feedback code is now running on 4/5 of our bw auths. We now have a ton of knobs and options to tune, to try to find out which values provide sufficient feedback without breaking the network. I've created #4596 for that.
