Monitor and tweak PID feedback control

added MikePerryIteration20111120 actualpoints::20 component::core tor/torflow owner::mikeperry points::6 priority::medium resolution::fixed status::closed type::defect labels

As for generating some graphs, does the #2394 (moved) code work for you?

Replying to karsten:

As for generating some graphs, does the #2394 (moved) code work for you?

I am getting 400 errors from metrics.tp.o for the example urls in the README: https://gitweb.torproject.org/metrics-tasks.git/blob/HEAD:/task-2394/README

Also, is there a magic parameter to say "most-recent-consensus" for those three urls? I'd like to have this scripted and logging graphs for every consensus, so I can watch the changes as I try tweaking various things.

Ok, I think I just mispasted. I got the urls working now. Still wondering about a "latest-consensus" shortcut for scripting. If it doesn't exist, I guess I could just hack date to somehow give me the last hour UTC? I guess that's what I need?

Ok, with the example consensus in the README, I get 2447 lines of output like: "We're missing descriptor 5514060a9697a4bd52c206a08c262abb9bf1b66d. Please make sure that all referenced server descriptors are available. Continuing anyway."

I assume it is not finding the descriptors file? I have them here:

metrics-tasks.git/task-2394# ls descriptors/
2011-07-13-05-00-00-consensus  2011-07-13-05-00-00-descriptors  2011-07-13-05-00-00-votes

That filename matches the one you have in the README.

Replying to mikeperry:

Ok, I think I just mispasted. I got the urls working now. Still wondering about a "latest-consensus" shortcut for scripting. If it doesn't exist, I guess I could just hack date to somehow give me the last hour UTC? I guess that's what I need?

No shortcuts. Something like this should work: date -u +%Y-%m-%d-%H-00-00.

Replying to mikeperry:

Ok, with the example consensus in the README, I get 2447 lines of output like: "We're missing descriptor 5514060a9697a4bd52c206a08c262abb9bf1b66d. Please make sure that all referenced server descriptors are available. Continuing anyway."

Can you do a git pull? That warning was removed from the code back in July.

Also, I tweaked the README a little bit by renaming 2011-07-13-05-00-00-descriptors to 2011-07-13-05-00-00-serverdesc so that it looks more like the URL. Nothing to worry about, but maybe it removes one potential error source.

Let me know if you run into any other problems.

Please git pull again. I fixed a few things. You'll also like the new plot.sh.

I do like plot.sh. Thanks for that. Though on line 5 you don't use the $COMMONS variable in your file check. My commons-codec-1.4 is a system package.

I also am getting 500 server errors for the urls. Is there an issue with metrics.tp.o having a delay for current consensuses? I tried some from yesterday though and it still gave me the same 500 error using plot.sh.

Please run git pull again. I fixed line 5. I also subtracted 30 minutes from the current system time and rounded that time to the last full UTC hour. That should give the metrics host enough time to fetch the consensus we're requesting.
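
For anyone scripting this outside plot.sh, the rounding described above can be sketched in Python. This is only an illustration of the "subtract 30 minutes, then round down to the full UTC hour" idea; the function name `latest_consensus_stamp` and the `delay` parameter are hypothetical and are not taken from plot.sh.

```python
from datetime import datetime, timedelta, timezone


def latest_consensus_stamp(now=None, delay=timedelta(minutes=30)):
    """Return a YYYY-MM-DD-HH-00-00 stamp for the newest consensus
    that the metrics host has probably fetched already.

    `now` should be a timezone-aware datetime; it defaults to the
    current UTC time. `delay` is the assumed 30-minute safety margin.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Step back by the safety margin, working in UTC.
    shifted = now.astimezone(timezone.utc) - delay
    # Drop minutes and seconds to land on the last full hour.
    floored = shifted.replace(minute=0, second=0, microsecond=0)
    return floored.strftime("%Y-%m-%d-%H-00-00")


if __name__ == "__main__":
    print(latest_consensus_stamp())
```

With this sketch, a call at 14:20 UTC on 2011-11-11 yields 2011-11-11-13-00-00, while a call at 14:45 UTC yields 2011-11-11-14-00-00.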

The 500 server errors come from the metrics database being overloaded. I just kicked Tomcat, and it works again. If the problem happens again, ask me or weasel to restart Tomcat on yatei. It's always safe to do that, and in this case it even helps. (I'm working on an improved metrics database that doesn't run into these problems, but don't hold your breath.)

Trac:

Relay fraction graph for Nov 11

Trac:

Measured fraction graph from Nov 11

Karsten: The votes for urras in these two graphs look odd to me, so I want to just confirm I'm reading them right.

On the relay-votes graph, it looks like 60% of the Guard nodes have a ratio of < 0.1 (I assume 0)? Similarly, 40-50% of the Exits also have a ratio of ~0? That seems like a lot of nodes with no capacity. Yet urras's measurements look fine when not in PID mode. Definitely a bug somewhere, it seems.

On the measured-votes graph, it looks like urras hates Guards and Exits so much that their total measured bandwidth is way way below what the consensus has, and so their CDF only goes up to like 10-20% of the consensus total? In other words, you did not scale these graphs (which is fine). Is that also right?
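
To make the "unscaled" reading concrete, here is a rough Python sketch of how such a curve could come out. It assumes the graph accumulates each relay's measured bandwidth and divides by the consensus total rather than by the authority's own total; whether the #2394 code computes it exactly this way is not confirmed here, and `unscaled_cdf` is a hypothetical name.

```python
def unscaled_cdf(measured_bw, consensus_total):
    """Cumulative measured bandwidth as a fraction of the consensus total.

    Because the divisor is the consensus total and not sum(measured_bw),
    the last point only reaches 1.0 if the authority's measurements add
    up to the same total as the consensus.
    """
    if consensus_total <= 0:
        raise ValueError("consensus_total must be positive")
    running = 0
    points = []
    for bw in sorted(measured_bw):
        running += bw
        points.append(running / consensus_total)
    return points


if __name__ == "__main__":
    # Made-up numbers: four relays measured far below a consensus total of 100.
    print(unscaled_cdf([10, 0, 5, 0], 100))
```

With the made-up input above the curve ends at 0.15, i.e. it tops out at 15% of the consensus total, which is the shape being described for urras.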

2011-11-11-14-00-00 is the exact valid-after time for those two attached files, btw.

Ok, it looks like that is what is happening. A lot of Guards are getting 0 as a bandwidth vote. Still not sure why yet.
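
A quick way to check this against a vote file is to count, per flag, how many relays carry Measured=0. The sketch below is in Python and assumes the usual vote layout, where each relay entry starts with an "r" line, lists its flags on an "s" line, and carries "w Bandwidth=... Measured=..." on a "w" line. It ignores everything else in the vote, and `count_zero_measured` is a hypothetical helper, not part of TorFlow or the #2394 code.

```python
def count_zero_measured(vote_text, flag="Guard"):
    """Return (zero, total): relays with `flag` whose Measured value is 0,
    and all relays with `flag` that have a Measured value at all."""
    zero = total = 0
    flags = ()
    for line in vote_text.splitlines():
        if line.startswith("r "):
            # New relay entry: forget the previous relay's flags.
            flags = ()
        elif line.startswith("s "):
            flags = line.split()[1:]
        elif line.startswith("w ") and flag in flags:
            fields = dict(
                item.split("=", 1) for item in line.split()[1:] if "=" in item
            )
            if "Measured" in fields:
                total += 1
                if int(fields["Measured"]) == 0:
                    zero += 1
    return zero, total


if __name__ == "__main__":
    import sys

    # Usage: python count_zero.py <vote-file> [flag]
    with open(sys.argv[1]) as handle:
        wanted = sys.argv[2] if len(sys.argv) > 2 else "Guard"
        print(count_zero_measured(handle.read(), wanted))
```

Running it once with Guard and once with Exit against the same vote would show whether the zeros are confined to Guards.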

Also not sure why it hates Exits too. That might be a different bug, as I don't actually see any Exits with 0 bw in the vote file. It does look like it is hating them less as of today's consensuses, though.

I solved the Guard 0 bw bug as well as the exit issues and some other problems, and PID feedback code is now running on 4/5 of our bw auths. We now have a ton of knobs and options to tune, to try to find out which values provide sufficient feedback without breaking the network. I've created #4596 (moved) for that.
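
For readers unfamiliar with the term, the sketch below shows a generic textbook discrete PID update in Python, with a unit time step. It is not TorFlow's implementation, and the gain names `k_p`, `k_i`, and `k_d` are illustrative only. They stand in for the kind of knobs mentioned above without claiming to match the actual options.

```python
class PID:
    """Textbook discrete PID controller with a unit time step."""

    def __init__(self, k_p, k_i, k_d):
        self.k_p = k_p  # weight on the current error
        self.k_i = k_i  # weight on the accumulated error
        self.k_d = k_d  # weight on the change in error
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        """Feed in one error sample and return the control output."""
        self.integral += error
        # No derivative term on the first sample: there is no previous error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = error - self.prev_error
        self.prev_error = error
        return (
            self.k_p * error
            + self.k_i * self.integral
            + self.k_d * derivative
        )


if __name__ == "__main__":
    pid = PID(k_p=1.0, k_i=0.5, k_d=0.25)
    for err in (2.0, 1.0, 0.5):
        print(pid.step(err))
```

Too much gain makes the output overshoot and oscillate, while too little makes it react slowly, which is the general trade-off behind tuning for sufficient feedback without breaking the network.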

Trac:
Status: new to closed
Resolution: N/A to fixed
Actualpoints: N/A to 20

closed

changed time estimate to 48h

added 160h of time spent

moved to tpo/network-health/torflow#4425 (closed)
