Opened 8 years ago

Closed 8 years ago

#4596 closed task (fixed)

Tune PID control knobs

Reported by: mikeperry
Owned by: mikeperry
Priority: High
Milestone:
Component: Core Tor/Torflow
Version:
Severity:
Keywords: MikePerryIteration20111211
Cc: arma, aagbsn, karsten, Sebastian
Actual Points: 16
Parent ID:
Points: 6
Reviewer:
Sponsor:

Description

We've created a ton of options to help us select a feedback mechanism that doesn't knock the network over. We'll need to tune them while monitoring the metrics.torproject.org performance graphs, then select the best settings.

The parameters are at
https://gitweb.torproject.org/torflow.git/blob/HEAD:/NetworkScanners/BwAuthority/README.spec.txt#l483

Change History (5)

comment:1 Changed 8 years ago by mikeperry

Keywords: MikePerryIteration20111211 added
Points: 6

comment:2 Changed 8 years ago by mikeperry

As part of this, I plan to commit a new set of defaults to origin/master, so we can get rid of most/all of the consensus params.

Roger also wants a couple of log messages demoted or logged less frequently.

comment:3 Changed 8 years ago by mikeperry

Cc: arma aagbsn karsten Sebastian added

And despite the circ dampening, we still had an explosion yesterday. After about 10 feedback loops, the system drove the fastest nodes to bandwidths above INT32_MAX. It looks like the middle nodes hit INT32_MAX first, but non-default-policy Exit nodes were close behind.

The good news is that over half of the nodes in the network were experiencing some rate of circuit extend failures (our CPU overload signal); we just weren't properly listening to it. Right now, the feedback loop is disabled, and we see absolutely zero circuit failures across the entire network.

I am thinking this means we need a few things:

  1. We need to rethink how the circ dampening works. The best plan seems to be that if your circ fail rate goes above X% (for X = 10 or 20), we assign you pid_error=0, which would keep your bandwidth value constant for that feedback round.
  2. We can consider altering the PID setpoint so that each node class (Guard, Middle, Exit, Guard+Exit) gets its own setpoint, and the node balancing error becomes relative only to other nodes in your class (to prevent situations like Middle nodes always being faster than the rest of the network due to being more prevalent).
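
The two proposals above can be sketched roughly as follows. This is only an illustration, not the Torflow code: the `Node` fields, the `class_setpoints` and `pid_error` helpers, the 10% cutoff, and the definition of the error as relative deviation from the class mean are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed cutoff: X = 10% (the comment above suggests 10 or 20).
CIRC_FAIL_CUTOFF = 0.10


@dataclass
class Node:
    # Hypothetical record; field names are illustrative only.
    name: str
    node_class: str        # "Guard", "Middle", "Exit", or "Guard+Exit"
    measured_bw: float     # measured bandwidth for this feedback round
    circ_fail_rate: float  # fraction of circuit extends that failed (0..1)


def class_setpoints(nodes):
    """Return one setpoint per node class: the mean measured bandwidth."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for n in nodes:
        totals[n.node_class] += n.measured_bw
        counts[n.node_class] += 1
    return {c: totals[c] / counts[c] for c in totals}


def pid_error(node, setpoints, cutoff=CIRC_FAIL_CUTOFF):
    """Error relative to the node's own class, or 0 if it looks overloaded."""
    # Proposal 1: too many circuit failures means hold the bandwidth constant.
    if node.circ_fail_rate > cutoff:
        return 0.0
    # Proposal 2: compare only against other nodes in the same class.
    setpoint = setpoints[node.node_class]
    if setpoint == 0:
        return 0.0
    return (node.measured_bw - setpoint) / setpoint


nodes = [
    Node("a", "Middle", 200.0, 0.00),
    Node("b", "Middle", 100.0, 0.00),
    Node("c", "Exit", 50.0, 0.50),   # overloaded, so its error is forced to 0
    Node("d", "Exit", 150.0, 0.05),
]
setpoints = class_setpoints(nodes)
errors = {n.name: pid_error(n, setpoints) for n in nodes}
```

In this sketch the overloaded Exit node "c" still counts toward its class setpoint; whether such nodes should be excluded from the average is a separate design question.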

Do both of these plans make sense? Should we try them both at the same time?

comment:4 Changed 8 years ago by mikeperry

I lied, we still see plenty of circuit failure now. I was just not recording it because the feedback loop was disabled in the consensus.

comment:5 Changed 8 years ago by mikeperry

Actual Points: 16
Resolution: fixed
Status: new → closed

I did a bunch of tuning. I think everything is good enough to just watch it closely for a while. We had a couple of nasty issues with Guard nodes that should be fixed, but we're going to have to wait a few weeks for clients to relocate. This is going to require long-term observation.
