The consensus-health checker can look at these values and warn if one of them gets too high or too low. What values should it consider normal, and when should it begin warning?
Hm, I don't really know which static values we should consider normal in this case. Maybe #8145 (moved) is kind of related?
I guess that in an ideal future we would have some kind of anomaly detection (a dynamic system) to find abnormalities in those values. (Although, the current anomaly detection system we have for censorship events does not work too well, does it?)
> Hm, I don't really know which static values we should consider normal in this case. Maybe #8145 (moved) is kind of related?
Only if you can translate that into upper/lower bounds I can write into the consensus-health checker code.
> I guess that in an ideal future we would have some kind of anomaly detection (a dynamic system) to find abnormalities in those values. (Although, the current anomaly detection system we have for censorship events does not work too well, does it?)
I hope static bounds will do the trick for now. I agree that we're not very good at writing anomaly detection systems.
I'm attaching a plot of flag thresholds reported by moria1 and gabelmoo, which I'm going to renew in 1 week and in 2 weeks. Then we can define bounds when we want to get notified.
Just added a new graph. The values for stable-mtbf and guard-wfu deviate more than expected, which may well be a bug we simply didn't see before. I think we should wait for the other authorities to upgrade to 0.2.4.10-alpha-dev or higher and report these values, too. Then we can define thresholds for the consensus-health checker to warn.
Interesting graphs!
How come you graphed only those three authorities? Do they all run different versions of Tor? The stable_mtbf and guard_wfu graphs are kind of weird, indeed.
These three are the only authorities running recent enough Tor versions to report flag thresholds. Once the other authorities upgrade, they'll be included in the graphs, too.
Updated the graph once more. Finally, we have all nine authorities reporting their flag thresholds, with interesting results. A few observations with respect to finding lower/upper bounds for what the consensus-health checker should consider normal:
The mean stable_uptime of most authorities is around 7.2 days (620000 seconds), whereas turtles' mean stable_uptime is 17.6 days. What's up with turtles, and should we still consider those values normal? How about 5 and 20 (or 10?) days as lower and upper bounds to catch extreme values?
I can hardly see a stable state in stable_mtbf. Without turtles, I'd say that gabelmoo, moria1, and dannenberg are heading somewhere, but that process takes very long, probably too long for continuous consensus-health warnings. How about 1 second as lower bound and 3e+6 seconds (34.7 days) as upper bound to see what turtles is up to?
fast_speed looks quite stable, well, except for turtles. I'd say 25 and 75 kB/s would be good lower/upper bounds. But what turtles sets there seems too low.
guard_wfu looks okay. We could probably set a lower bound of 90 to learn about extremes (and 99.99 as upper bound, just to learn when authorities become too demanding).
guard_tk takes a while to get stable after authorities went down for some time (which is what I think was the case with dizum). We could warn about values below 4e+05 seconds (4.6 days) and above 8e+05 seconds (9.3 days).
guard_bw_inc_exits and guard_bw_exc_exits look quite stable, too. But what is turtles doing there? Without turtles, I'd say 1e+05 and 3e+05 are fine lower/upper bounds.
enough_mtbf looks like it's fine with a lower and upper bound of 1, so that we learn when it goes down to 0.
Do these limits make any sense? And what's the reason for turtles behaving differently?
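For illustration only, the bounds proposed above could be written down as a static table plus a range check, roughly like this. This is a minimal sketch in Python, not the consensus-health checker's actual code: the names `BOUNDS` and `check_thresholds` are hypothetical, the numbers are just the proposals from this comment (taking 20 days as the stable_uptime upper bound), and it assumes the reported values have already been converted to the units used here (seconds, kB/s for fast_speed, percent for guard_wfu).

```python
# Hypothetical sketch: static lower/upper bounds for flag thresholds,
# using the values proposed in this comment. Not actual checker code.
DAY = 86400  # seconds per day

# name -> (lower bound, upper bound); units as used in this thread
BOUNDS = {
    "stable_uptime": (5 * DAY, 20 * DAY),   # seconds
    "stable_mtbf": (1, 3e6),                # seconds
    "fast_speed": (25, 75),                 # kB/s
    "guard_wfu": (90, 99.99),               # percent
    "guard_tk": (4e5, 8e5),                 # seconds
    "guard_bw_inc_exits": (1e5, 3e5),
    "guard_bw_exc_exits": (1e5, 3e5),
    "enough_mtbf": (1, 1),                  # warn as soon as it drops to 0
}


def check_thresholds(authority, thresholds, bounds=BOUNDS):
    """Return one warning string per threshold outside its bounds."""
    warnings = []
    for name, value in sorted(thresholds.items()):
        if name not in bounds:
            continue  # no bounds defined for this threshold, so skip it
        low, high = bounds[name]
        if value < low:
            warnings.append(f"{authority}: {name}={value} is below {low}")
        elif value > high:
            warnings.append(f"{authority}: {name}={value} is above {high}")
    return warnings


if __name__ == "__main__":
    # Invented example values, only to show the output format.
    example = {"stable_uptime": 17.6 * DAY, "fast_speed": 10, "enough_mtbf": 1}
    for line in check_thresholds("turtles", example):
        print(line)
```

With bounds like these, a stable_uptime of 17.6 days would still pass the 20-day upper bound but would be flagged if the upper bound were 10 days instead, which is why the choice between the two matters.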
I agree that it would be neat to have this graph on the metrics website. However, I can't afford the time to write and maintain the additional code, and I don't think yatei can handle yet another thing to do. I'm happy to re-run the graphing script every now and then on my laptop if there's need for an updated graph. Sorry.
atagar, is this something you want to do in your Python DocTor? If not, I'd close this ticket, because I'm not working on the Java DocTor anymore. Thanks!
Hi Karsten. I'd be happy to add flag-thresholds checks to DocTor if we have defined values and you can tell me what they indicate the authority operator should do (DocTor checks should be actionable, otherwise they aren't terribly helpful). Stem already checks the constraints on parameter values since those have defined bounds in the dir-spec, but the earlier correspondence didn't seem to settle on concrete ranges for flag-thresholds.
Good points. Actually, I don't really know what an authority operator would do in such a case. Feel free to close, or to leave open at minor or trivial priority.