Now that 0.2.9.5-alpha is out (which we think fixes #20499, right?), it might make sense to look through #20501 and see which relays are buggy, and contact their operators to get them to upgrade?
That said, the question I think is not whether it is delivering a bad consensus right now. The question is whether it is running the buggy versions.
Maybe a stem script to take in the list of fallback dirs, and output the ones that are running a buggy version, would be what you want? Then we can re-run the script periodically until we reach the point where we want to un-fallbackdir the ones that stubbornly remain. (And somewhere in there we should do the "send mail to all the operators running buggy versions to let them know that they need to upgrade" step. Pretty soon I think, right? Once 0.2.9.5-alpha packages are out and ready?)
1. We don't want to ever add any fallback directories running the buggy versions,
2. We don't want to ever add any fallback directories that can't deliver a recent consensus, regardless of version, and
3. We want to remove any fallback directories that do either of the above things.
If we modify the fallback selection script to check 1 and 2, 3 will happen automatically when we next rebuild the list (or when we remove failed fallbacks from the existing list).
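A rough sketch of why 3 falls out of 1 and 2: if the list is rebuilt from candidates that pass every check, anything already on the list that now fails simply isn't in the new list. The function name, the check functions, and the candidate fields below are all hypothetical, not taken from the real fallback selection script.

```python
def rebuild_fallback_list(candidates, checks):
    """Return the fingerprints of candidates that pass every check."""
    return {
        fingerprint
        for fingerprint, info in candidates.items()
        if all(check(info) for check in checks)
    }


# Hypothetical checks standing in for items 1 and 2.
def not_buggy_version(info):
    return not info["buggy_version"]


def delivers_recent_consensus(info):
    return info["consensus_recent"]


# Made-up fingerprints and results, purely for illustration.
current_list = {"AAAA", "BBBB", "CCCC"}
candidates = {
    "AAAA": {"buggy_version": False, "consensus_recent": True},
    "BBBB": {"buggy_version": True, "consensus_recent": True},
    "CCCC": {"buggy_version": False, "consensus_recent": False},
    "DDDD": {"buggy_version": False, "consensus_recent": True},
}

new_list = rebuild_fallback_list(
    candidates, [not_buggy_version, delivers_recent_consensus]
)

# Item 3: existing fallbacks that fail either check drop out of the rebuilt list.
removed = current_list - new_list
```

The removed set is also the natural list of operators to mail before the new list ships.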
Sounds good. In terms of 3, we might be happiest if we contact the buggy relay operators and give them some chance to upgrade, rather than immediately cutting out all of the fallback operators who were kind enough to test the alpha version for us. :)
I try to mail fallback operators before removing fallbacks, as most issues are resolvable by the operator, and it helps me determine whether the issue is permanent (lost keys, lost IPs) or temporary (changed ports).
Trac: Summary: changed from "Make sure fallback directories deliver a recent consensus" to "Make sure fallback directories aren't running buggy versions / can deliver a recent consensus"
#20501 has a listing of relays with this issue and a small script to check if relays are serving a stale consensus or not. If you wrap that check into your fallback selection script ya should be good to go.
Bug #20499 affects versions from 0.2.9.1-alpha-dev to 0.2.9.4-alpha-dev and version 0.3.0.0-alpha-dev, so we need to exclude these versions as fallbacks. We can't rely on the authorities to do this, as #20509 has not been deployed to directory authorities yet.
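The version exclusion could look something like the sketch below. It assumes versions are ordered by their numeric components first and then by comparing the status tag as a plain string; that is a simplification for illustration, and a real script would more likely lean on stem's Version class. The helper names are made up here.

```python
import re

# MAJOR.MINOR.MICRO[.PATCHLEVEL][-STATUS_TAG], e.g. "0.2.9.4-alpha-dev"
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\.(\d+))?(?:-(.+))?$")


def parse_version(text):
    """Turn a tor version string into a comparable (numbers, tag) tuple."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError("unparseable tor version: %r" % text)
    major, minor, micro, patch, tag = match.groups()
    return ((int(major), int(minor), int(micro), int(patch or 0)), tag or "")


# Affected range and extra version as stated in this ticket.
FIRST_BUGGY = parse_version("0.2.9.1-alpha-dev")
LAST_BUGGY = parse_version("0.2.9.4-alpha-dev")
ALSO_BUGGY = parse_version("0.3.0.0-alpha-dev")


def is_buggy_version(text):
    """True if the version is one this ticket wants excluded as a fallback."""
    version = parse_version(text)
    return FIRST_BUGGY <= version <= LAST_BUGGY or version == ALSO_BUGGY
```

For example, is_buggy_version("0.2.9.3-alpha") is true under this ordering, while "0.2.9.5-alpha" and "0.2.8.9" are not.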
We should also exclude authorities that can't deliver a recent microdesc consensus, based on the script in #20501. We already download a consensus from every authority to check download times. If we download a microdesc consensus, that's what clients will be downloading. And it's slightly faster.
I think I might need to change this check to tolerate consensuses as old as REASONABLY_LIVE_CONSENSUS (24 hours), because of #20909.
Ideally, we should only tolerate RELAY_CLOCK_SKEW (3 hours) or maybe even RELAY_CLOCK_SKEW - 2 hours = 1 hour, because directory mirrors should update in the first hour. (Check this with the dir-spec.)
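To make the tolerance options concrete, here is a minimal sketch of the freshness check. It assumes the age is measured from the consensus valid-until time, which this ticket doesn't spell out, and the function name is made up. The two constants use the values quoted above.

```python
from datetime import datetime, timedelta, timezone

# Values as quoted in this ticket.
REASONABLY_LIVE_CONSENSUS = timedelta(hours=24)
RELAY_CLOCK_SKEW = timedelta(hours=3)


def consensus_is_acceptable(valid_until, now, tolerance):
    """True if the consensus expired no longer ago than the tolerance.

    Assumption: staleness is judged against valid-until, not valid-after.
    """
    return now <= valid_until + tolerance


# Example: a consensus that expired two hours before the check.
valid_until = datetime(2016, 12, 1, 12, 0, tzinfo=timezone.utc)
now = datetime(2016, 12, 1, 14, 0, tzinfo=timezone.utc)

ok_lenient = consensus_is_acceptable(valid_until, now, REASONABLY_LIVE_CONSENSUS)
ok_skew = consensus_is_acceptable(valid_until, now, RELAY_CLOCK_SKEW)
ok_strict = consensus_is_acceptable(
    valid_until, now, RELAY_CLOCK_SKEW - timedelta(hours=2)
)
```

With these inputs the 24 hour and 3 hour tolerances accept the consensus and the 1 hour tolerance rejects it, so the choice of constant decides which fallbacks get dropped.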
Batch-move updateFallbackDirs.py tickets into a new component, and remove them from maint-0.3.0.
I'm doing this as a separate component, after discussion with teor, mainly because development here seems to be decoupled from development on tor itself: they don't need to have the same release schedules, for example.