Updating Tor's Hard-Coded Fallback Directory Mirror List

This page has instructions for updating Tor's list of fallback directory mirrors.

Read Fallback Directory Mirrors for information about how fallbacks work.

A Typical Release

So you'd like to update Tor's hard-coded list of fallback directory mirrors for the next release?

For major releases, starting at least a month before the code freeze:

  1. Check the existing fallbacks to see if their details have changed, or they are down,
  2. Check for any new potential fallbacks - stable, high-bandwidth relays that haven't yet been whitelisted or blacklisted,
  3. Wait a week or so for operator responses,
  4. Generate the list of fallbacks, submit a patch, and
  5. Tell the stem and metrics relay search (atlas) maintainers that the list has changed.

For minor releases, if we know that many fallbacks are down, or if it has been more than 6 months since we last checked:

  1. Do a quick refresh using the instructions in Section 3b.
  2. Tell the stem and metrics relay search (atlas) maintainers that the list has changed.

As long as there are major releases every 6 months, and we backport the fallback list every major release, we won't need to check fallbacks for minor releases.

(These instructions work, but I'd like to automate some of the complex configuration, so that the process is more repeatable, and anyone can do it. See the bug numbers next to each task for details.)

How to Contact Relay Operators

When you email relay operators, use the email address (or other contact) in their relay's ContactInfo.

Individual Emails

Put the operator email in the To field.

Please also CC:

  • the address they replied with,
  • any addresses they CC'd,
  • teor (who did the first few fallback lists), and
  • tor-team.

Use the same subject for similar emails, so tor-team members can filter them out if desired.

Group Emails

Put the list of operator emails in the BCC field.

Please also CC:

  • teor (who did the first few fallback lists),
  • tor-relays, so that operators without ContactInfo can opt-in (and for transparency),
  • atagar, the stem maintainer, and
  • the metrics team, who maintain relay search (atlas).

When telling relay operators their relays have been included as fallback directory mirrors, please BCC the operators, then send a copy of the email to tor-relays with an explanatory note. This helps avoid confusing operators about whether they are on the list or not.

1. Checking Existing Fallbacks

Fallbacks can go down, become slow, or change their identity key (fingerprint), addresses, or ports. We:

  1. check if any existing fallbacks have changed,
  2. inform relay operators,
  3. modify the hard-coded list, and
  4. modify the whitelist and blacklist.

a. Finding Broken Fallbacks

To check existing fallbacks, we use the existing fallback list as the whitelist.

  1. Run the fallback script in check_existing mode, saving the list:
    scripts/maint/updateFallbackDirs.py check_existing > good_fallbacks 2> good_fallbacks.log
    

This can take a long time, as it downloads ~150MB of OnionOO data, parses it, then tries each fallback's DirPort.

b. Handling Broken Fallbacks

To produce a log that only has the whitelisted fallbacks in it, do:

for fp in $(grep id= scripts/maint/fallback.whitelist | cut -d" " -f3 | cut -d= -f2); do
  grep "$fp" good_fallbacks.log >> good_fallbacks.whitelist.log
done
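
The shell loop above can also be sketched in Python, which is easier to adapt if you want to post-process the matches. This is a minimal sketch, not part of the tor scripts: the function names are hypothetical, and it assumes (as the shell pipeline does) that the fingerprint is the id= value in the third space-separated field of each whitelist line.

```python
def whitelist_fingerprints(whitelist_lines):
    """Roughly mirror: grep id= | cut -d" " -f3 | cut -d= -f2"""
    fingerprints = []
    for line in whitelist_lines:
        if "id=" not in line:
            continue  # comments and blank lines have no id= field
        fields = line.rstrip("\n").split(" ")
        if len(fields) < 3:
            continue
        # Assumption: the third field looks like id=<fingerprint>
        parts = fields[2].split("=")
        if len(parts) >= 2 and parts[1]:
            fingerprints.append(parts[1])
    return fingerprints


def whitelisted_log_lines(whitelist_lines, log_lines):
    """Return the log lines that mention a whitelisted fingerprint,
    grouped by fingerprint in whitelist order, like the shell loop."""
    matches = []
    for fp in whitelist_fingerprints(whitelist_lines):
        matches.extend(line for line in log_lines if fp in line)
    return matches
```

To use it, read scripts/maint/fallback.whitelist and good_fallbacks.log into lists of lines and pass them to whitelisted_log_lines().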

The script will log warnings like:

WARNING::0C2C599AFCB26F5CFC2C7592435924C1D63D9484 excluded: has it gained an IPv6 address [2001:41d0:a:fb7a::1]:9001?
WARNING::FCB6695F8F2DC240E974510A4B3A0F2B12AB5B64 excluded: has it changed IPv4 from 195.154.8.111 to 178.62.12.24?
WARNING::Consensus download: 20.5s too slow from IPredator (197.231.221.211:9030), max download time 15.0s.

It will also produce info-level messages about other excluded fallbacks (see "Handling Missing Fallbacks"):

INFO::01A9258A46E97FF8B2CAC7910577862C14F2C524 not a candidate: running avg too low (0.116029)
INFO::1ECD73B936CB6E6B3CD647CC204F108D9DF2C9F7 not a candidate: guard avg too low (0.000000)
INFO::6A640018EABF3DA9BAD9321AA37C2C87BBE1F907 not a candidate: version not recommended

If relays are down, stem will produce log messages like this:

INFO::Initiating consensus download from armbrust (176.31.180.157:143).
INFO::Unable to retrieve a consensus from armbrust: <urlopen error [Errno 61] Connection refused>

Other relays may have been missing for so long, they aren't even in the OnionOO data.
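
When the log is long, it can help to sort the lines by type before working through them. The following is a minimal Python sketch under stated assumptions: the patterns are derived only from the sample log lines quoted above, so other script or stem versions may word their messages differently, and the names PATTERNS and classify are hypothetical.

```python
import re

# Patterns based on the sample log lines quoted on this page (an assumption:
# the real messages may vary between script and stem versions).
PATTERNS = [
    ("excluded",
     re.compile(r"^WARNING::([0-9A-F]{40}) excluded: ")),
    ("too_slow",
     re.compile(r"^WARNING::Consensus download: .* too slow from (\S+) ")),
    ("not_candidate",
     re.compile(r"^INFO::([0-9A-F]{40}) not a candidate: ")),
    ("unreachable",
     re.compile(r"^INFO::Unable to retrieve a consensus from ([^:]+): ")),
]


def classify(line):
    """Return (kind, subject) for a recognised log line, else None.

    subject is the fingerprint for 'excluded' and 'not_candidate' lines,
    and the relay nickname for 'too_slow' and 'unreachable' lines.
    """
    for kind, pattern in PATTERNS:
        match = pattern.match(line.rstrip("\n"))
        if match:
            return kind, match.group(1)
    return None
```

Lines that classify() does not recognise (it returns None) still need to be read by hand.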

For each warning and each excluded relay:

  1. email the operator (see "How to Contact Relay Operators"), and ask them if the change is permanent,
    • if there's nothing an operator can do to fix the issue (for example guard average), there's no reason to email them,
  2. remove the fallback from the hard-coded list, by commenting it out,
    • you must modify the start of each line, so that stem's automated parser detects the fallback change,
    • backport the modified hard-coded list to every Tor version >= 0.2.8 that might ever be re-released,
  3. move the relay from the fallback.whitelist to the fallback.blacklist,
    • if the script will pick up improvements automatically (uptime, guard average, version), there's no reason to blacklist the relay.

If the operator has fixed the issue, leave the fallback in the hard-coded list and the whitelist. But if they have a history of unexpected changes, it is safer to remove the relay from the hard-coded list, and blacklist it.

We want to streamline handling minor changes in #20910.

Handling Missing Fallbacks

Some of the excluded fallbacks are only logged at info level, because they are excluded by the script.

To find these fallbacks, do:

diff -u src/or/fallback_dirs.inc good_fallbacks

(The diff should be reasonably small, because the fallbacks are sorted in fingerprint order.)

Then, once you have a list of these fallbacks, search the log for their fingerprints. If they aren't in the log, try searching Atlas for their fingerprint. Also try their IPv4 address.
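
If the diff is hard to read, the missing fingerprints can be listed directly. This is a small Python sketch, assuming every fallback entry in both files contains id= followed by a 40-character uppercase hex fingerprint; the function names are hypothetical. Commented-out entries also contain id=, so they count as present in whichever file they appear in.

```python
import re

# Assumption: fingerprints appear as id=<40 uppercase hex characters>.
FINGERPRINT = re.compile(r"id=([0-9A-F]{40})")


def fingerprints(text):
    """Return the set of fingerprints mentioned in a fallback list."""
    return set(FINGERPRINT.findall(text))


def missing_fallbacks(old_list_text, new_list_text):
    """Fingerprints in the old list that are absent from the new list."""
    return sorted(fingerprints(old_list_text) - fingerprints(new_list_text))
```

Pass in the contents of src/or/fallback_dirs.inc and good_fallbacks, then search the log for each fingerprint that is returned.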

Handling New IPv6 Addresses

Adding an IPv6 address is a special case: clients never knew about the new IPv6 address, so they will continue to bootstrap fine using the IPv4 address. There's no need to remove the relay from the existing hard-coded list.

But the relay's whitelist entry must include the new IPv6 address, so that it can be selected for the next release. (If the IPv6 address isn't stable, then that address can't be added to the whitelist. But since the descriptor and whitelist don't match, the relay can never be selected. We'll fix this in #20175.)

Handling Consensus Download Failures

Sometimes fallbacks are temporarily slow to serve consensuses, or temporarily down. Check again, at a different time, and from a different network connection, if you can.

If relays are down a lot, the uptime checks will eliminate them in the next release. But it might be worth removing them from the current release.

Formatting Changes Files

When we replace the entire list of fallbacks, this is what a changes file should look like:

  o Minor feature (fallback directory mirrors):
    - Replace the 80 remaining fallbacks of the 100 originally introduced
      in Tor 0.2.8.1-alpha in January 2016, with a list of 100 fallbacks
      (40 new, 60 existing, 20 removed) generated in March 2016.
      Closes ticket 17158; patch by teor.

When we remove unsuitable fallbacks, this is what a changes file should look like:

  o Minor feature (fallback directory mirrors):
    - Remove 1 fallback that is no longer suitable, leaving 80 of the 100
      fallbacks originally introduced in Tor 0.2.8.2-alpha in March 2016.
      Closes ticket 20243; patch by teor.

You can use the following commands to count fallbacks:

Remaining/Replacement:

grep -c id= src/or/fallback_dirs.inc

New:

git diff src/or/fallback_dirs.inc | grep id= | grep + | wc -l

Removed (by deletion):

git diff src/or/fallback_dirs.inc | grep id= | grep - | grep -v \* | wc -l

Removed (by comments):

git diff src/or/fallback_dirs.inc | grep id= | grep + | grep \* | wc -l

The existing count is the replacement count, minus the new count. (It can't be calculated directly from the git diff, because diff skips unchanged lines.)
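
The four counts can also be collected in one pass. The sketch below is a hypothetical Python helper that mirrors the grep pipelines above, except that it only looks for + and - at the start of a diff line. The counts overlap wherever the grep pipelines overlap (for example, a commented-out line is both an added line and a removal by comment), so treat the result as a starting point and check it against the diff.

```python
def count_fallback_changes(diff_text, replacement_count):
    """Count fallback changes in `git diff src/or/fallback_dirs.inc` output.

    replacement_count is the `grep -c id=` count for the new file.
    """
    new = removed_by_deletion = removed_by_comment = 0
    for line in diff_text.splitlines():
        if "id=" not in line:
            continue  # this also skips the ---/+++ file headers
        if line.startswith("+"):
            new += 1  # like the grep, this includes commented-out lines
            if "*" in line:
                removed_by_comment += 1
        elif line.startswith("-") and "*" not in line:
            removed_by_deletion += 1
    return {
        "new": new,
        "removed_by_deletion": removed_by_deletion,
        "removed_by_comment": removed_by_comment,
        # Existing = replacement count minus new count, as described above.
        "existing": replacement_count - new,
    }
```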

Once you have made the change, run make test to check for parse errors.

2. Finding Potential Fallbacks

The fallback.whitelist contains relays whose operators opted-in to become a fallback directory mirror at the last release. But new, stable relays could have been started since then, so we need to find them, and ask their operators if they want to opt-in. We:

  1. decide how many fallbacks to select, and pick a bandwidth cutoff,
  2. find potential fallbacks,
  3. revise the bandwidth cutoff,
  4. ask relay operators to opt-in,
  5. add opt-ins to the whitelist (and opt-outs to the blacklist).

a. Picking a Bandwidth Cutoff (optional)

See "How Many Fallbacks?" to decide how many fallbacks to choose for each release.

In 0.2.8, we emailed tor-relays, and specifically contacted relay operators whose relays' advertised bandwidth was 3 MByte/s or more. This was 150x the expected extra load of 20 KByte/s, calculated using the reported bandwidth of the directory authorities:

https://lists.torproject.org/pipermail/tor-relays/2015-December/008393.html

But the actual figure for directory authority directory downloads is ~100 Mbit/s (source: metrics team), and bootstrapping client consensus downloads are somewhere between 10% and 50% of that (the other 50% are directory and exit directory document downloads). So the client consensus load is at most:

100 Mbit/s / 8 bit/byte * 10% to 50% clients = 1.25 to 6.25 MByte/s, or roughly 1 to 7 MByte/s

We want to select 200 fallback directory mirrors in 0.2.9, and we think there will be 100 remaining after 2.5 years. So the extra load is somewhere between:

1 MByte/s / 200 =  5 KByte/s
7 MByte/s / 100 = 70 KByte/s

which is an unnoticeable amount, although it would tend to spike when there are lots of new Tor Browser and Orbot downloads.
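
The arithmetic above can be re-run when the inputs change. This is a minimal Python 3 sketch using only the figures quoted in this section; the constant and function names are hypothetical, and it uses decimal units (1 MByte = 1000 KByte), as the figures above do.

```python
AUTHORITY_DIR_LOAD_MBIT = 100  # ~100 Mbit/s, the figure quoted above
BITS_PER_BYTE = 8


def client_load_mbyte(percent_clients):
    """Client consensus load in MByte/s for a given client share (in %)."""
    return AUTHORITY_DIR_LOAD_MBIT / BITS_PER_BYTE * percent_clients / 100


def per_fallback_kbyte(total_mbyte, fallback_count):
    """Extra load per fallback in KByte/s (1 MByte = 1000 KByte)."""
    return total_mbyte * 1000 / fallback_count
```

client_load_mbyte(10) and client_load_mbyte(50) give 1.25 and 6.25 MByte/s, which the text rounds outward to 1 and 7 MByte/s. per_fallback_kbyte(1, 200) and per_fallback_kbyte(7, 100) then give the 5 and 70 KByte/s bounds.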

So it seems reasonable to have an advertised bandwidth cutoff of 3 MByte/s when selecting 200 fallback directory mirrors. If we ever select ~300 fallback directory mirrors, it might be worth making the cutoff 2 MByte/s (by changing MIN_BANDWIDTH in the updateFallbackDirs.py script).

b. Find Potential Fallbacks

We want to find only new potential fallbacks, and ask their operators to opt-in. So we combine the whitelist and the blacklist, and tell the script not to select anything in the combined list. (All those operators have already been contacted, and responded.)

  1. Make sure the whitelist is the current one from git master (not the modified one from step 1)
  2. Make sure the blacklist is saved in a git branch or a backup
  3. Append the whitelist to the blacklist
    cat scripts/maint/fallback.whitelist >> scripts/maint/fallback.blacklist
    
  4. Set the log level to INFO for both the script and stem:
    - logging.basicConfig(level=logging.WARNING)
    + logging.basicConfig(level=logging.INFO)
    
    - logging.getLogger('stem').setLevel(logging.WARNING)
    + logging.getLogger('stem').setLevel(logging.INFO)
    
  5. Output candidate fallbacks (ignore the whitelist, log operator contact details):
    - OUTPUT_CANDIDATES = False
    + OUTPUT_CANDIDATES = True
    
  6. Run the fallback selection script, saving the list:
    scripts/maint/updateFallbackDirs.py > potential_extra_fallbacks 2> potential_extra_fallbacks.log
    

This can take a long time, as it downloads ~150MB of OnionOO data, parses it, then tries each fallback's DirPort.

Ignoring the whitelist when contacting potential fallbacks will be automated in #20192.

Logging candidate fallbacks at warning level will be automated in #20177.

Logging stem connection failures at warning level will be automated in #20178.

We might remove relays without contact info in #20883, but it's easy just to scroll past them.

c. Revising the Bandwidth Cutoff (optional)

When we change the number of relays, we need to revise the bandwidth cutoff, then re-run the fallback selection process.

The 3 MByte/s minimum ends up selecting a lot of relays. But we know that if we select 200 relays, the lowest relay will actually have N MByte/s. So there's not much point in asking operators to opt-in, if their relays will never be chosen.

So impose a limit of 200 relays, then find the minimum bandwidth in the list, and re-run "Find Potential Fallbacks" with this minimum bandwidth.

  1. Use the latest whitelist and blacklist
  2. Output candidate fallbacks (applying the whitelist, output operator contact details):
    OUTPUT_CANDIDATES = True
    
    - MAX_FALLBACK_COUNT = None if OUTPUT_CANDIDATES else 200
    + MAX_FALLBACK_COUNT = 200
    
    - INCLUDE_UNLISTED_ENTRIES = True if OUTPUT_CANDIDATES else False
    + INCLUDE_UNLISTED_ENTRIES = False
    
    - OUTPUT_SORT_FIELD = 'contact' if OUTPUT_CANDIDATES else 'fingerprint'
    + OUTPUT_SORT_FIELD = 'measured_bandwidth'
    
  3. Run the fallback selection script, saving the list:
    scripts/maint/updateFallbackDirs.py > potential_extra_fallbacks 2> potential_extra_fallbacks.log
    

During the 0.2.9 fallback regeneration in September 2016, the last relay was:

youlooksuspicious
Flags: Fast Guard HSDir Running Stable V2Dir Valid
Bandwidth: 8.0 MByte/s, Consensus Weight: 16900

So revise the bandwidth cutoff to 8 MByte/s, restore the MAX_FALLBACK_COUNT, INCLUDE_UNLISTED_ENTRIES and OUTPUT_SORT_FIELD, and re-run "Find Potential Fallbacks":

- MIN_BANDWIDTH = 102.4 * 30.0 * 1024.0
+ MIN_BANDWIDTH = 8.0 * 1024.0 * 1024.0

This process will be automated in ticket #20876.
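
Until then, the cutoff can be worked out by hand or with a few lines of Python. The sketch below is not taken from updateFallbackDirs.py: the function names are hypothetical, and it assumes you have already extracted the candidates' bandwidths from the script output. It picks the lowest bandwidth among the top candidates, and converts a MByte/s figure into the form used by MIN_BANDWIDTH in the diff above.

```python
def revised_cutoff(bandwidths, max_count=200):
    """Lowest bandwidth among the max_count highest-bandwidth candidates."""
    if not bandwidths:
        raise ValueError("no candidate bandwidths given")
    top = sorted(bandwidths, reverse=True)[:max_count]
    return top[-1]


def mbyte_to_min_bandwidth(mbyte_per_sec):
    """Convert MByte/s to the value used for MIN_BANDWIDTH in the diff
    above (8.0 MByte/s becomes 8.0 * 1024.0 * 1024.0)."""
    return mbyte_per_sec * 1024.0 * 1024.0
```

For the September 2016 run, the lowest of the 200 selected relays had 8.0 MByte/s, and mbyte_to_min_bandwidth(8.0) matches the new MIN_BANDWIDTH line above.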

d. Asking Relay Operators

After (re-)running "Find Potential Fallbacks", contact operators using the process in the 'Group Emails' section above, BCC'ing each operator in the list.

Keep the info-level logs somewhere publicly accessible (a trac ticket is a good place), so that people can see exactly what decisions the script made.

e. Whitelisting Fallbacks

As responses come in, add them to the whitelist.

If any operators confirm their relays will never be suitable, add them to the blacklist. (This makes it easier to switch to opt-out in future.)

3. Creating a New List of Fallbacks

a. How Many Fallbacks? (optional)

Every extra fallback:

  • spreads the load more evenly,
  • provides diverse IP addresses and ports for clients to contact,
  • provides another relay that can take up the load if some fallbacks fail.

But it also:

  • adds to the size of Tor,
  • adds a relay that might go down or change details,
  • adds to a list that might be blocked on some networks.

In 0.2.8:

  • Relay operators opted-in between December 2015 and April 2016,
  • 100 fallbacks were selected in April 2016 for 0.2.8.2-alpha (1fd4340),
  • 11 were eliminated in June and July 2016 before the 0.2.8.5-rc release, and
  • 8 were eliminated in September 2016 before the 0.2.8.8 and 0.2.9.4-alpha releases.

Since we lost approximately 20 fallbacks in 6 months, the expected release lifetime is 2 years, and we want to be able to skip updating fallbacks for 1 release (another 6 months) if we need to, we want to select at least:

100 + 20*5 = 200 fallbacks

to have 100 by the end of the next 2.5 years.

Due to #19610, IPv6 fallbacks serve both microdescriptor consensuses and microdescriptors to IPv6-only clients. (Or at least they will once #19608 is fixed, hopefully in 0.2.9.) These clients contact 15 fallbacks to download ~7000 relay microdescriptors, so we want to make sure there are 20 by the end of the 2.5 years, in case the Tor network has grown to ~10000 relays by that time.

Since we lost 6 of the 25 IPv6 fallbacks in 6 months, and the expected release/fallback lifetime is 2.5 years, we want to select at least:

20 + 6*5 = 50 IPv6 fallbacks

to have 20 by the end of those 2.5 years. We should get close to 50 IPv6 fallbacks by selecting 200 fallbacks in total.
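
Both estimates follow the same rule: the number we want left at the end, plus the number we expect to lose in each 6-month period over the lifetime. A minimal Python sketch (the function name is hypothetical, and it assumes the loss rate stays constant):

```python
def fallbacks_to_select(target_remaining, lost_per_six_months, lifetime_years):
    """Fallbacks to select so that target_remaining are left at the end."""
    # Number of 6-month periods in the lifetime (2.5 years -> 5 periods).
    periods = round(lifetime_years * 2)
    return target_remaining + lost_per_six_months * periods
```

fallbacks_to_select(100, 20, 2.5) gives the 200 fallbacks above, and fallbacks_to_select(20, 6, 2.5) gives the 50 IPv6 fallbacks.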

Since we select fallbacks in bandwidth order, each extra fallback is likely to be less reliable than those already selected. And since the extra load is so low, there's not much advantage in selecting many more fallbacks than we need to last the 2.5 year release lifetime.

The original proposal was to select 20% of the guards (~200 relays) as fallback directory mirrors. Having 100 by the end of the 2.5 years is fewer than we might have wanted, but since the extra load is so small, it's unlikely to be an issue.

b. Generating the List

We want to find the full list of fallback directory mirrors.

  1. Make sure the whitelist and blacklist and script are the current ones from git master (not the modified ones from other instructions)
  2. Set the log level to INFO for both the script and stem:
    - logging.basicConfig(level=logging.WARNING)
    + logging.basicConfig(level=logging.INFO)
    
    - logging.getLogger('stem').setLevel(logging.WARNING)
    + logging.getLogger('stem').setLevel(logging.INFO)
    
  3. If you have a working IPv6 connection (or can set up an IPv6 tunnel), perform IPv6 checks:
    - PERFORM_IPV6_DIRPORT_CHECKS = False if OUTPUT_CANDIDATES else False
    + PERFORM_IPV6_DIRPORT_CHECKS = True
    
  4. Run the fallback selection script, saving the list:
    scripts/maint/updateFallbackDirs.py > src/or/fallback_dirs.inc 2> fallback_dirs.log
    

This can take a long time, as it downloads ~150MB of OnionOO data, parses it, then tries each fallback's DirPort.

c. Discovering Broken Fallbacks

You can check the entire whitelist by running the script like this:

  1. Make sure the whitelist and blacklist and script are the current ones, including the latest operator updates (not the modified ones from other instructions)
  2. Set the log level to INFO for both the script and stem:
    - logging.basicConfig(level=logging.WARNING)
    + logging.basicConfig(level=logging.INFO)
    
    - logging.getLogger('stem').setLevel(logging.WARNING)
    + logging.getLogger('stem').setLevel(logging.INFO)
    
  3. Output candidate fallbacks (ignore the whitelist, log operator contact details):
    - OUTPUT_CANDIDATES = False
    + OUTPUT_CANDIDATES = True
    
  4. Perform IPv4 checks:
    - PERFORM_IPV4_DIRPORT_CHECKS = False if OUTPUT_CANDIDATES else True
    + PERFORM_IPV4_DIRPORT_CHECKS = True
    
  5. If you have a working IPv6 connection (or can set up an IPv6 tunnel), perform IPv6 checks:
    - PERFORM_IPV6_DIRPORT_CHECKS = False if OUTPUT_CANDIDATES else False
    + PERFORM_IPV6_DIRPORT_CHECKS = True
    
  6. Run the fallback selection script, saving the list:
    scripts/maint/updateFallbackDirs.py > whitelist_check 2> whitelist_check.log
    

This can take a long time, as it downloads ~150MB of OnionOO data, parses it, then tries each fallback's DirPort.

These steps will be automated in #20897.

(It's also a good idea to re-run these checks without the extra modifications to the script, as in "Generating the List", before generating the actual list.)

Follow the steps in "Handling Broken Fallbacks" above to follow-up with relay operators whose relays are on the whitelist, but are being excluded from the list.

d. Informing Relay Operators

After "Generating the List", contact operators using the process in the 'Group Emails' section above, BCC'ing each operator in the list. It aids transparency if you first post a draft list, and then allow the operators to make corrections. (Once you've emailed all the operators on the draft list, you only need to email the operators whose relays were added to or deleted from the final list. This makes it easier for you, and keeps the volume of email down.)

Keep the info-level logs somewhere publicly accessible (a trac ticket is a good place), so that people can see exactly what decisions the script made.

It can be helpful to label the list and logs with the date and time (in UTC) and the tor git commit used to generate the list. This aids others in reproducing the list. (Except for the network reachability and download time checks, which happen in real-time and are non-deterministic.)
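
One way to build such a label is sketched below in Python. The label wording and the function name are only suggestions, not an existing convention; the commit would typically come from running git rev-parse HEAD in the tor checkout.

```python
from datetime import datetime, timezone


def list_label(commit, when=None):
    """Build a label with the UTC generation time and the tor git commit.

    when must be a timezone-aware datetime; it defaults to the current time.
    """
    if when is None:
        when = datetime.now(timezone.utc)
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    return "generated {} from tor commit {}".format(stamp, commit)
```

The resulting string can go at the top of the ticket comment or in the names of the attached list and log files.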

If you want to generate the same list, but with operator contact details, follow the steps in "Finding Broken Fallbacks". To find the contacts for an old list, follow those steps, but use:

-PERFORM_IPV4_DIRPORT_CHECKS = False if OUTPUT_CANDIDATES else True
+PERFORM_IPV4_DIRPORT_CHECKS = False
-OUTPUT_COMMENTS = True if OUTPUT_CANDIDATES else False
+OUTPUT_COMMENTS = True
-CONTACT_COUNT = True if OUTPUT_CANDIDATES else False
+CONTACT_COUNT = True
-OUTPUT_SORT_FIELD = 'contact' if OUTPUT_CANDIDATES else 'fingerprint'
+OUTPUT_SORT_FIELD = 'contact'
-BLACKLIST_EXCLUDES_WHITELIST_ENTRIES = True
+BLACKLIST_EXCLUDES_WHITELIST_ENTRIES = False
-MAX_DOWNTIME_DAYS = 0 if MUST_BE_RUNNING_NOW else 7
+MAX_DOWNTIME_DAYS = 0
-CUTOFF_RUNNING = .90
-CUTOFF_V2DIR = .90
-CUTOFF_GUARD = .90
+CUTOFF_RUNNING = .0
+CUTOFF_V2DIR = .0
+CUTOFF_GUARD = .0

If you have the old OnionOO files, set:

-LOCAL_FILES_ONLY = False
+LOCAL_FILES_ONLY = True

This will be automated in #20946.

e. Updating the List in Tor's Development Branch

Do:

git add src/or/fallback_dirs.inc

Then add a changes file based on the instructions in "Formatting Changes Files".

Remember to run make test to check for parse errors!

f. Backporting the List to Other Tor Versions

It's worth backporting this list to any tor versions >= 0.2.8 that are still being actively maintained. Otherwise old tor versions may gradually have more and more difficulty bootstrapping.

Open a ticket for the backport, and do the backport once the master branch has had some testing.

4. Checking the List Before the Stable Release

Follow the steps in "Checking Existing Fallbacks", and remove the fallbacks that don't work, using the instructions in that section.

Last modified on Nov 21, 2017, 10:57:54 PM