Opened 5 months ago

Last modified 7 weeks ago

#30817 assigned task

Write a proposal for tor bootstrapping that works on slow links, but avoids slow relays

Reported by: teor
Owned by:
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: tor-bootstrap, teor-backlog
Cc: gaba
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

In #16844, clients on slow links time out before they can download a full consensus.

In #21969 and children, relays fail to bootstrap because a directory authority limits DirPort download speeds. (A similar bug exists for clients that try slow relays.)

We need to redesign tor's bootstrap so that tor works when the link is slow, but tries another relay when the relay is slow.

We should implement multiple concurrent downloads for all directory documents, not just consensuses. Once we have multiple concurrent downloads, we can increase the timeouts substantially.

We should limit the number of concurrent downloads to 3, because if 3 fast relays are all slow, it's probably the link that is slow. And a 3x download size is an acceptable cost. (It probably won't be that bad, because we delay starting the 2nd and 3rd fetches, and terminate them when the first one completes.)
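
A minimal sketch of the staggered start described above, assuming an asyncio-style event loop. The `fetch` coroutine is a stand-in that only sleeps, and `STAGGER_DELAY` is a hypothetical value; neither comes from tor.

```python
import asyncio

MAX_CONCURRENT = 3    # cap proposed in this ticket
STAGGER_DELAY = 0.05  # hypothetical delay before starting the next fetch (seconds)

async def fetch(relay, duration):
    """Stand-in for a real directory fetch: sleeps for `duration` seconds."""
    await asyncio.sleep(duration)
    return relay

async def staggered_fetch(relays):
    """relays: list of (name, simulated_duration). Returns the first relay to finish."""
    if not relays:
        raise ValueError("need at least one relay")
    tasks = []
    try:
        for name, duration in relays[:MAX_CONCURRENT]:
            tasks.append(asyncio.create_task(fetch(name, duration)))
            # Wait for a winner, but only until it is time to start the next fetch.
            done, _ = await asyncio.wait(
                tasks, timeout=STAGGER_DELAY, return_when=asyncio.FIRST_COMPLETED)
            if done:
                return done.pop().result()
        # All permitted fetches are running: wait for the first one to complete.
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        # Terminate the remaining fetches once one has completed.
        for task in tasks:
            task.cancel()

def run_example():
    relays = [("slow-relay", 0.5), ("fast-relay", 0.12), ("other-relay", 0.4)]
    return asyncio.run(staggered_fetch(relays))
```

In this simulation the second relay wins even though it started later, and the other two fetches are cancelled, so the extra download cost stays well below 3x.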

We could also be a bit more clever: after a soft timeout, terminate the download that would take the longest time to finish, and then reduce the number of concurrent downloads by one.
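
One way to pick the download to terminate could look like the sketch below. It assumes the total document size is known in advance and estimates the remaining time from the average rate so far; the `Progress` type and both function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Progress:
    relay: str
    bytes_received: int
    elapsed: float  # seconds since this download started

def estimated_remaining(p, total_bytes):
    """Estimate seconds left from the average rate so far.

    A download that has received nothing is treated as taking forever.
    """
    if p.bytes_received <= 0 or p.elapsed <= 0:
        return float("inf")
    rate = p.bytes_received / p.elapsed
    return max(total_bytes - p.bytes_received, 0) / rate

def pick_download_to_terminate(downloads, total_bytes):
    """Return the download expected to finish last, or None if only one is left."""
    if len(downloads) <= 1:
        return None  # never cancel the last remaining download
    return max(downloads, key=lambda p: estimated_remaining(p, total_bytes))
```

Calling this once per soft timeout would shrink the set of concurrent downloads by one each time, while always keeping at least one download alive.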

We could also simplify the relay/authority selection logic:

  • relays try authorities first, then after a delay, they try other relays (most relays are directory mirrors, so they do this anyway)
  • clients try relays first, then after a delay, they try authorities
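
The two rules above could be expressed as a small schedule function. This is only a sketch: the role names, the `source_schedule` name, and the delay parameter are assumptions, not existing tor identifiers.

```python
def source_schedule(role, fallback_delay):
    """Return (start_delay_seconds, source_kind) pairs in the order they are tried."""
    if role == "relay":
        order = ("authority", "relay")  # relays try authorities first
    elif role == "client":
        order = ("relay", "authority")  # clients try relays first
    else:
        raise ValueError(f"unknown role: {role!r}")
    # The first source starts immediately, the second after the configured delay.
    return [(0.0, order[0]), (fallback_delay, order[1])]
```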

And simplify the ORPort/DirPort selection logic for directory downloads:

  • clients always download via ORPorts on relays and authorities, for security
  • relays always download via DirPorts on authorities, to avoid SSL CPU load on a small number of machines
  • relays always download via ORPorts on other relays, for security (CPU load doesn't matter that much)
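
As a sketch, the three port rules above reduce to one small decision function. The string labels and the `choose_port` name are hypothetical.

```python
def choose_port(downloader, server):
    """Pick the port type for a directory download.

    downloader: "client" or "relay"; server: "relay" or "authority".
    """
    if downloader not in ("client", "relay"):
        raise ValueError(f"unknown downloader: {downloader!r}")
    if server not in ("relay", "authority"):
        raise ValueError(f"unknown server: {server!r}")
    if downloader == "relay" and server == "authority":
        # Avoids SSL CPU load on the small number of authorities.
        return "DirPort"
    # Every other combination uses ORPorts, for security.
    return "ORPort"
```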

We could design a new directory download module to implement this logic, using pieces of the existing modules, but with a cleaner, high-level interface:

  • request a download of a particular directory document, or set of directory documents
  • pass a download configuration with:
    • an optional directory cache,
    • ORPort / DirPort preference,
    • number of permitted concurrent connections,
    • relay and authority initial delays,
    • status and completion handlers.
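
To make the shape of that interface concrete, here is a rough sketch. Every name and default value is hypothetical, and the transport is injected as a plain function so the example stays self-contained; a real module would be asynchronous.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class DownloadConfig:
    directory_cache: Optional[str] = None   # optional directory cache to prefer
    port_preference: str = "ORPort"         # "ORPort" or "DirPort"
    max_concurrent: int = 3                 # number of permitted concurrent connections
    relay_initial_delay: float = 0.0        # hypothetical default (seconds)
    authority_initial_delay: float = 10.0   # hypothetical default (seconds)
    on_status: Optional[Callable[[str, int], None]] = None      # (document, bytes so far)
    on_complete: Optional[Callable[[str, bytes], None]] = None  # (document, body)

def request_documents(documents, config, fetch):
    """Request each document and report through the configured handlers.

    `fetch(document, config)` is a stand-in transport that returns the body as bytes.
    """
    for document in documents:
        body = fetch(document, config)
        if config.on_status is not None:
            config.on_status(document, len(body))
        if config.on_complete is not None:
            config.on_complete(document, body)
```

Callers would only see the request call and the handlers; endpoint selection, delays, and concurrency would stay inside the module.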

Child Tickets

Ticket | Status | Owner | Summary | Component
#16844 | new | | Slow clients can't bootstrap because they expire their consensus fetch but then receive all the bytes from it anyway, making them expire their next fetch, putting them in a terrible loop | Core Tor/Tor
#21969 | new | asn | We're missing descriptors for some of our primary entry guards | Core Tor/Tor

Change History (10)

comment:1 Changed 4 months ago by irl

This looks a lot like Happy Eyeballs, which may have some useful considerations in it.

There are applications other than tor that download directory information, including alternative implementations, controller libraries and Tor Metrics tools. While defining how all of those tools should behave is going to be out-of-scope for this proposal I would appreciate it if we could highlight cases where something may not be appropriate for another tool to copy.

In a Tor Tech Report (section 3) we put together a design for an algorithm to archive directory information that might have some considerations that are relevant. It will probably be more useful for directory caches than clients. One of the stretch goals was to have the archive service also act as a directory cache in the network.

comment:2 in reply to:  1 Changed 4 months ago by teor

Replying to irl:

> This looks a lot like Happy Eyeballs, which may have some useful considerations in it.

You're right, there is probably some clever way to do IPv4/IPv6 with this proposal as well.

We will probably need a few different abstractions, so that the complexity doesn't overwhelm us.

Here are some useful abstractions:

  • a directory document request or ORPort connection request
  • an endpoint: an IPv4 ORPort, IPv6 ORPort, or IPv4 DirPort
    • a way of deciding which endpoints are permitted and preferred
  • a relay, with a list of endpoints on that relay
  • a list of endpoints to try
    • a way of deciding when to try the next endpoint
    • relay and authority initial delays
    • number of permitted concurrent connections
  • a connection to an endpoint
  • an in-progress download
    • a way of deciding which downloads are permitted (or cancelled) and preferred
    • priority connections
  • a way to check how the connections or downloads are progressing
    • status handlers
  • a way to know when connections or downloads are ready
    • completion handlers

We could have a basic, low-level interface that takes:

  • a directory document path or ORPort connection request
  • a list of endpoints,
  • a delay mode (or just use tor's default exponential mode),
  • the number of permitted concurrent connections (possibly per-stage: TCP, SSL, link, directory, download?),
  • a timeout,
  • a mode for killing connections on timeout (or just decide on a default, for example: kill stalled connections),
  • a completion handler

And a status handler that takes an in-progress request, and returns its progress.

Then we could build high-level interfaces that:

  • fetch a document from the local cache, or download that document and cache it
  • take a list of directory caches and authorities, and expand them into a list of endpoints
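
The second high-level interface (expanding relays into endpoints) might look roughly like this. The `Relay` and `Endpoint` types and their fields are hypothetical, and the addresses in any usage are documentation-only examples.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Endpoint:
    relay: str
    kind: str     # "IPv4 ORPort", "IPv6 ORPort", or "IPv4 DirPort"
    address: str
    port: int

@dataclass
class Relay:
    nickname: str
    ipv4: str
    or_port: int
    ipv6: Optional[str] = None
    ipv6_or_port: Optional[int] = None
    dir_port: Optional[int] = None

def expand_endpoints(relays, permitted):
    """Expand relays into a flat endpoint list, keeping only permitted kinds."""
    endpoints = []
    for r in relays:
        candidates = [Endpoint(r.nickname, "IPv4 ORPort", r.ipv4, r.or_port)]
        if r.ipv6 is not None and r.ipv6_or_port is not None:
            candidates.append(Endpoint(r.nickname, "IPv6 ORPort", r.ipv6, r.ipv6_or_port))
        if r.dir_port is not None:
            candidates.append(Endpoint(r.nickname, "IPv4 DirPort", r.ipv4, r.dir_port))
        endpoints.extend(e for e in candidates if e.kind in permitted)
    return endpoints
```

Deciding which endpoints are permitted then becomes a matter of passing a different `permitted` set, for example only ORPort kinds for clients.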

I'm still not sure I have the right abstractions, but I'm getting there. We should definitely hide a lot of the complexity in the module, so other modules don't have to worry about it.

comment:3 Changed 4 months ago by teor

I think we'll do well with the following abstractions:

  • a request for a specific resource
  • a pool of available connections
  • a connection to a specific endpoint
  • lifecycles to manage the request, pool, connection, and endpoint states
  • a prioritiser and scheduler (which can implement happy eyeballs) to make sure the pool is healthy, and get the resource from the pool
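
One small part of a Happy Eyeballs-style prioritiser is the ordering step: alternating address families so that a broken family cannot block every early attempt. The sketch below shows only that interleaving, with a hypothetical function name; it leaves out the connection attempt delays and address sorting that a full implementation would need.

```python
from itertools import zip_longest

def interleave_families(ipv6_endpoints, ipv4_endpoints):
    """Alternate IPv6 and IPv4 endpoints, starting with IPv6."""
    ordered = []
    for six, four in zip_longest(ipv6_endpoints, ipv4_endpoints):
        if six is not None:
            ordered.append(six)
        if four is not None:
            ordered.append(four)
    return ordered
```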

Building up the layers:

  • we have pools of Fallback Directory Mirrors, Authorities, Directory Guards, Guards, and Bridges
  • we have pools of ORPort and DirPort connections
  • we have a pool of directory requests for each kind of directory document
  • we have a pool of preemptive circuits for each kind of circuit

comment:4 Changed 4 months ago by teor

Owner: set to teor
Status: new → assigned

I should actually do this for sponsor 31, if I can.
Maybe it also affects sponsor 28, because some bridge clients have this issue as well.

comment:5 Changed 4 months ago by teor

Cc: gaba added
Sponsor: Sponsor31-can

I made a pad for Nick with one possible API design:
https://pad.riseup.net/p/wzpDF69wOw_yWLHo9Lcm-keep

Gaba, this ticket could help us fix bridge and client bootstrap issues, and v3 onion service reachability issues.
It involves some redesign and refactoring.
I'm not sure if it fits in Sponsor 31 or sponsor 28 or some other sponsor.

comment:6 Changed 4 months ago by teor

Keywords: sponsor31-maybe sponsor28-maybe added

comment:7 Changed 4 months ago by gaba

Keywords: sponsor31-maybe sponsor28-maybe removed
Sponsor: Sponsor31-can

comment:8 Changed 3 months ago by teor

Keywords: teor-backlog added
Owner: teor deleted
Sponsor: Sponsor31-can

These items are not on our roadmap, and they do not have a sponsor.
But I might do them some day.

comment:9 Changed 7 weeks ago by s7r

Still getting it often in Tor 0.4.1.2-alpha-dev on a service that is configured with NumEntryGuards = 3:

Sep 05 03:22:50.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for 1/3 of our primary entry guards (total microdescriptors: 6515/6541).
Sep 05 03:22:50.000 [notice] I learned some more directory information, but not enough to build a circuit: We're missing descriptors for 1/3 of our primary entry guards (total microdescriptors: 6515/6541).

If NumEntryGuards is > 1, and we have descriptors for at least 1 primary entry guard, shouldn't it be possible to still build circuits? It's OK to complain about not having it yet in the log file, but I think we should still be able to build circuits. This is the purpose why one would raise NumEntryGuards from 1 to some higher value, to gain better connectivity against greater probability of running into a hostile guard.

comment:10 in reply to:  9 Changed 7 weeks ago by teor

Replying to s7r:

> Still getting it often in Tor 0.4.1.2-alpha-dev on a service that is configured with NumEntryGuards = 3:

> Sep 05 03:22:50.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for 1/3 of our primary entry guards (total microdescriptors: 6515/6541).
> Sep 05 03:22:50.000 [notice] I learned some more directory information, but not enough to build a circuit: We're missing descriptors for 1/3 of our primary entry guards (total microdescriptors: 6515/6541).

> If NumEntryGuards is > 1, and we have descriptors for at least 1 primary entry guard, shouldn't it be possible to still build circuits? It's OK to complain about not having it yet in the log file, but I think we should still be able to build circuits. This is the purpose why one would raise NumEntryGuards from 1 to some higher value, to gain better connectivity against greater probability of running into a hostile guard.

Hey I think this comment was meant to go on #21969?
