Enumerate cases where we want to retry circuits, and correctly balance retry robustness with guard discovery
In Tor, and especially in the onion service subsystem, we have a bunch of situations where a circuit could fail and we ought to retry:
- the onion service attempting to connect to the rendezvous point
- the onion service attempting to publish an hsdesc to the hsdir
- the client attempting to reach the intro point to send its intro1 cell
If we retry too many times, we open ourselves up to new guard discovery attacks (see prop 247). If we retry too few times, we end up with robustness or reachability problems ("Tor doesn't work").
It would be nice to just design the single best retry algorithm, and then apply it to all cases. That way we do the hard design and analysis work once, and we don't end up with extra complexity when we combine multiple retry designs. But I think a single best design might not be possible -- compare the service-side hsdir case, where it might be best to wait a while before retrying, to the rend point case, where waiting a while before retrying is not so good. Maybe that argues that we can get away with two best designs, one in the "online, somebody's waiting on me" case, and the other in the "offline, let's get this done reliably but there's no immediate rush" case?
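To make the two-design idea a bit more concrete, here is a rough Python sketch, not a proposed implementation. Everything in it is hypothetical: the names (`Urgency`, `RetryPolicy`, `SITUATIONS`, `schedule`), the attempt caps, and the delays are placeholders picked for illustration. Putting the client intro point case in the "online" class is also my assumption. The hard cap on attempts stands in for whatever limit the guard discovery analysis ends up recommending.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Urgency(Enum):
    ONLINE = "online"    # somebody is waiting on the result
    OFFLINE = "offline"  # no immediate rush; reliability matters more


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int    # hard cap, to bound exposure to guard discovery
    base_delay_s: float  # wait before the first retry
    backoff: float       # multiplier applied to the delay on each later retry

    def delay_before(self, attempt: int) -> Optional[float]:
        """Seconds to wait before 1-based attempt number `attempt`.

        Returns None once the cap is exceeded, meaning "give up".
        """
        if attempt > self.max_attempts:
            return None
        if attempt == 1:
            return 0.0  # the first attempt is never delayed
        return self.base_delay_s * self.backoff ** (attempt - 2)


# Placeholder numbers only; the real values need the analysis described here.
POLICIES = {
    Urgency.ONLINE: RetryPolicy(max_attempts=3, base_delay_s=0.0, backoff=1.0),
    Urgency.OFFLINE: RetryPolicy(max_attempts=5, base_delay_s=60.0, backoff=2.0),
}

# The three situations from the list above, tagged by urgency (assumed tags).
SITUATIONS = {
    "service_to_rend_point": Urgency.ONLINE,
    "service_publish_hsdesc": Urgency.OFFLINE,
    "client_to_intro_point": Urgency.ONLINE,
}


def schedule(situation: str) -> List[float]:
    """Return the delay before each attempt, up to the policy's cap."""
    policy = POLICIES[SITUATIONS[situation]]
    delays = []
    attempt = 1
    while True:
        delay = policy.delay_before(attempt)
        if delay is None:
            return delays
        delays.append(delay)
        attempt += 1
```

With these placeholder values, `schedule("service_to_rend_point")` gives three immediate attempts, while `schedule("service_publish_hsdesc")` gives five attempts spread out with doubling delays. The point is only that every retry situation maps onto one of a small number of policies, so the analysis is done once per policy rather than once per situation.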
Suggested next step: We should write a proposal, with a section enumerating all of the retry situations that tor has; and a section enumerating what we can learn about where the circuit failures are, and how, and how reliable each is (network failure? last hop failure? guard failure?); and then a section trying to produce the smallest possible number of good designs such that every retry situation is handled well.