Try cranking up cbttestfreq consensus param, to see if it helps the current overload
In Tor 0.3.1.1-alpha, commit d5a151a, we switched:
-#define CBT_DEFAULT_TEST_FREQUENCY 60
+#define CBT_DEFAULT_TEST_FREQUENCY 10
And on May 20 2017 the dir auths set the cbttestfreq consensus param to 10 as well.
Right now the network is overloaded with create cells, from the millions of new clients that showed up in the past weeks.
Hypothesis 1: most of these clients are in learning mode much of the time, so 5 million clients each launching a circuit every 10 seconds = 500k new create requests per second launched at the network, which contributes to the overload.
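The arithmetic behind hypothesis 1 can be checked with a rough sketch. It assumes every client is in learning mode and launches exactly one test circuit per cbttestfreq period; the function name is made up for illustration and nothing here is taken from the Tor source.

```python
def learning_mode_circuit_rate(num_clients: int, test_freq_seconds: int) -> float:
    """Aggregate circuits per second, assuming each client launches
    one test circuit every test_freq_seconds (an assumption, see above)."""
    if test_freq_seconds <= 0:
        raise ValueError("test_freq_seconds must be positive")
    return num_clients / test_freq_seconds


CLIENTS = 5_000_000  # rough client count used in hypothesis 1

# Current value (10) versus the old compiled-in default (60).
for freq in (10, 60):
    rate = learning_mode_circuit_rate(CLIENTS, freq)
    print(f"cbttestfreq={freq}: {rate:,.0f} circuits/s")
```

With cbttestfreq at 10 this gives the 500k per second figure above; the old default of 60 would give about a sixth of that.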
Hypothesis 2: some of these clients have learned quite low timeouts, causing them to generate many circuits which they then almost immediately cancel, but not enough of their circuits fail that they back away from their learned value.
Hypothesis 3: the clients are stuck in a sad loop: they learn a low cbt value, generate circuits for a while that mostly time out, eventually give up on their cbt value, then generate a circuit every 10s until they re-learn a low cbt value, and the cycle repeats.
The experiment here (set cbttestfreq to 600 seconds temporarily) should help us test these hypotheses. For hypothesis 1, we will immediately reduce the load of new circuits, since each learning-mode client would launch a test circuit every 600 seconds instead of every 10. For hypothesis 2, this will help more slowly, because we'll have to wait for each client to hit a situation where 90%+ of its circuit attempts are being timed out, but in theory clients will slowly shift from having a too-aggressive cbt back into learning mode. And for hypothesis 3, we'll push most clients to the "learning, but very slowly" phase of their sad loop.
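To get a feel for how much the experiment could reduce learning-mode load, here is a small sketch comparing cbttestfreq values of 10 and 600. The fractions of clients in learning mode are hypothetical placeholders, not measurements, and the model (one test circuit per period per learning client) is an assumption.

```python
def learning_load(num_clients: int, learning_fraction: float,
                  test_freq_seconds: int) -> float:
    """Test circuits per second from the clients that are in learning mode."""
    return num_clients * learning_fraction / test_freq_seconds


CLIENTS = 5_000_000

# Hypothetical fractions of clients in learning mode at any given time.
for fraction in (1.0, 0.5, 0.1):
    before = learning_load(CLIENTS, fraction, 10)
    after = learning_load(CLIENTS, fraction, 600)
    print(f"learning fraction {fraction:.0%}: "
          f"{before:,.0f}/s at 10s -> {after:,.0f}/s at 600s")
```

Whatever the fraction, the learning-mode load drops by the same factor of 600/10 = 60 in this model, so a much smaller observed drop would suggest that learning-mode circuits are not the main source of create cells.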
We can use the notice-level heartbeat messages in relay logs to discover whether the total number of create cells goes down dramatically. If it does, win: we confirmed one or more of these hypotheses, and we can make a plan from there. If it doesn't, also win: we know we need to look elsewhere.
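A relay operator could track this with a short script along the following lines. This is a sketch that assumes the heartbeat line looks like "Circuit handshake stats since last time: A/B TAP, C/D NTor" and that the second number in each pair is the count of requested handshakes; check the exact wording and meaning against your own relay's log before relying on it. The sample lines and numbers are invented for illustration.

```python
import re

# Assumed heartbeat format; verify against a real relay log.
HANDSHAKE_RE = re.compile(
    r"Circuit handshake stats since last time: "
    r"(\d+)/(\d+) TAP, (\d+)/(\d+) NTor"
)


def requested_handshakes(log_lines):
    """Return total requested handshakes (TAP + NTor) per heartbeat line."""
    totals = []
    for line in log_lines:
        match = HANDSHAKE_RE.search(line)
        if match:
            _tap_done, tap_req, _ntor_done, ntor_req = map(int, match.groups())
            totals.append(tap_req + ntor_req)
    return totals


# Invented sample lines, one heartbeat before and one after the change.
sample_log = [
    "[notice] Circuit handshake stats since last time: 900/1000 TAP, 4000/5000 NTor.",
    "[notice] Heartbeat: some unrelated line.",
    "[notice] Circuit handshake stats since last time: 100/100 TAP, 700/700 NTor.",
]

print(requested_handshakes(sample_log))
```

Comparing the per-heartbeat totals from before and after the consensus change, on several relays, would show whether the drop is dramatic or negligible.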