That job I linked to is some weird relay job. Relay jobs on mturk are a bad idea because people don't know the risks of what they are getting into. We should make a bridge-specific job for this.
The remaining question is whether tonga is hitting some other issue that prevents it from successfully testing more than a certain number of bridges. Supposedly it tests 1/256 of the keyspace every 10 seconds, which means roughly one connection every 3 seconds with the current bridge pool.
From the current logs, it looks like we're not hitting file descriptor limits, though eventually we might. We could also be hitting some other limit on the box that causes it to fail reachability testing for half of these bridges.
Err, two corrections. It's apparently 1/128 of the keyspace, and I believe all the connection attempts happen immediately in parallel rather than being spread out serially. So we've got a burst of 6-7 TCP connect attempts on tonga every 10 seconds...
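For reference, a rough back-of-envelope check of those numbers in Python. It assumes the bridges are spread evenly over the keyspace and that each 10-second round covers exactly one slice. The function names are made up for illustration and are not anything in Tor.

```python
ROUND_SECONDS = 10  # one slice of the keyspace is tested per round


def implied_pool_size(conns_per_round, slices):
    """Pool size implied by a per-round burst, assuming an even spread."""
    return conns_per_round * slices


def full_pass_minutes(slices, round_seconds=ROUND_SECONDS):
    """Minutes needed to walk the whole keyspace once, one slice per round."""
    return slices * round_seconds / 60


# Burst of 6-7 connects per round at 1/128 of the keyspace.
low = implied_pool_size(6, 128)
high = implied_pool_size(7, 128)

# Earlier estimate: one connect every 3 seconds at 1/256 of the keyspace.
old_estimate = implied_pool_size(ROUND_SECONDS / 3, 256)

print(low, high, round(old_estimate))
print(round(full_pass_minutes(128), 1))
```

Under those assumptions both estimates point at a pool of somewhere around 770-900 bridges, and each bridge would get retested about every 21 minutes.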
I think this means the next stat to look at is the failure rate for each round of these reachability tests. If it is always 2 or 3 out of 6, then we might be on to something.
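A minimal sketch of that tally in Python, assuming we can get one (timestamp, succeeded) pair per connect attempt from somewhere. That input format is hypothetical, as is the sample data, which is made up and not taken from tonga's logs.

```python
from collections import defaultdict


def failure_rate_per_round(results, round_seconds=10):
    """Group connect attempts into rounds and count failures.

    results: iterable of (unix_timestamp, succeeded) pairs, one per attempt.
    Returns {round_start: (failed, total)}, ordered by round start.
    """
    rounds = defaultdict(lambda: [0, 0])  # round_start -> [failed, total]
    for ts, ok in results:
        start = int(ts) - int(ts) % round_seconds
        rounds[start][1] += 1
        if not ok:
            rounds[start][0] += 1
    return {start: tuple(counts) for start, counts in sorted(rounds.items())}


# Made-up sample data for two rounds of six attempts each.
sample = [
    (100, True), (100, False), (101, True),
    (102, False), (103, True), (104, False),
    (110, True), (110, True), (111, False),
    (112, False), (113, True), (114, True),
]

for start, (failed, total) in failure_rate_per_round(sample).items():
    print(start, f"{failed}/{total}")
```

If the failed count sits at 2 or 3 in nearly every round regardless of which bridges are in the slice, that would point at a per-burst limit on the box rather than at the bridges themselves.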
Hrmm. rep_hist_note_router_unreachable() is only called twice an hour. I guess this is because it's only called during NS/vote generation. Bleh. Info logs might not have what we need for this.