(Sorry for multiple comments, I kept thinking about this ticket today.)
Can you describe the design of this new feature?
The code doesn't match the rough design in the ticket description. That's ok, but without a design, I can't tell the difference between a bug and a feature. In particular, I wonder why we need separate "was_disabled" and "is_disabled" variables.
This IPv6 DNS code is currently unused, so it has never been tested. So I want to make sure we have the design right.
Here are some issues I noticed when reading the code:
the minimum number of queries before failure is 10. But that could happen by chance, on server startup. Let's make the minimum something more reasonable. We can make it at least 1000. But maybe we should set it to 1 when TestingTorNetwork is set. That way, broken IPv6 exits will fail quickly in chutney.
We should find out which DNS errors can be triggered by tor clients, and ignore them. Otherwise, a client that floods an exit with bad DNS queries could disable IPv6 exiting on that relay. I think Nick might be able to help here.
I think it's ok to fail thousands of client circuits, before an IPv6 exit disables IPv6. Because getting the new descriptor to clients can take an hour or two. There's also a tradeoff here: we want quiet exits to disable IPv6 eventually. But we want busy exits to survive a momentary glitch.
When dns_is_broken_for_ipv6 is set to 1, we launch a timer for 24 hours to reset this to 0
While dns_is_broken_for_ipv6 = 1, the relay will not advertise an exit policy.
After 24 hours, the timer will reset to 0
While my PR is not ready for review yet (need tests), I have increased the minimum number of queries to 1000, and on TestingTorNetwork decreased it to 1.
There is one problem with unit tests: The code this adds relies on code that calls the relay module, and some compilations don't include this, so that's why the tests fail.
I don't have time to keep on reviewing this patch right now. I'm really busy with google summer of code and outreachy. So I'm going to pass it to another reviewer.
Which errors should we turn off IPv6 DNS for? All of them? Only the ones that clients can't trigger?
the minimum number of queries before failure is 10. But that could happen by chance, on server startup. Let's make the minimum something more reasonable. We can make it at least 1000. But maybe we should set it to 1 when TestingTorNetwork is set. That way, broken IPv6 exits will fail quickly in chutney.
We should find out which DNS errors can be triggered by tor clients, and ignore them. Otherwise, a client that floods an exit with bad DNS queries could disable IPv6 exiting on that relay. I think Nick might be able to help here.
We also need to think about the risk of DNS-based attacks.
I think it's ok to fail thousands of client circuits, before an IPv6 exit disables IPv6. Because getting the new descriptor to clients can take an hour or two. There's also a tradeoff here: we want quiet exits to disable IPv6 eventually. But we want busy exits to survive a momentary glitch.
Overall, I wonder if this patch is the best way to solve this issue. Perhaps we should manually apply the BadExit flag through the network health team. Perhaps we should set the limits much, much higher.
Do we know how many queries a busy exit processes? And how many timeouts they have?
It's really hard to make a good design without good data.