Opened 7 years ago

Closed 6 years ago

#6783 closed defect (fixed)

should not serve old v2 statuses

Reported by: weasel
Owned by:
Priority: High
Milestone: Tor: 0.2.5.x-final
Component: Core Tor/Tor
Version: Tor: 0.2.3.20-rc
Severity:
Keywords: tor-auth
Cc: mikeperry, nickm, andrea
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

When a v2 directory goes away, other Tor authorities keep serving their cached copy of the v2 status document of that directory.

If that status document is old, clients (or possibly other relays?) download it, realize it's too old, and try to download it again, ad infinitum.

While tor26 was serving dizum's two-day-old status document, it was completely swamped. It had thousands of directory requests open at a time; they were consuming all the bandwidth and memory, and it didn't even get to properly participate in consensus building.

Removing dizum's old status document from the cache and restarting tor26 made it happy. It now says 404 and clients don't come back (or if they do, at least it's a cheap 404 and not "here's 100k you'll throw away immediately, have it as often as you want").

I think we should stop serving expired status documents.

Or maybe we should stop serving them entirely. If we still need them between authorities, let's move them to a different URL.

Change History (34)

comment:1 Changed 7 years ago by nickm

Milestone: Tor: 0.2.4.x-final

Who the heck is downloading v2 networkstatuses? Only authorities should be doing that.

So, I'd like to finally implement proposal 147 in the 0.2.4.x series, which should be the final nail in the coffin of any need for v2 networkstatuses. With that in mind, let's figure out if there's some minimal version of this that will work in the meantime to prevent a recurrence of the issue you describe. No need for anything fancy, since we're going to be removing the need entirely.

I think "no longer serve expired ones" is just fine for now; changing the URL on the other hand would break authorities that haven't upgraded to know the new URL.

Marking for 0.2.4.x, unless this is vital enough for 0.2.3.

comment:2 in reply to:  description Changed 7 years ago by arma

Replying to weasel:

If we still need them between authorities

I think we don't need them between authorities anymore. As Nick says, proposal 147 would be nice. But git commits 2e692bd8c9 and eaf5487d95 (in Tor 0.2.2.12-alpha) made authorities look at v3 votes and fetch descriptors that are new to them:

    - Many relays have been falling out of the consensus lately because
      not enough authorities know about their descriptor for them to get
      a majority of votes. When we deprecated the v2 directory protocol,
      we got rid of the only way that v3 authorities can hear from each
      other about other descriptors. Now authorities examine every v3
      vote for new descriptors, and fetch them from that authority. Bugfix
      on 0.2.1.23.

and I think that is basically the "minimal version that will work in the meantime" that Nick wants.

comment:3 Changed 7 years ago by mikeperry

Cc: mikeperry added

FYI: Because of massive amounts of legacy badcode fail, I have begun restricting dirport requests to only the directory authorities. I am willing to whitelist legitimate legacy services by IP. Adding myself to Cc in case this actually matters for anyone who cares.

comment:4 in reply to:  3 Changed 7 years ago by arma

Replying to mikeperry:

FYI: Because of massive amounts of legacy badcode fail, I have begun restricting dirport requests to only the directory authorities. I am willing to whitelist legitimate legacy services by IP. Adding myself to Cc in case this actually matters for anyone who cares.

Wait, what? Almost all relays use your dirport to ask you for the consensus or descriptors that you tell them about. That's what your dirport is for.

comment:5 Changed 7 years ago by mikeperry

I can whitelist all relays for now. If the stopgap matters more to us than fixing the root issue, I can cronjob it.

comment:6 in reply to:  1 Changed 7 years ago by arma

Replying to nickm:

Who the heck is downloading v2 networkstatuses? Only authorities should be doing that.

I believe there are some old Tor relays out there who still go to the authorities for v2 status documents. Heck, they might even be Tor clients, if they're old enough.

I believe we made dir mirrors stop mirroring v2 a while ago. Before we dump v2 statuses entirely, it might be wise to look through the old code to see what they would do. (Another option is to make a plan for how to measure how it's going, and then dump them and measure how it's going.)

comment:7 Changed 7 years ago by nickm

Keywords: maybe-proposal added

comment:8 Changed 7 years ago by nickm

Keywords: tor-auth added

comment:9 Changed 7 years ago by nickm

Component: changed from Tor Directory Authority to Tor

comment:10 Changed 7 years ago by weasel

Whenever dizum goes down, the remaining authorities seem to get DoSed.

It would be nice to fix this - or add a workaround for this - sooner rather than later.

comment:11 Changed 7 years ago by arma

Cc: nickm andrea added
Priority: changed from major to critical

weasel is being polite in his choice of words, but he's totally right: this is a serious hassle and threat to the network.

Nick/Andrea, can you fit either "debug this" or "make it easier for others to debug stuff like this" into your medium-term schedule?

comment:12 in reply to:  11 Changed 7 years ago by nickm

Replying to arma:

weasel is being polite in his choice of words, but he's totally right: this is a serious hassle and threat to the network.

Nick/Andrea, can you fit either "debug this" or "make it easier for others to debug stuff like this" into your medium-term schedule?

I don't have much of a clue for what the "debugging" would be here. It would be very easy to implement "serve no authority's v2 status document but your own" some time this week. Would that be a fix here? If so, it's like a 20-line patch at most.

comment:13 Changed 7 years ago by nickm

To be concrete, I've put a few possible things up at "bug6783_maybe" in my public repository. They're untested as heck. Under some characterizations of the problem above, they'll solve it. Under others, they won't. What do you think?

comment:14 Changed 7 years ago by weasel

Is that easier than not serving expired statuses at all?

Not serving expired statuses would perhaps be a smaller change to the behavior of the network when all is working properly.

comment:15 Changed 7 years ago by nickm

I was having a hard time tracking down the definition of expiry for v2 networkstatuses. Is it "24 hours"? According to dir-spec-v2 it is, but it looks like Tor has defined MAX_NETWORKSTATUS_AGE as "240 hours" since at least 0.1.2.x.

I tried changing the rule to "Don't serve networkstatuses older than X hours" (for X=2) in 94b6d1d7e60e933e84c52c3335e58be851958bb1: we could change X to whatever the critical interval is, if we can figure out what it should be.

I'm not sure which of these if any is "easier".
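For readers following along, the "don't serve networkstatuses older than X hours" rule can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in the commit above: the names `V2_STATUS_MAX_SERVE_AGE` and `v2_status_is_servable` are hypothetical, and X=2 is simply the value tried in this comment.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical cutoff (not a real Tor constant): X = 2 hours, in seconds. */
#define V2_STATUS_MAX_SERVE_AGE (2 * 60 * 60)

/* Hypothetical helper: return 1 if a cached v2 status published at
 * `published_on` is still young enough to serve at time `now`, else 0. */
int
v2_status_is_servable(time_t published_on, time_t now)
{
  assert(V2_STATUS_MAX_SERVE_AGE > 0);
  /* A status "from the future" (clock skew) counts as age zero here. */
  long age = (now > published_on) ? (long)(now - published_on) : 0;
  return age <= V2_STATUS_MAX_SERVE_AGE;
}
```

Under this sketch, a two-day-old status like the one in the description would be refused, so the requester would get a cheap 404 instead of a stale document.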

comment:16 Changed 7 years ago by arma

Belief 1: The end-goal here is to dump everything related to v2 statuses -- don't generate them, don't serve them, don't store them, etc.

Belief 2: The only reason we haven't already is that we don't know what chaos will ensue from old Tors that freak out when things change.

So we could do some change that we think is less likely to cause chaos (I think "Don't serve networkstatuses older than X hours" is such a change).

Or we could do a more serious change ("stop serving v2 statuses").

In either case, we'll want to see if chaos ensues, and have a simple way of rolling back if it does.

I'm inclined to try the more serious change, in case it works.

comment:17 Changed 7 years ago by nickm

Okay, so here's a possible plan to do what Roger prefers:

1) Implement a patch with an option to disable serving V2 directory information entirely.

2) Try a test network where there are v2 authorities, *and* clients and servers running older versions of Tor, with that option. Try enabling that option for a subset of the v2 authorities, and for all of them. Monitor the load on the authorities, and the behavior of the clients, to see if anybody's hosing anybody else.

3) If that doesn't explode, merge that patch into master, and have the remaining v2 authorities try it out all at once. This WILL require coordination, so that y'all can turn the option off in a hurry if it turns out to DoS the network in a way we didn't anticipate.

Plausible?

comment:18 Changed 7 years ago by nickm

Keywords: maybe-proposal removed

(I'm okay with doing this testing after the 10 Dec deadline, since the actual amount of code changes is small, and authority stuff doesn't have to be quite as stable quite as fast. That said, this is a *tricky* feature, and we shouldn't let it wait indefinitely.)

comment:19 in reply to:  17 Changed 7 years ago by nickm

Replying to nickm:

Okay, so here's a possible plan to do what Roger prefers:

1) Implement a patch with an option to disable serving V2 directory information entirely.

This part is done as branch "bug6783_big_hammer"; it makes v2 directory requests get a 404 when "DisableV2DirectoryInfo_" is set. This isn't meant to be a permanent thing; if we disable v2 directory info, we can disable it more thoroughly than this. This is just to test the effects of doing so.
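As a rough illustration of the behavior described here (not the contents of the branch itself), the decision could be sketched as follows. The function name and its parameters are hypothetical; only the option name "DisableV2DirectoryInfo_" and the 404 response come from this ticket.

```c
#include <assert.h>

/* Hypothetical sketch: choose the HTTP status code for a v2 networkstatus
 * request. `disable_v2_dir_info` stands in for the DisableV2DirectoryInfo_
 * option; `have_cached_status` says whether a matching document exists. */
int
v2_status_response_code(int disable_v2_dir_info, int have_cached_status)
{
  assert(disable_v2_dir_info == 0 || disable_v2_dir_info == 1);
  if (disable_v2_dir_info)
    return 404; /* Option set: refuse every v2 request cheaply. */
  return have_cached_status ? 200 : 404;
}
```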

If anybody can find time to try testing the behavior of this patch as described above, that would rock.

If anybody tries this on the public network without first doing it on a test network (for long enough to see what happens when NOBODY has a valid v2 networkstatus, and with all configurations discussed above), or without first making sure that the v2 authority operators are coordinated to notice problems and turn it back on FAST, they will be doing a risky thing. Watch out!

comment:20 Changed 7 years ago by nickm

Milestone: changed from Tor: 0.2.4.x-final to Tor: 0.2.5.x-final

We can do this early in the 0.2.5 branch. It's important, but:

  • none of the authorities seem to be eager to test it
  • we don't know what it'll actually do
  • the authorities upgrade more aggressively than clients and relays
  • authority-only backports are less scary

So let's try it out on a heterogeneous test network for a little while first to see what explodes.

comment:21 Changed 7 years ago by nickm

Milestone: changed from Tor: 0.2.5.x-final to Tor: 0.2.4.x-final
Status: changed from new to needs_review

weasel hit the obnoxious behavior again. Can nobody test this???

Please, please review.

comment:22 Changed 7 years ago by nickm

(I am okay with merging this into 0.2.4 if andrea or arma is also okay with it, now that it's only enableable on authorities. But it still needs review.)

comment:23 Changed 7 years ago by weasel

When merging this with 0.2.4, apart from the obvious conflict that git points out (config stuff changed in both trees), this also needs the following to build:

--- a/src/or/directory.c
+++ b/src/or/directory.c
@@@@@ -2809,7 -2809,7 -2809,7 -2809,7 +2809,7 @@@@@ directory_handle_command_get(dir_connec
          char *m;
          write_http_status_line(conn, 404, "Not found");
          smartlist_free(dir_fps);
----      geoip_note_ns_response(GEOIP_REJECT_NOT_FOUND);
++++      geoip_note_ns_response(act, GEOIP_REJECT_NOT_FOUND);
          if ((m = rate_limit_log(&reject_v2_ratelim, approx_time()))) {
            log_notice(LD_DIR, "Rejected a v2 networkstatus request.%s", m);
            tor_free(m);

comment:24 Changed 7 years ago by weasel

It appears self testing dir port reachability might have broken:
Mar 06 21:42:43.000 [warn] Your server (86.59.21.38:80) has not managed to confirm that its DirPort is reachable. Please check your firewalls, ports, address, /etc/hosts file, etc.

comment:25 Changed 7 years ago by arma

I've been running it on moria1 the past few days, and it seems pretty happy.

It's rejecting a bunch of requests. But the process isn't bloating like it was before that. (When dizum went down, moria1 went from around 600MB to around 7000MB in process space.)

comment:26 in reply to:  24 Changed 7 years ago by arma

Replying to weasel:

It appears self testing dir port reachability might have broken:
Mar 06 21:42:43.000 [warn] Your server (86.59.21.38:80) has not managed to confirm that its DirPort is reachable. Please check your firewalls, ports, address, /etc/hosts file, etc.

I have no idea why this might be the case. moria1 found itself reachable. Maybe your dirport is saturated with requests, and one of your firewall rules to defend it ends up dropping some?

Has it, 4 days later, still failed to find its dirport reachable?

comment:27 Changed 7 years ago by arma

Mar 10 16:29:13.107 [notice] Rejected a v2 networkstatus request. [25091 similar message(s) suppressed in last 1800 seconds]
...
Mar 10 16:59:13.425 [notice] Rejected a v2 networkstatus request. [23134 similar message(s) suppressed in last 1800 seconds]
...
Mar 10 17:29:13.562 [notice] Rejected a v2 networkstatus request. [23162 similar message(s) suppressed in last 1800 seconds]

That's about 13 requests a second. Not so bad I guess. On the flip side, that's a lot of old unhappy clients.
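The arithmetic behind that estimate can be checked with a small sketch. The helper name is hypothetical; the inputs are the suppressed-message counts and the 1800-second window from the log lines above.

```c
#include <assert.h>

/* Hypothetical helper: average requests per second, given the number of
 * suppressed log messages reported for a window of `window_seconds`. */
double
rejected_requests_per_second(long suppressed, long window_seconds)
{
  assert(window_seconds > 0);
  return (double)suppressed / (double)window_seconds;
}
```

For the three log lines above, 25091, 23134, and 23162 suppressed messages over 1800 seconds work out to roughly 13.9, 12.9, and 12.9 rejected requests per second.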

comment:28 Changed 7 years ago by arma

I pushed a bug6783_big_hammer to my git repo, which has the cleaned up, rebased, and now-compiling-and-conflict-free version of nickm's branch, all set for merging into maint-0.2.4.

Oh, and also I turned the feature off by default so Tor would start for non-authorities. That means weasel and I will want to set the torrc option when we upgrade.

I think it's fine to merge, and I'd like to get it into the upcoming 0.2.4 release.

comment:29 Changed 7 years ago by nickm

Status: changed from needs_review to new

Scary! Let's give it a shot. Merging this. I'm putting this ticket back in "new" for now, until we can figure out which follow-up tickets to implement. Suggestions:

  • Stop trying to download & cache v2 directory info (assuming that we still try)
  • Everybody stops serving v2 directory info, even caches
  • Forget about v2 directory info entirely

comment:30 Changed 7 years ago by rransom

Remember to remove the trailing underscore from the option name, and document it in the man page.

comment:31 in reply to:  30 Changed 7 years ago by arma

Replying to rransom:

Remember to remove the trailing underscore from the option name, and document it in the man page.

I'm hoping we'll deprecate the config option before 0.2.4 goes stable. It's mainly a way to back out quickly in an emergency, rather than a new feature we want to support.

comment:32 in reply to:  26 Changed 7 years ago by weasel

It appears self testing dir port reachability might have broken:

I have no idea why this might be the case. moria1 found itself reachable. Maybe your dirport is saturated with requests, and one of your firewall rules to defend it ends up dropping some?

Has it, 4 days later, still failed to find its dirport reachable?

It still hasn't found itself reachable, but that shouldn't matter much for an authority AIUI. It's probably, as you guess, my iptables rules dropping some requests - I get about 200 connection attempts a second, but only accept a fraction of those.

comment:33 Changed 7 years ago by nickm

Milestone: changed from Tor: 0.2.4.x-final to Tor: 0.2.5.x-final
Priority: changed from critical to major

Okay. The "big hammer" version got merged into 0.2.4.11-alpha; I am calling the remainder of this a "clean up and analyze" set of tasks that 0.2.5 can do.

If there are further tasks on this that need to go into 0.2.4, please open new tickets. For 0.2.5, we should also favor new tickets.

comment:34 Changed 6 years ago by nickm

Resolution: fixed
Status: changed from new to closed

In 0.2.5, we just ripped out all the v2 directory code with #10758. I think that closes this ticket. Please reopen if I'm being silly.
