Opened 22 months ago

Last modified 17 months ago

#21345 new task

Do relays count dir reqs as completed before they're complete?

Reported by: arma Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: triaged-out-20170308 tor-relay directory metrics
Cc: teor, karsten, dgoulet Actual Points:
Parent ID: Points: 3
Reviewer: Sponsor:

Description

I've been eyeing the spike in .ae users, and I proposed to dgoulet that directory mirrors are mis-counting the number of completed consensus fetches. In particular, I proposed that they are counting consensus fetches as complete before they actually complete.

This ticket is for documenting my investigation of whether I think this misbehavior is really happening.

Child Tickets

Change History (10)

comment:1 Changed 22 months ago by arma

Here was the original write-up that I did on my airplane this weekened. I have since come to question whether it's actually true, i.e. whether there's actually a bug.

Directory mirrors include a dirreq-v3-reqs line in their extrainfo descriptors, including a count of how many consensus fetches were successfully delivered to users.

We decide it's completed when geoip_change_dirreq_state() gets called and either of these is true:

  if ((type == DIRREQ_DIRECT &&
         new_state == DIRREQ_FLUSHING_DIR_CONN_FINISHED) ||
      (type == DIRREQ_TUNNELED &&
         new_state == DIRREQ_CHANNEL_BUFFER_FLUSHED)) {

That first one (direct) happens from connection_dir_finished_flushing() which is called from connection_finished_flushing() which is called from connection_handle_write_impl() which only calls finished_flushing if there's still nothing to flush after it called flushed_some, and it's complicated, but I think connection_dirserv_add_networkstatus_bytes_to_outbuf() will keep adding more bytes to the outbuf until there aren't any more to add. So long story short, for the DIRREQ_DIRECT case, I think we don't mark it completed until the last byte gets flushed from the directory conn. That's good.

That second one (tunneled) is where the problem is. It gets called from channel_notify_flushed(), which gets called from connection_handle_write_impl() whenever buf_datalen(outbuf) is 0 after flush_buf_tls() finishes. So, the directory request gets marked as completed as soon as the first 16384 (DIRSERV_BUFFER_MIN) bytes get flushed onto the linked TLS conn. And on Linux, the kernel is happy to lie to you about the first several hundred KB of data, telling you it flushed when it hasn't done anything but get moved to kernel buffers.

tl;dr, we mark a tunneled networkstatus directory response as completed as soon as we start answering it.

comment:2 Changed 22 months ago by arma

Here's the patch that I applied to my directory mirror to see whether the above behavior is happening the way I expected:

diff --git a/src/or/dirserv.c b/src/or/dirserv.c
index 71dcbb6..595ca0c 100644
--- a/src/or/dirserv.c
+++ b/src/or/dirserv.c
@@ -3655,9 +3655,13 @@ connection_dirserv_add_dir_bytes_to_outbuf(dir_connection_t *conn)
                             bytes, TO_CONN(conn));
   }
   conn->cached_dir_offset += bytes;
+  log_notice(LD_DIR, "Added %d bytes to dirreq %ld",
+             (int)bytes, conn->dirreq_id);
   if (conn->cached_dir_offset == (int)conn->cached_dir->dir_z_len) {
     /* We just wrote the last one; finish up. */
     connection_dirserv_finish_spooling(conn);
+    log_notice(LD_DIR, "Declaring that dirreq_id %ld is complete",
+               conn->dirreq_id);
     cached_dir_decref(conn->cached_dir);
     conn->cached_dir = NULL;
   }
diff --git a/src/or/geoip.c b/src/or/geoip.c
index 74811ea..afc04c9 100644
--- a/src/or/geoip.c
+++ b/src/or/geoip.c
@@ -817,6 +817,7 @@ geoip_change_dirreq_state(uint64_t dirreq_id, dirreq_type_t type,
          new_state == DIRREQ_FLUSHING_DIR_CONN_FINISHED) ||
       (type == DIRREQ_TUNNELED &&
          new_state == DIRREQ_CHANNEL_BUFFER_FLUSHED)) {
+    log_notice(LD_DIR, "Marked dirreq_id %ld as completed.", dirreq_id);
     tor_gettimeofday(&ent->completion_time);
     ent->completed = 1;
   }

comment:3 Changed 22 months ago by arma

Using the directory mirror as my bridge, and attaching a fresh client to it, I see:

Jan 29 00:29:50.742 [notice] Added 16185 bytes to dirreq 5
Jan 29 00:29:50.742 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:50.742 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:50.742 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:50.742 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:50.742 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:50.742 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:50.742 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:50.743 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:50.743 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:50.743 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:50.743 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.112 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.112 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.112 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.112 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.112 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.112 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.157 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.157 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.157 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.210 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.210 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.246 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.300 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.300 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.302 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.334 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.334 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.389 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.397 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.397 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.397 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.424 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.424 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.479 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.491 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.491 [notice] Added 16384 bytes to dirreq 5
Jan 29 00:29:51.491 [notice] Added 5073 bytes to dirreq 5
Jan 29 00:29:51.491 [notice] Declaring that dirreq_id 5 is complete

So actually, hey, it looks like it's behaving as intended!

comment:4 Changed 22 months ago by arma

Insight: if the compressed consensus fit inside one stream window (250KB), we would behave in the bad way, i.e. we would mark the consensus fetch as complete as soon as we start answering it.

But as it is, it looks like we require at least some stream sendme cells in response before we'll call it done.

I started my client with "bandwidthrate 10000 bandwidthburst 10000", and confirmed that the directory mirror marked the dirreq as complete well before my client was finished receiving the consensus, but also well after the consensus download fetch began.

That behavior also makes me think that an edge case remains, where a client initiates the request, gets the first round of data, sends off its sendme cells, and then has the connection cut by the censor.

But at present the compressed consensus size is around 627000 bytes, which is three windows worth of stream data. So the client would need to send its round of sendmes, receive some more bytes, and send a second round of sendmes, before the dir mirror will mark the consensus fetch as complete. That's a good start!

comment:5 Changed 22 months ago by arma

Ah ha. Current theory for why it works as intended: geoip_change_dirreq_state() has to be called in every intermediate step before DIRREQ_CHANNEL_BUFFER_FLUSHED, in order, else it will never run the clauses at the end:

  if (new_state == DIRREQ_IS_FOR_NETWORK_STATUS)
    return;
  if (new_state - 1 != ent->state)
    return;
  ent->state = new_state;

So that's a call with new_state DIRREQ_FLUSHING_DIR_CONN_FINISHED, and another with new_state DIRREQ_END_CELL_SENT, and another with new_state DIRREQ_CIRC_QUEUE_FLUSHED, and only then will it care when it receives a call with new_state DIRREQ_CHANNEL_BUFFER_FLUSHED. Sneaky.

comment:6 Changed 22 months ago by arma

Cc: teor karsten dgoulet added

I am on track to closing this ticket as not-a-bug / works-for-me, but I'm going to leave it open for a bit for others to follow my logic if they want. cc'ing dgoulet and karsten and teor as the usual suspects who might care.

Another plan before closing might be to open a ticket for clean-up and documentation of this internal state machine, since any edge cases where we don't call *every* state change will result in not counting the consensus fetch as complete.

Last edited 22 months ago by arma (previous) (diff)

comment:7 Changed 22 months ago by dcf

arma, these other tickets might be relevant to your investigation.

comment:13:ticket:21014: Turkey graphs increased during a Tor block. Hypothesis is that interrupted downloads are inflating the perceived number of directory requests.

#18203 is a proposal to base direct-user counts on directory responses, rather than directory requests. Doing that might solve this overcounting issue. (Apparently bridge counts are already based on responses.)

#18203: has one reason why bridges and relays could be counted differently.

We're using the number of responses to estimate bridge users but the number of requests to estimate direct users.

comment:8 Changed 21 months ago by nickm

Points: 3

comment:9 Changed 21 months ago by nickm

Keywords: triaged-out-20170308 added
Milestone: Tor: 0.3.1.x-finalTor: unspecified

Deferring all 0.3.1 tickets with status == new, owner == nobody, sponsor == nobody, points > 0.5, and priority < high.

I'd still take patches for most of these -- there's just nobody currently lined up to work on them in this timeframe.

comment:10 Changed 17 months ago by nickm

Keywords: tor-relay directory metrics added
Note: See TracTickets for help on using tickets.