Opened 6 months ago

Last modified 11 days ago

#30344 merge_ready defect

conn_read_callback is called on connections that are marked for closed

Reported by: robgjansen Owned by: asn
Priority: Medium Milestone: Tor: 0.4.1.x-final
Component: Core Tor/Tor Version: Tor: 0.3.5.8
Severity: Normal Keywords: tor-conn, 035-backport, 041-deferred-20190530, 042-should, nickm-merge, BugSmashFund
Cc: iang, pastly Actual Points:
Parent ID: Points:
Reviewer: dgoulet Sponsor:

Description

There is a bug causing busy loops in Libevent and infinite loops in the Shadow simulator. In my Shadow simulations, I have observed that conn_read_callback is getting called on a connection that is marked for close.

This is similar to #5263, and as described there, I believe it is a bug for conn_read_callback to be called on sockets that are marked for close.

The bug is problematic in Shadow when the marked connection also wants to flush, but attempting to write the outbuf inside conn_close_if_marked fails because the write would block (because the kernel write buffer is already full). Because it still wants to flush, the connection does not get closed, but the connection remains readable causing Libevent to continue looping on conn_read_callback until the socket buffer can actually write. This wastes CPU resources in Tor, and causes an infinite loop in Shadow because Shadow's discrete-event time does not advance during this loop.

I found the bug on 0.3.5.8, and it is probably present at least since then.

I think the conn should not be waiting for read events when it is marked. I'm not sure where in the code this assumption is first violated, but the following patch fixed the issue in my Shadow simulations:

diff --git a/src/core/mainloop/mainloop.c b/src/core/mainloop/mainloop.c
index 6e2b300..e82c77a 100644
--- a/src/core/mainloop/mainloop.c
+++ b/src/core/mainloop/mainloop.c
@@ -894,6 +894,21 @@ conn_read_callback(evutil_socket_t fd, short event, void *_conn)
   }
   assert_connection_ok(conn, time(NULL));
 
+  if (conn->marked_for_close && connection_is_reading(conn)) {
+      /* Libevent says we can read, but we are marked so we will never try
+       * to read again. We will try to close the connection below inside of
+       * close_closeable_connections(), but let's make sure not to cause
+       * Libevent to spin on conn_read_callback() while we wait for the
+       * socket to let us flush to it.*/
+      connection_stop_reading(conn);
+
+      /* Now, if we still have data to flush, then make sure Libevent tells
+       * us when the conn will allow us to write the bytes. */
+      if (connection_wants_to_flush(conn) && !connection_is_writing(conn)) {
+          connection_start_writing(conn);
+      }
+  }
+
   if (smartlist_len(closeable_connection_lst))
     close_closeable_connections();
 }

Child Tickets

TicketStatusOwnerSummaryComponent
#31958closeddgouletconnection_dir_is_anonymous: Non-fatal assertion !(CONST_TO_OR_CIRCUIT(circ)->p_chan == NULL) failedCore Tor/Tor

Change History (18)

comment:1 Changed 6 months ago by robgjansen

I wonder if the better place to stop reading on marked connections is inside of connection_mark_for_close_internal_, which appears to be the only place (outside of testing code) where the conn->marked_for_close state variable is written (modified).

diff --git a/src/core/mainloop/connection.c b/src/core/mainloop/connection.c
index f2a646c..e24d349 100644
--- a/src/core/mainloop/connection.c
+++ b/src/core/mainloop/connection.c
@@ -941,6 +941,12 @@ connection_mark_for_close_internal_, (connection_t *conn,
    * the number of seconds since last successful write, so
    * we get our whole 15 seconds */
   conn->timestamp_last_write_allowed = time(NULL);
+
+  /* We should never listen for read events on marked connections, because
+   * we never try to actually read from the connection again. */
+  if (connection_is_reading(conn)) {
+    connection_stop_reading(conn);
+  }
 }
 
 /** Find each connection that has hold_open_until_flushed set to

An issue with this approach is even though we disable read events as above, we might enable it somewhere else (which would be a bug).

To ensure that bug never occurs, we could add a check in connection_start_reading so that we return if the connection is marked.

diff --git a/src/core/mainloop/mainloop.c b/src/core/mainloop/mainloop.c
index e82c77a..7922156 100644
--- a/src/core/mainloop/mainloop.c
+++ b/src/core/mainloop/mainloop.c
@@ -641,6 +641,10 @@ connection_start_reading,(connection_t *conn))
     return;
   }
 
+  if (conn->marked_for_close) {
+      return;
+  }
+
   if (conn->linked) {
     conn->reading_from_linked_conn = 1;
     if (connection_should_read_from_linked_conn(conn))

Or maybe the logic should be added to connection_check_event?

(Note that my diffs are on a custom branch and may not apply cleanly.)

comment:2 Changed 6 months ago by dgoulet

Keywords: tor-conn 035-backport added
Milestone: Tor: 0.4.1.x-final

comment:3 Changed 5 months ago by nickm

Keywords: 041-deferred-20190530 added

Marking these tickets as deferred from 041.

comment:4 Changed 5 months ago by nickm

Milestone: Tor: 0.4.1.x-finalTor: 0.4.2.x-final

comment:5 Changed 6 weeks ago by nickm

Keywords: 042-should added

comment:6 Changed 3 weeks ago by ahf

Owner: set to asn
Status: newassigned

Distributing 0.4.2 tickets between network team members.

comment:7 Changed 3 weeks ago by pastly

Cc: pastly added

comment:8 Changed 3 weeks ago by asn

Cc: pastly removed
Status: assignedneeds_information

Hello Rob, I took your patch from the top comment and slimmed it down a bit to https://github.com/torproject/tor/pull/1380

I removed the parts about the writing/flushing because I did not understand exactly what they were doing, and also because the patch from comment:1 did not seem to need them. What are we missing by not doing it?

Also I did not do the approach from comment:1 because we don't want to add stuff to connection_mark_for_close() (Also see the documentation change I did to this effect).

Let me know how you like this, and if it works for you please!

comment:9 Changed 3 weeks ago by robgjansen

Cc: pastly added

Hello! Pastly has been running some Shadow experiments lately where he was experiencing the issue described in this ticket. He fixed it in a 3rd place (i.e., not my patch and not your patch). He is going to go back and test your patch to make sure it does indeed fix the problem.

comment:10 Changed 3 weeks ago by asn

Thanks Rob and pastly. Let me know what you find and I will act accordingly.

comment:11 Changed 2 weeks ago by pastly

Status: needs_informationneeds_review

asn's branch fixes the issue.

Two experiments:

  1. asn's branch at 0efa7827e476f63c442ac0536aa874458449ef78. ~300 relay Shadow network with appropriate client load.
  1. same, but with at de66bed604377db23cfa303b83e385ef59121a64 (the commit just before asn's fix).

The first has been running for 48 of 60 simulation minutes and is still going strong. The second hit the infinite loop just 6 simulated minutes in and has died as a result of a shadow buffer overflowing from this infinite loop in Tor. Hitting the infinite loop 5 or 6 minutes in to the experiment is consistent with what I've been seeing. (The clients start generating traffic at the 5 minute mark)

Changing ticket to (rolls dice) needs_review

comment:12 Changed 13 days ago by dgoulet

Reviewer: dgoulet

comment:13 Changed 13 days ago by dgoulet

Keywords: nickm-merge added

This LGTM as soon as the changes file is changed and CI passes, merge_ready it should be!

(Oh also, there is a comment from teor about a typo in a comment :).

comment:14 Changed 12 days ago by asn

Status: needs_reviewmerge_ready

Pushed fixes to changes file and teor's comment. Marking as merge_ready.

comment:15 Changed 12 days ago by dgoulet

Status: merge_readyneeds_revision

asn, this is still failing on the CI.

Do not prefix versions with 'tor-'. ('0.1.2', not 'tor-0.1.2'.)

comment:16 Changed 11 days ago by asn

ugh sorry about that. force pushed the right fixup!

comment:17 Changed 11 days ago by asn

Status: needs_revisionmerge_ready

comment:18 Changed 11 days ago by nickm

Keywords: BugSmashFund added
Milestone: Tor: 0.4.2.x-finalTor: 0.4.1.x-final

Since this is marked for possible backport onto 0.3.5, I made a new bug30344_squashed_035 branch that rebases and squashes it. The new PR is at https://github.com/torproject/tor/pull/1405 .

I've merged it to master; marking it for backport.

Note: See TracTickets for help on using tickets.