Opened 4 weeks ago

Last modified 7 days ago

#33033 needs_review defect

sbws stuck thinking a destination is dead

Reported by: tom Owned by:
Priority: High Milestone: sbws: 1.1.x-final
Component: Core Tor/sbws Version: sbws: 1.1.0
Severity: Normal Keywords: sbws-majority-blocker, sbws-roadmap
Cc: juga, ln5 Actual Points:
Parent ID: #33121 Points:
Reviewer: juga Sponsor:

Description

It looks like around Jan 19th we couldn't talk to our (singular) destination:

grep "The last 9 times the destination" * | cut -d ":"  -f 2 | sort | uniq -c
    507 Jan 19 20
   3132 Jan 19 21
   3121 Jan 19 22
   2809 Jan 19 23
   3120 Jan 20 00
   3110 Jan 20 01
   2800 Jan 20 02
   3128 Jan 20 03
   3130 Jan 20 04
   3139 Jan 20 05
   2834 Jan 20 06
   3151 Jan 20 07
   3158 Jan 20 08
   2844 Jan 20 09
   3174 Jan 20 10
   3180 Jan 20 11
   2862 Jan 20 12
   3192 Jan 20 13
   3200 Jan 20 14
   3212 Jan 20 15
   2878 Jan 20 16
   3180 Jan 20 17
   3181 Jan 20 18
   2846 Jan 20 19
   3136 Jan 20 20
   3130 Jan 20 21
   2823 Jan 20 22
   3150 Jan 20 23
   3150 Jan 21 00
   2835 Jan 21 01
   3150 Jan 21 02
   3152 Jan 21 03
   3162 Jan 21 04
   2854 Jan 21 05
   3180 Jan 21 06
   3180 Jan 21 07
   2862 Jan 21 08
   3190 Jan 21 09
   3190 Jan 21 10
   2862 Jan 21 11
   3180 Jan 21 12
   3171 Jan 21 13
   2853 Jan 21 14
   3171 Jan 21 15
   3170 Jan 21 16
   2846 Jan 21 17
   3153 Jan 21 18
   3136 Jan 21 19
   2805 Jan 21 20
   3110 Jan 21 21
   3110 Jan 21 22
   2794 Jan 21 23
   3100 Jan 22 00
   3100 Jan 22 01
   2790 Jan 22 02
   3108 Jan 22 03
   3124 Jan 22 04
   2817 Jan 22 05
   3136 Jan 22 06
   3150 Jan 22 07
   2837 Jan 22 08
   3160 Jan 22 09
   3162 Jan 22 10
   2853 Jan 22 11
   3170 Jan 22 12
   3172 Jan 22 13
   2871 Jan 22 14
   3190 Jan 22 15
   3190 Jan 22 16
   2871 Jan 22 17
   3190 Jan 22 18
   3154 Jan 22 19
   2835 Jan 22 20
   3142 Jan 22 21
   3140 Jan 22 22
   2812 Jan 22 23
   3120 Jan 23 00
   3120 Jan 23 01
   2814 Jan 23 02
   3130 Jan 23 03
   2812 Jan 23 04
   3130 Jan 23 05
   2198 Jan 23 06

However, we aren't retrying it properly (and the log message is incomplete: the %s placeholders are never filled in):

grep "was not tried" scanner.log* | cut -d ":" -f 2 | sort | uniq -c
      1 Jan 19 20
      3 Jan 19 21
      1 Jan 19 22
      1 Jan 19 23
      1 Jan 20 02
      1 Jan 20 07
      1 Jan 20 18
      1 Jan 21 15
grep "was not tried" scanner.log*
scanner.log.18:Jan 21 15:32:03 INFO Thread-2 destination.py:251 - is_functional - The destination %s was not tried for %s hours, it is going to by tried again.
scanner.log.28:Jan 20 18:15:50 INFO Thread-2 destination.py:251 - is_functional - The destination %s was not tried for %s hours, it is going to by tried again.
scanner.log.32:Jan 20 07:32:22 INFO Thread-2 destination.py:251 - is_functional - The destination %s was not tried for %s hours, it is going to by tried again.
scanner.log.35:Jan 20 02:10:50 INFO Thread-4 destination.py:251 - is_functional - The destination %s was not tried for %s hours, it is going to by tried again.
scanner.log.36:Jan 19 23:30:03 INFO Thread-4 destination.py:251 - is_functional - The destination %s was not tried for %s hours, it is going to by tried again.
scanner.log.37:Jan 19 20:55:41 INFO Thread-3 destination.py:251 - is_functional - The destination %s was not tried for %s hours, it is going to by tried again.
scanner.log.37:Jan 19 21:01:49 INFO Thread-3 destination.py:251 - is_functional - The destination %s was not tried for %s hours, it is going to by tried again.
scanner.log.37:Jan 19 21:14:09 INFO Thread-2 destination.py:251 - is_functional - The destination %s was not tried for %s hours, it is going to by tried again.
scanner.log.37:Jan 19 21:32:37 INFO Thread-4 destination.py:251 - is_functional - The destination %s was not tried for %s hours, it is going to by tried again.
scanner.log.37:Jan 19 22:15:55 INFO Thread-4 destination.py:251 - is_functional - The destination %s was not tried for %s hours, it is going to by tried again.
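
A side note on the incomplete message: with Python's stdlib logging, a %-style message that is given no arguments is emitted literally, so the bare %s above suggests the log.info() call in is_functional() never received its format arguments. A minimal, hypothetical illustration (not sbws code; the URL and hours are made up):

import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("example")

url, hours = "https://bw.bwauth.example/1G", 1.5

# Buggy: no arguments are passed, so the placeholders are printed literally.
log.info("The destination %s was not tried for %s hours, "
         "it is going to be tried again.")

# Fixed: the values are passed and substituted into the placeholders.
log.info("The destination %s was not tried for %s hours, "
         "it is going to be tried again.", url, hours)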

Child Tickets

Change History (8)

comment:1 Changed 4 weeks ago by tom

(Also, the destination is not currently dead.)

comment:2 Changed 3 weeks ago by tom

This happened again. I'm testing a local patch that places a max on the time to wait and adds more logging.

Something I have not been able to figure out, though, is why I get a massive number of log messages, all on the same thread, within the same second. I imagine this is part of why the failure cascades so quickly...

Jan 26 14:18:32 WARNING Thread-4 destination.py:245 - is_functional - Please, add more destinations or increment the number of maximum number of consecutive failures in the configuration.
Jan 26 14:18:32 WARNING Thread-3 destination.py:244 - is_functional - The last 9 times the destination https://bw.bwauth.domain/1G failed.Disabled for 5120.0 minutes.
Jan 26 14:18:32 WARNING Thread-3 destination.py:245 - is_functional - Please, add more destinations or increment the number of maximum number of consecutive failures in the configuration.
Jan 26 14:18:32 WARNING Thread-4 destination.py:244 - is_functional - The last 9 times the destination https://bw.bwauth.domain/1G failed.Disabled for 5120.0 minutes.
Jan 26 14:18:32 WARNING Thread-4 destination.py:245 - is_functional - Please, add more destinations or increment the number of maximum number of consecutive failures in the configuration.
Jan 26 14:18:32 WARNING Thread-3 destination.py:244 - is_functional - The last 9 times the destination https://bw.bwauth.domain/1G failed.Disabled for 5120.0 minutes.
Jan 26 14:18:32 WARNING Thread-3 destination.py:245 - is_functional - Please, add more destinations or increment the number of maximum number of consecutive failures in the configuration.
Jan 26 14:18:32 WARNING Thread-4 destination.py:244 - is_functional - The last 9 times the destination https://bw.bwauth.domain/1G failed.Disabled for 5120.0 minutes.
Jan 26 14:18:32 WARNING Thread-4 destination.py:245 - is_functional - Please, add more destinations or increment the number of maximum number of consecutive failures in the configuration.
Jan 26 14:18:32 WARNING Thread-3 destination.py:244 - is_functional - The last 9 times the destination https://bw.bwauth.domain/1G failed.Disabled for 5120.0 minutes.
Jan 26 14:18:32 WARNING Thread-3 destination.py:245 - is_functional - Please, add more destinations or increment the number of maximum number of consecutive failures in the configuration.
Jan 26 14:18:32 WARNING Thread-4 destination.py:244 - is_functional - The last 9 times the destination https://bw.bwauth.domain/1G failed.Disabled for 5120.0 minutes.
Jan 26 14:18:32 WARNING Thread-4 destination.py:245 - is_functional - Please, add more destinations or increment the number of maximum number of consecutive failures in the configuration.
Jan 26 14:18:32 WARNING Thread-3 destination.py:244 - is_functional - The last 9 times the destination https://bw.bwauth.domain/1G failed.Disabled for 5120.0 minutes.
Jan 26 14:18:32 WARNING Thread-3 destination.py:245 - is_functional - Please, add more destinations or increment the number of maximum number of consecutive failures in the configuration.
Jan 26 14:18:32 WARNING Thread-4 destination.py:244 - is_functional - The last 9 times the destination https://bw.bwauth.domain/1G failed.Disabled for 5120.0 minutes.
Jan 26 14:18:32 WARNING Thread-4 destination.py:245 - is_functional - Please, add more destinations or increment the number of maximum number of consecutive failures in the configuration.
Jan 26 14:18:32 WARNING Thread-3 destination.py:244 - is_functional - The last 9 times the destination https://bw.bwauth.domain/1G failed.Disabled for 5120.0 minutes.
Jan 26 14:18:32 WARNING Thread-3 destination.py:245 - is_functional - Please, add more destinations or increment the number of maximum number of consecutive failures in the configuration.
Jan 26 14:18:32 WARNING Thread-4 destination.py:244 - is_functional - The last 9 times the destination https://bw.bwauth.domain/1G failed.Disabled for 5120.0 minutes.
Jan 26 14:18:32 WARNING Thread-4 destination.py:245 - is_functional - Please, add more destinations or increment the number of maximum number of consecutive failures in the configuration.
Jan 26 14:18:32 WARNING Thread-3 destination.py:244 - is_functional - The last 9 times the destination https://bw.bwauth.domain/1G failed.Disabled for 5120.0 minutes.
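
For what it's worth, the uncapped multiplicative backoff would explain the 5120-minute figure above: the base delta is 5 minutes (DELTA_SECONDS_RETRY_DESTINATION in globals.py), and if the increment factor is 2 (my assumption; I haven't checked FACTOR_INCREMENT_DESTINATION_RETRY), ten increments reach 5 * 2^10 = 5120 minutes, about 3.5 days. A rough standalone sketch, not sbws code:

# Hypothetical model of the uncapped backoff described above.
delta_minutes = 5      # DELTA_SECONDS_RETRY_DESTINATION / 60
factor = 2             # assumed increment factor

for increment in range(1, 11):
    delta_minutes *= factor
    print(increment, delta_minutes, "minutes")
# The tenth increment prints 5120 minutes, matching "Disabled for 5120.0 minutes."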

comment:3 Changed 3 weeks ago by juga

Milestone: sbws: 1.1.x-final
Priority: Medium → High
Version: sbws: 1.1.0

Thanks, tom, for reporting this bug.
Would you share the patch as an attachment here or somewhere else?
I'm raising the priority of this ticket because it seems like a blocker for deploying more sbws instances.

comment:4 Changed 3 weeks ago by tom

Still in testing.

diff --git a/sbws/globals.py b/sbws/globals.py
index e9fd8dd..b50384b 100644
--- a/sbws/globals.py
+++ b/sbws/globals.py
@@ -130,6 +130,9 @@ NUM_DESTINATION_ATTEMPTS_STORED = 10
 # Because intermitent failures with CDN destinations, start trying again
 # after 5 min.
 DELTA_SECONDS_RETRY_DESTINATION = 60 * 5
+# No matter what, do not increase the wait time between destination retries
+# past this value.
+MAX_SECONDS_BETWEEN_DESTINATION_RETRIES = 60 * 60 * 3
 # Number of consecutive times a destination can fail before considering it
 # not functional.
 MAX_NUM_DESTINATION_FAILURES = 3
diff --git a/sbws/lib/destination.py b/sbws/lib/destination.py
index 3ccd94f..a53b8ca 100644
--- a/sbws/lib/destination.py
+++ b/sbws/lib/destination.py
@@ -11,6 +11,7 @@ import sbws.util.stem as stem_utils
 from ..globals import (
     MAX_NUM_DESTINATION_FAILURES,
     DELTA_SECONDS_RETRY_DESTINATION,
+    MAX_SECONDS_BETWEEN_DESTINATION_RETRIES,
     NUM_DESTINATION_ATTEMPTS_STORED,
     FACTOR_INCREMENT_DESTINATION_RETRY
     )
@@ -121,7 +122,7 @@ def connect_to_destination_over_circuit(dest, circ_id, session, cont, max_dl):
             '{} not {}'.format(requests.codes.ok, head.status_code)
     if 'content-length' not in head.headers:
         dest.add_failure()
-        return False, error_prefix + 'we except the header Content-Length '\
+        return False, error_prefix + 'we expect the header Content-Length '\
             'to exist in the response'
     content_length = int(head.headers['content-length'])
     if max_dl > content_length:
@@ -143,6 +144,7 @@ class Destination:
     def __init__(self, url, max_dl, verify,
                  max_num_failures=MAX_NUM_DESTINATION_FAILURES,
                  delta_seconds_retry=DELTA_SECONDS_RETRY_DESTINATION,
+                 max_seconds_between_retries=MAX_SECONDS_BETWEEN_DESTINATION_RETRIES,
                  num_attempts_stored=NUM_DESTINATION_ATTEMPTS_STORED,
                  factor_increment_retry=FACTOR_INCREMENT_DESTINATION_RETRY):
         """Initalizes the Web server from which the data is downloaded.
@@ -169,6 +171,8 @@ class Destination:
         # Default delta time to try a destination that was not functional.
         self._default_delta_seconds_retry = delta_seconds_retry
         self._delta_seconds_retry = delta_seconds_retry
+        # A cap on the time to wait between destination retries.
+        self._max_seconds_between_retries = max_seconds_between_retries
         # Using a deque (FIFO) to do not grow forever and
         # to do not have to remove old attempts.
         # Store tuples of timestamp and whether the destination succed or not
@@ -201,20 +205,30 @@ class Destination:
         Increment the time a destination will be tried again by a ``factor``.
         """
         self._delta_seconds_retry *= factor or self._factor
-        log.info("Incremented the time to try destination %s to %s hours.",
-                 self.url, self._delta_seconds_retry / 60 / 60)
+        if self._delta_seconds_retry > self._max_seconds_between_retries:
+            self._delta_seconds_retry = self._max_seconds_between_retries
+            log.info("Incremented the time to try destination %s past the "
+                     "limit, capping it at %s hours.",
+                     self.url, self._delta_seconds_retry / 60 / 60)
+        else:
+            log.info("Incremented the time to try destination %s to %s hours.",
+                     self.url, self._delta_seconds_retry / 60 / 60)
 
-    def _is_last_try_old_enough(self, n=None):
+    def _get_last_try_in_seconds_ago(self):
         """
-        Return True if the last time it was used it was ``n`` seconds ago.
+        Return the delta between the last try and now, as positive seconds.
         """
         # Timestamp of the last attempt.
         last_time = self._attempts[-1][0]
+        return (datetime.datetime.utcnow() - last_time).total_seconds()
+
+    def _is_last_try_old_enough(self, n=None):
+        """
+        Return True if the last time it was used it was ``n`` seconds ago.
+        """
         # If the last attempt is older than _delta_seconds_retry, try again
-        return (datetime.datetime.utcnow()
-                - datetime.timedelta(seconds=self._delta_seconds_retry)
-                > last_time)
-        return False
+        return (self._get_last_try_in_seconds_ago() >
+                self._delta_seconds_retry)
 
     def is_functional(self):
         """Whether connections to a destination are failing or not.
@@ -238,18 +252,21 @@ class Destination:
         if self._are_last_attempts_failures():
             # The log here will appear in all the the queued
             #  relays and threads.
-            log.warning("The last %s times the destination %s failed."
-                        "Disabled for %s minutes.",
+            log.warning("The last %s times the destination %s failed. "
+                        "It last ran %s seconds ago. "
+                        "Disabled for %s seconds.",
                         self._max_num_failures, self.url,
-                        self._delta_seconds_retry / 60)
+                        self._get_last_try_in_seconds_ago(),
+                        self._delta_seconds_retry)
             log.warning("Please, add more destinations or increment the "
                         "number of maximum number of consecutive failures "
                         "in the configuration.")
             # It was not used for a while and the last time it was used
             # was long ago, then try again
             if self._is_last_try_old_enough():
-                log.info("The destination %s was not tried for %s hours, "
-                         "it is going to by tried again.")
+                log.info("The destination %s was not tried for %s seconds, "
+                         "it is going to be tried again.", self.url,
+                         self._get_last_try_in_seconds_ago())
                 # Set the next time to retry higher, in case this attempt fails
                 self._increment_time_to_retry()
                 return True

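A possible regression test for the cap, sketched against the patched signature above. The keyword names come from the diff, but constructing a Destination this way and calling _increment_time_to_retry() with no argument are assumptions on my part, untested:

# Hypothetical pytest sketch for the capped backoff; not run against sbws.
from sbws.lib.destination import Destination

def test_retry_delta_is_capped():
    d = Destination("https://example.com/1G", max_dl=1024, verify=True,
                    delta_seconds_retry=60 * 5,
                    max_seconds_between_retries=60 * 60 * 3,
                    factor_increment_retry=2)
    # However many times the retry delta is incremented, it must stop at the cap.
    for _ in range(20):
        d._increment_time_to_retry()
    assert d._delta_seconds_retry <= 60 * 60 * 3
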
comment:5 Changed 3 weeks ago by teor

Keywords: sbws-majority-blocker sbws-roadmap added

comment:6 Changed 3 weeks ago by gaba

Parent ID: #33121

The goal is to deploy sbws in all bw authorities. We need to fix critical bugs to do this.

comment:7 Changed 8 days ago by tom

The patch above successfully let it recover from a destination timeout.

The problem definitely appears to be related to the fact that a ton of measurements will (of course) fail at the same time or very close together, meaning the failure count goes from 0 to 9 (or whatever the limit is) immediately.
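
A simplified, hypothetical model of that bookkeeping (the names mirror the patch above, but this is not the real class): with thousands of measurements in flight, the first few near-simultaneous failures fill the recent-attempts window before any thread gets a chance to retry.

import collections
import datetime
import threading

# Most recent (timestamp, success) attempts, roughly like Destination._attempts;
# NUM_DESTINATION_ATTEMPTS_STORED is 10 in globals.py.
attempts = collections.deque(maxlen=10)
lock = threading.Lock()

def add_failure():
    with lock:
        attempts.append((datetime.datetime.utcnow(), False))

def last_attempts_are_failures(max_num_failures=9):
    recent = list(attempts)[-max_num_failures:]
    return (len(recent) == max_num_failures
            and not any(success for _, success in recent))

# Nine measurement threads failing within the same second flip the destination
# to "not functional" at once; every later is_functional() call then emits the
# same pair of warnings, which is what the burst of log lines above looks like.
for _ in range(9):
    add_failure()
print(last_attempts_are_failures())   # True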

comment:8 Changed 7 days ago by juga

Reviewer: juga
Status: new → needs_review

That's great news, thanks for the info and the patch! Assigning myself as reviewer so that we can get the patch merged.
