Opened 3 years ago

Closed 9 days ago

#17488 closed defect (fixed)

ExoneraTor hangs forever on old known-positive test

Reported by: starlight Owned by: metrics-team
Priority: Medium Milestone:
Component: Metrics/ExoneraTor Version:
Severity: Normal Keywords: metrics-2018
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

98.113.149.36
2011-04-29

Child Tickets

Change History (19)

comment:1 Changed 3 years ago by karsten

Owner: set to karsten
Priority: Medium → Very High
Status: new → accepted

Indeed, that query really keeps the database busy. I'm still investigating. Will post here once I know what's going on or have a fix. Thanks for reporting.

comment:2 Changed 3 years ago by karsten

Some early results: it seems to be related to that relay being contained in TorDNSEL exit lists; looking up such relays takes much, much longer than looking up relays that aren't. I'll look more into tweaking that query.

comment:3 Changed 3 years ago by karsten

I'm stuck, but I'll post my findings here in the hope that somebody else can resolve this mystery.

The bottleneck seems to be the second select in function search_statusentries_by_address_date, that "finds status entries of relays having an exit list entry with the provided IP address as the exit address." That select can be really slow depending on whether the provided IP address is found in an exit list or not. Consider date 2011-04-29 and IP addresses 98.113.149.36 (the address mentioned in this ticket) and 86.59.21.38 (tor26, not an exit):

EXPLAIN ANALYZE
  SELECT statusentry.rawstatusentry,
        statusentry.descriptor,
        statusentry.validafter,
        statusentry.fingerprint,
        HOST(statusentry.oraddress),
        HOST(exitlistentry.exitaddress),
        -- Pick only the last scan result that took place in the 24 hours
        -- before the valid-after time.
        MAX(exitlistentry.scanned)
      FROM statusentry
      JOIN exitlistentry
      ON statusentry.fingerprint = exitlistentry.fingerprint
      WHERE exitlistentry.exitaddress = '98.113.149.36'::INET
      -- Focus on a time period from 1 day before and 1 day after the
      -- given date.  Also include a second day before the given date
      -- for exit lists, because it can take up to 24 hours to scan a
      -- relay again.  We shouldn't miss exit list entries here.
      AND DATE(exitlistentry.scanned) >= '2011-04-27'
      AND DATE(exitlistentry.scanned) <= '2011-04-30'
      AND DATE(statusentry.validafter) >= '2011-04-28'
      AND DATE(statusentry.validafter) <= '2011-04-30'
      -- Consider only exit list scans that took place in the 24 hours
      -- before the relay was listed in a consensus.
      AND statusentry.validafter >= exitlistentry.scanned
      AND statusentry.validafter - exitlistentry.scanned <=
          '1 day'::INTERVAL
      GROUP BY 1, 2, 3, 4, 5, 6;

The query plans are as follows, first for 98.113.149.36, then for 86.59.21.38 (tor26):

                                                                               QUERY PLAN                                                                               
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=11062.95..11062.97 rows=1 width=342) (actual time=100644.027..100644.202 rows=57 loops=1)
   ->  Nested Loop  (cost=0.01..11062.94 rows=1 width=342) (actual time=122.074..100643.033 rows=84 loops=1)
         Join Filter: ((statusentry.validafter - exitlistentry.scanned) <= '1 day'::interval)
         ->  Index Scan using exitlistentry_exitaddress_scanneddate on exitlistentry  (cost=0.01..8.82 rows=1 width=56) (actual time=23.242..68.592 rows=4 loops=1)
               Index Cond: ((exitaddress = '98.113.149.36'::inet) AND (date(scanned) >= '2011-04-27'::date) AND (date(scanned) <= '2011-04-30'::date))
         ->  Index Scan using statusentry_fingerprint_validafter on statusentry  (cost=0.00..11054.09 rows=1 width=327) (actual time=26.442..25142.994 rows=60 loops=4)
               Index Cond: ((fingerprint = exitlistentry.fingerprint) AND (validafter >= exitlistentry.scanned))
               Filter: ((date(validafter) >= '2011-04-28'::date) AND (date(validafter) <= '2011-04-30'::date))
 Total runtime: 100644.583 ms
(9 rows)
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=11062.95..11062.97 rows=1 width=342) (actual time=11.782..11.782 rows=0 loops=1)
   ->  Nested Loop  (cost=0.01..11062.94 rows=1 width=342) (actual time=11.773..11.773 rows=0 loops=1)
         Join Filter: ((statusentry.validafter - exitlistentry.scanned) <= '1 day'::interval)
         ->  Index Scan using exitlistentry_exitaddress_scanneddate on exitlistentry  (cost=0.01..8.82 rows=1 width=56) (actual time=11.766..11.766 rows=0 loops=1)
               Index Cond: ((exitaddress = '86.59.21.38'::inet) AND (date(scanned) >= '2011-04-27'::date) AND (date(scanned) <= '2011-04-30'::date))
         ->  Index Scan using statusentry_fingerprint_validafter on statusentry  (cost=0.00..11054.09 rows=1 width=327) (never executed)
               Index Cond: ((fingerprint = exitlistentry.fingerprint) AND (validafter >= exitlistentry.scanned))
               Filter: ((date(validafter) >= '2011-04-28'::date) AND (date(validafter) <= '2011-04-30'::date))
 Total runtime: 11.953 ms
(9 rows)

Note that I have seen much worse total runtimes for the first query, up to about 10 times as long as shown here.

Also note that I created two new indexes that aren't contained in ExoneraTor's database schema yet:

CREATE INDEX statusentry_validafter_fingerprint
    ON statusentry (validafter, fingerprint);
CREATE INDEX statusentry_fingerprint_validafter
    ON statusentry (fingerprint, validafter);

Before these indexes existed, the above query used statusentry_pkey, but overall runtimes were comparable. And 1.5 minutes, or even 15 minutes, is obviously unacceptable for this query; 1.5 seconds would be acceptable.

Hmm. What's the index that we're obviously missing? (I'd rather not change the table structure, but if that's really, really the only way to fix this, let's talk about it.)
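
One thing that might be worth trying (a hedged sketch, not tested against this database): the DATE(statusentry.validafter) conditions show up as a per-row Filter in both plans above, because the planner can't turn the DATE() expression into bounds on the raw validafter column. A half-open timestamp range is equivalent and sargable:

-- Drop-in replacement for the two DATE(statusentry.validafter)
-- conditions in the query above; the '<' bound on the following
-- day is equivalent to DATE(validafter) <= '2011-04-30'.
AND statusentry.validafter >= '2011-04-28'::TIMESTAMP
AND statusentry.validafter < '2011-05-01'::TIMESTAMP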

Feedback much appreciated!

comment:4 Changed 3 years ago by starlight

A guess is that

AND statusentry.validafter - exitlistentry.scanned <= '1 day'::INTERVAL

is causing the DB engine to perform the subtraction for every row in the index/column.
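
If that's what's happening, one standard rewrite (again a sketch, untested here) is to move the arithmetic onto the scanned side of the comparison, replacing the subtraction clause quoted above, so that the upper bound on validafter can serve as an index condition instead of a join filter:

-- Equivalent to the subtraction above, but sargable: both bounds
-- on validafter can now come from the (fingerprint, validafter) index.
AND statusentry.validafter <=
    exitlistentry.scanned + '1 day'::INTERVAL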

comment:5 Changed 3 years ago by starlight

Actually, I'll bet that clause induces construction of a permuted temporary index based on all the rows from each table!

comment:6 Changed 3 years ago by karsten

Here is the same query again with the two interval comparisons removed:

exonerator=> EXPLAIN ANALYZE
exonerator->   SELECT statusentry.rawstatusentry,
exonerator->         statusentry.descriptor,
exonerator->         statusentry.validafter,
exonerator->         statusentry.fingerprint,
exonerator->         HOST(statusentry.oraddress),
exonerator->         HOST(exitlistentry.exitaddress),
exonerator->         -- Pick only the last scan result that took place in the 24 hours
exonerator->         -- before the valid-after time.
exonerator->         MAX(exitlistentry.scanned)
exonerator->       FROM statusentry
exonerator->       JOIN exitlistentry
exonerator->       ON statusentry.fingerprint = exitlistentry.fingerprint
exonerator->       WHERE exitlistentry.exitaddress = '98.113.149.36'::INET
exonerator->       -- Focus on a time period from 1 day before and 1 day after the
exonerator->       -- given date.  Also include a second day before the given date
exonerator->       -- for exit lists, because it can take up to 24 hours to scan a
exonerator->       -- relay again.  We shouldn't miss exit list entries here.
exonerator->       AND DATE(exitlistentry.scanned) >= '2011-04-27'
exonerator->       AND DATE(exitlistentry.scanned) <= '2011-04-30'
exonerator->       AND DATE(statusentry.validafter) >= '2011-04-28'
exonerator->       AND DATE(statusentry.validafter) <= '2011-04-30'
exonerator->       GROUP BY 1, 2, 3, 4, 5, 6;
                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=32395.69..32395.75 rows=4 width=342) (actual time=53974.565..53974.673 rows=72 loops=1)
   ->  Nested Loop  (cost=454.67..32395.62 rows=4 width=342) (actual time=10130.628..53972.490 rows=288 loops=1)
         ->  Index Scan using exitlistentry_exitaddress_scanneddate on exitlistentry  (cost=0.01..8.82 rows=1 width=56) (actual time=16.972..43.140 rows=4 loops=1)
               Index Cond: ((exitaddress = '98.113.149.36'::inet) AND (date(scanned) >= '2011-04-27'::date) AND (date(scanned) <= '2011-04-30'::date))
         ->  Bitmap Heap Scan on statusentry  (cost=454.66..32285.65 rows=8090 width=327) (actual time=2538.725..13481.531 rows=72 loops=4)
               Recheck Cond: (fingerprint = exitlistentry.fingerprint)
               Filter: ((date(validafter) >= '2011-04-28'::date) AND (date(validafter) <= '2011-04-30'::date))
               ->  Bitmap Index Scan on statusentry_fingerprint_validafter  (cost=0.00..454.66 rows=8090 width=0) (actual time=14.937..14.937 rows=13735 loops=4)
                     Index Cond: (fingerprint = exitlistentry.fingerprint)
 Total runtime: 53974.888 ms
(10 rows)

comment:7 Changed 3 years ago by starlight

It did cut the time in half.

Maybe the other AND clauses are having some similar sort of impact, causing the DB engine to build full lists before starting the evaluation?

Can the query be simplified or broken into more than one step to prevent mysql from sub-optimizing? Any way to add processing hints? (I'm not a SQL expert.) Or maybe trim the set from the JOIN in memory after retrieval instead of using the WHERE clause?
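
One way to express that two-step idea in PostgreSQL (a hedged sketch against the tables used above, untested here) is a CTE, which in PostgreSQL versions of that era acted as an optimization fence, forcing the small exit list lookup to be materialized before the join:

WITH matching_exits AS (
  -- Step 1: materialize the handful of exit list rows for this address.
  SELECT fingerprint, exitaddress, scanned
      FROM exitlistentry
      WHERE exitaddress = '98.113.149.36'::INET
      AND DATE(scanned) >= '2011-04-27'
      AND DATE(scanned) <= '2011-04-30'
)
-- Step 2: join only those rows against statusentry.
SELECT statusentry.fingerprint,
      statusentry.validafter,
      HOST(matching_exits.exitaddress),
      MAX(matching_exits.scanned)
    FROM statusentry
    JOIN matching_exits
    ON statusentry.fingerprint = matching_exits.fingerprint
    WHERE DATE(statusentry.validafter) >= '2011-04-28'
    AND DATE(statusentry.validafter) <= '2011-04-30'
    AND statusentry.validafter >= matching_exits.scanned
    AND statusentry.validafter - matching_exits.scanned <= '1 day'::INTERVAL
    GROUP BY 1, 2, 3;

Whether the fence actually changes anything here is unclear, though: the plans above already drive the nested loop from the exit list side, so the expensive part is the inner statusentry scan either way.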

comment:8 Changed 3 years ago by karsten

I tried out something different: we include hex representations of /24's (or /48's in case of IPv6) in the database, and I tried to query for those instead of the exact IP address. And it seems that the database can optimize those queries much better. 0.3 seconds instead of 54 seconds is much more reasonable:

exonerator=> EXPLAIN ANALYZE
exonerator->   SELECT statusentry.rawstatusentry,
exonerator->         statusentry.descriptor,
exonerator->         statusentry.validafter,
exonerator->         statusentry.fingerprint,
exonerator->         HOST(statusentry.oraddress),
exonerator->         HOST(exitlistentry.exitaddress),
exonerator->         -- Pick only the last scan result that took place in the 24 hours
exonerator->         -- before the valid-after time.
exonerator->         MAX(exitlistentry.scanned)
exonerator->       FROM statusentry
exonerator->       JOIN exitlistentry
exonerator->       ON statusentry.fingerprint = exitlistentry.fingerprint
exonerator->       WHERE exitlistentry.exitaddress24 = '627195'            -- ONLY CHANGE --
exonerator->       -- Focus on a time period from 1 day before and 1 day after the
exonerator->       -- given date.  Also include a second day before the given date
exonerator->       -- for exit lists, because it can take up to 24 hours to scan a
exonerator->       -- relay again.  We shouldn't miss exit list entries here.
exonerator->       AND DATE(exitlistentry.scanned) >= '2011-04-27'
exonerator->       AND DATE(exitlistentry.scanned) <= '2011-04-30'
exonerator->       AND DATE(statusentry.validafter) >= '2011-04-28'
exonerator->       AND DATE(statusentry.validafter) <= '2011-04-30'
exonerator->       -- Consider only exit list scans that took place in the 24 hours
exonerator->       -- before the relay was listed in a consensus.
exonerator->       AND statusentry.validafter >= exitlistentry.scanned
exonerator->       AND statusentry.validafter - exitlistentry.scanned <=
exonerator->           '1 day'::INTERVAL
exonerator->       GROUP BY 1, 2, 3, 4, 5, 6;
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=11181.55..11181.56 rows=1 width=342) (actual time=333.717..333.810 rows=57 loops=1)
   ->  Nested Loop  (cost=0.01..11181.53 rows=1 width=342) (actual time=0.513..333.262 rows=84 loops=1)
         Join Filter: ((statusentry.validafter - exitlistentry.scanned) <= '1 day'::interval)
         ->  Index Scan using exitlistentry_exitaddress24_scanneddate on exitlistentry  (cost=0.01..8.83 rows=1 width=56) (actual time=0.121..0.273 rows=4 loops=1)
               Index Cond: ((exitaddress24 = '627195'::bpchar) AND (date(scanned) >= '2011-04-27'::date) AND (date(scanned) <= '2011-04-30'::date))
         ->  Index Scan using statusentry_fingerprint_validafter on statusentry  (cost=0.00..11172.68 rows=1 width=327) (actual time=0.106..83.038 rows=60 loops=4)
               Index Cond: ((fingerprint = exitlistentry.fingerprint) AND (validafter >= exitlistentry.scanned))
               Filter: ((date(validafter) >= '2011-04-28'::date) AND (date(validafter) <= '2011-04-30'::date))
 Total runtime: 334.174 ms
(9 rows)

I have an implementation here that contains the necessary SQL changes and also converts IP addresses on the Java side, but I'll have to run some more tests first. So far I'm optimistic.
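
For reference, the key '627195' above is the hexadecimal encoding of the address's first three octets (98 = 0x62, 113 = 0x71, 149 = 0x95). Here's a sketch of computing it in SQL, purely for illustration; the actual conversion happens on the Java side as mentioned above:

-- Derive the 6-hex-digit /24 key from an IPv4 address;
-- for 98.113.149.36 this yields '627195'.
SELECT LPAD(TO_HEX(SPLIT_PART(HOST(ip), '.', 1)::INTEGER * 65536
                 + SPLIT_PART(HOST(ip), '.', 2)::INTEGER * 256
                 + SPLIT_PART(HOST(ip), '.', 3)::INTEGER), 6, '0')
           AS exitaddress24
  FROM (SELECT '98.113.149.36'::INET AS ip) AS t;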

comment:9 Changed 3 years ago by starlight

Interesting. It seems that IP address joins are a known problem area with MySQL; Googling

"mysql join on ip address slow"

pulls up many relevant hits, including

http://forums.mysql.com/read.php?24,229499

which describes several alternate approaches and says the slowness has something to do with it being a "theta join".

comment:10 Changed 3 years ago by karsten

Status: accepted → needs_review

Thanks for digging that out, but we're using PostgreSQL, not MySQL. However, I'm actually curious about making this change to always search for all hits in the same /24 or /48, because I'd like to avoid making more than one database query per request. But that's another ticket; we should first get this one fixed.

I just pushed branch-17488 to my public repository and deployed that branch on https://exonerator.torproject.org/. Want to give it a try and see if there are still performance problems with any query you come up with?

comment:11 Changed 3 years ago by starlight

For future reference: the "alternate approaches" in the referenced thread (which begins with a comparative PostgreSQL observation) were specific to MySQL. MySQL's IP address type appears intricate.

comment:12 Changed 14 months ago by iwakeh

What status do we have here?

comment:13 Changed 14 months ago by karsten

The status is that we're currently running master, which sometimes has rather slow responses. I also have an improved database schema somewhere that I worked on together with Sebastian and that would likely resolve this issue. Oh, and we included this task in a funding proposal but might take it out of that proposal again. And there's a plan to extract the database part of ExoneraTor and merge it into a common tool together with Onionoo and the data-processing modules of Tor Metrics. All in all, maybe not the best ticket to make progress on in the near future. Lots of dependencies on other tasks.

comment:14 Changed 13 months ago by karsten

Owner: changed from karsten to metrics-team
Status: needs_review → assigned

Handing over to metrics-team, because I'm not currently working on this.

comment:15 Changed 12 months ago by karsten

Keywords: metrics-2018 added

comment:16 Changed 12 months ago by karsten

Keywords: metrics-2017 added; metrics-2018 removed

comment:17 Changed 9 months ago by iwakeh

Keywords: metrics-2018 added; metrics-2017 removed

Will be completed in 2018.

comment:18 Changed 7 months ago by karsten

Priority: Very High → Medium
Severity: Major → Normal

This is still something we want to do, but it's not The Most Important Thing. Changing priority back to the default (medium).

comment:19 Changed 9 days ago by karsten

Resolution: fixed
Status: assigned → closed

This is finally resolved by #27356 and released in version 4.0.0. Closing.
