Opened 12 months ago

Last modified 6 months ago

#31521 new task

Investigate 10-second delay in TTFB

Reported by: karsten Owned by: metrics-team
Priority: Medium Milestone:
Component: Metrics/Analysis Version:
Severity: Normal Keywords: scalability performance
Cc: metrics-team, arthuredelstein Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

While looking into OnionPerf data I noticed a 10-second delay in time to first byte. I'll attach an ECDF shortly.

I started hunting down this issue and found that many of these cases (though not all of them) had their stream detached from a circuit and re-attached to another circuit following a 10-second timeout of some sort. The following example shows relevant controller events:

2019-05-05 09:55:00 1557046500.54 650 STREAM 45043 NEW 0 137.50.19.2:80 SOURCE_ADDR=127.0.0.1:36454 PURPOSE=USER
2019-05-05 09:55:00 1557046500.54 650 STREAM 45043 SENTCONNECT 29430 137.50.19.2:80
2019-05-05 09:55:00 1557046500.69 650 STREAM_BW 45043 13 2 2019-05-05T08:55:00.682587
                 ^^ <- 10 second delay here
2019-05-05 09:55:10 1557046510.69 650 STREAM 45043 DETACHED 29430 137.50.19.2:80 REASON=TIMEOUT
2019-05-05 09:55:10 1557046510.69 650 STREAM 45043 SENTCONNECT 29411 137.50.19.2:80
2019-05-05 09:55:11 1557046511.12 650 STREAM 45043 REMAP 29411 137.50.19.2:80 SOURCE=EXIT
2019-05-05 09:55:11 1557046511.12 650 STREAM 45043 SUCCEEDED 29411 137.50.19.2:80
2019-05-05 09:55:11 1557046511.68 650 STREAM_BW 45043 55 10 2019-05-05T08:55:11.682353
2019-05-05 09:55:12 1557046512.68 650 STREAM_BW 45043 0 637971 2019-05-05T08:55:12.681636
2019-05-05 09:55:13 1557046513.21 650 STREAM_BW 45043 0 410673 2019-05-05T08:55:13.211188
2019-05-05 09:55:13 1557046513.21 650 STREAM 45043 CLOSED 29411 137.50.19.2:80 REASON=DONE

1% of measurements seems a lot to me, and I could imagine that these cases are particularly annoying to users. Maybe this timeout could be shorter or made more dynamic like other timeouts.

If the timeout cannot be changed, it would be nice to tell the user that their stream has just been attached to another circuit and that that's why they had to wait for the past 10 seconds.

Child Tickets

Attachments (1)

onionperf-ten-second-delay-2019-08-26.png (141.0 KB) - added by karsten 12 months ago.

Download all attachments as: .zip

Change History (8)

Changed 12 months ago by karsten

comment:1 Changed 12 months ago by karsten

Note the steep incline shortly after the 10-second mark:


comment:3 Changed 12 months ago by gaba

Keywords: scalability performance added

comment:4 Changed 12 months ago by mikeperry

Cc: arthuredelstein added

This sounds about the rate of bad exit connectivity that arthuredelstein noticed during his scans a year or so ago. IIRC he noticed it for DNS timeouts though. In this case DNS appears not to be involved/at fault.

There are three ways to deal with this:

  1. Continue exit scanning constantly and hassle Exits to help diagnose/fix this.
  2. Test that pre-built Exit circuits can resolve a test domain like example.com (may not handle this case).
  3. Implement an Adaptive Stream Timeout (similar to/based on Circuit Build Timeout) rather than the fixed 10s value.

Option 2 is the only sure shot. There is a trac ticket where we investigated doing adaptive stream timeouts in addition/instead of CBT as part of a Google Summer of Code + Research project. It did not help the general case, but it would help this long tail. I can't find the ticket at the moment, but it is in the performance R&D kanban under the "Develop" column.

Last edited 12 months ago by mikeperry (previous) (diff)

comment:5 Changed 8 months ago by arthuredelstein

I'm still running the exit scan, at https://arthuredelstein.net/exits

Two other possible fixes could be:

  1. Relays disabled themselves when something (DNS or other essential feature) is not working. #24014 covers the DNS case.
  2. Continuously run scans and flag broken exits with the BADEXIT flag.
Last edited 8 months ago by arthuredelstein (previous) (diff)

comment:6 Changed 7 months ago by karsten

Type: defecttask

Cleaning up tickets a bit to get a clearer picture of our product backlog. This is not a bug in metrics code, so defect is not the right ticket type. It's rather a task.

comment:7 Changed 6 months ago by dennis.jackson

Some data suggesting this is related to exit node congestion is discussed in ticket:33076#comment:23

Note: See TracTickets for help on using tickets.