Opened 3 months ago

Last modified 3 months ago

#31521 new defect

Investigate 10-second delay in TTFB

Reported by: karsten Owned by: metrics-team
Priority: Medium Milestone:
Component: Metrics/Analysis Version:
Severity: Normal Keywords: scalability performance
Cc: metrics-team, arthuredelstein Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

While looking into OnionPerf data I noticed a 10-second delay in time to first byte. I'll attach an ECDF shortly.

I started hunting down this issue and found that many of these cases (though not all of them) had their stream detached from a circuit and re-attached to another circuit following a 10-second timeout of some sort. The following example shows relevant controller events:

2019-05-05 09:55:00 1557046500.54 650 STREAM 45043 NEW 0 137.50.19.2:80 SOURCE_ADDR=127.0.0.1:36454 PURPOSE=USER
2019-05-05 09:55:00 1557046500.54 650 STREAM 45043 SENTCONNECT 29430 137.50.19.2:80
2019-05-05 09:55:00 1557046500.69 650 STREAM_BW 45043 13 2 2019-05-05T08:55:00.682587
                 ^^ <- 10 second delay here
2019-05-05 09:55:10 1557046510.69 650 STREAM 45043 DETACHED 29430 137.50.19.2:80 REASON=TIMEOUT
2019-05-05 09:55:10 1557046510.69 650 STREAM 45043 SENTCONNECT 29411 137.50.19.2:80
2019-05-05 09:55:11 1557046511.12 650 STREAM 45043 REMAP 29411 137.50.19.2:80 SOURCE=EXIT
2019-05-05 09:55:11 1557046511.12 650 STREAM 45043 SUCCEEDED 29411 137.50.19.2:80
2019-05-05 09:55:11 1557046511.68 650 STREAM_BW 45043 55 10 2019-05-05T08:55:11.682353
2019-05-05 09:55:12 1557046512.68 650 STREAM_BW 45043 0 637971 2019-05-05T08:55:12.681636
2019-05-05 09:55:13 1557046513.21 650 STREAM_BW 45043 0 410673 2019-05-05T08:55:13.211188
2019-05-05 09:55:13 1557046513.21 650 STREAM 45043 CLOSED 29411 137.50.19.2:80 REASON=DONE

1% of measurements seems a lot to me, and I could imagine that these cases are particularly annoying to users. Maybe this timeout could be shorter or made more dynamic like other timeouts.

If the timeout cannot be changed, it would be nice to tell the user that their stream has just been attached to another circuit and that that's why they had to wait for the past 10 seconds.

Child Tickets

Attachments (1)

onionperf-ten-second-delay-2019-08-26.png (141.0 KB) - added by karsten 3 months ago.

Download all attachments as: .zip

Change History (5)

Changed 3 months ago by karsten

comment:1 Changed 3 months ago by karsten

Note the steep incline shortly after the 10-second mark:


comment:3 Changed 3 months ago by gaba

Keywords: scalability performance added

comment:4 Changed 3 months ago by mikeperry

Cc: arthuredelstein added

This sounds about the rate of bad exit connectivity that arthuredelstein noticed during his scans a year or so ago. IIRC he noticed it for DNS timeouts though. In this case DNS appears not to be involved/at fault.

There are three ways to deal with this:

  1. Continue exit scanning constantly and hassle Exits to help diagnose/fix this.
  2. Test that pre-built Exit circuits can resolve a test domain like example.com (may not handle this case).
  3. Implement an Adaptive Stream Timeout (similar to/based on Circuit Build Timeout) rather than the fixed 10s value.

Option 2 is the only sure shot. There is a trac ticket where we investigated doing adaptive stream timeouts in addition/instead of CBT as part of a Google Summer of Code + Research project. It did not help the general case, but it would help this long tail. I can't find the ticket at the moment, but it is in the performance R&D kanban under the "Develop" column.

Last edited 3 months ago by mikeperry (previous) (diff)
Note: See TracTickets for help on using tickets.