Opened 3 months ago
Last modified 3 months ago
#31521 new defect
Investigate 10-second delay in TTFB
Reported by: | karsten | Owned by: | metrics-team |
---|---|---|---|
Priority: | Medium | Milestone: | |
Component: | Metrics/Analysis | Version: | |
Severity: | Normal | Keywords: | scalability performance |
Cc: | metrics-team, arthuredelstein | Actual Points: | |
Parent ID: | Points: | ||
Reviewer: | Sponsor: |
Description
While looking into OnionPerf data I noticed a 10-second delay in time to first byte. I'll attach an ECDF shortly.
I started hunting down this issue and found that many of these cases (though not all of them) had their stream detached from a circuit and re-attached to another circuit following a 10-second timeout of some sort. The following example shows relevant controller events:
2019-05-05 09:55:00 1557046500.54 650 STREAM 45043 NEW 0 137.50.19.2:80 SOURCE_ADDR=127.0.0.1:36454 PURPOSE=USER 2019-05-05 09:55:00 1557046500.54 650 STREAM 45043 SENTCONNECT 29430 137.50.19.2:80 2019-05-05 09:55:00 1557046500.69 650 STREAM_BW 45043 13 2 2019-05-05T08:55:00.682587 ^^ <- 10 second delay here 2019-05-05 09:55:10 1557046510.69 650 STREAM 45043 DETACHED 29430 137.50.19.2:80 REASON=TIMEOUT 2019-05-05 09:55:10 1557046510.69 650 STREAM 45043 SENTCONNECT 29411 137.50.19.2:80 2019-05-05 09:55:11 1557046511.12 650 STREAM 45043 REMAP 29411 137.50.19.2:80 SOURCE=EXIT 2019-05-05 09:55:11 1557046511.12 650 STREAM 45043 SUCCEEDED 29411 137.50.19.2:80 2019-05-05 09:55:11 1557046511.68 650 STREAM_BW 45043 55 10 2019-05-05T08:55:11.682353 2019-05-05 09:55:12 1557046512.68 650 STREAM_BW 45043 0 637971 2019-05-05T08:55:12.681636 2019-05-05 09:55:13 1557046513.21 650 STREAM_BW 45043 0 410673 2019-05-05T08:55:13.211188 2019-05-05 09:55:13 1557046513.21 650 STREAM 45043 CLOSED 29411 137.50.19.2:80 REASON=DONE
1% of measurements seems a lot to me, and I could imagine that these cases are particularly annoying to users. Maybe this timeout could be shorter or made more dynamic like other timeouts.
If the timeout cannot be changed, it would be nice to tell the user that their stream has just been attached to another circuit and that that's why they had to wait for the past 10 seconds.
Child Tickets
Attachments (1)
Change History (5)
Changed 3 months ago by
Attachment: | onionperf-ten-second-delay-2019-08-26.png added |
---|
comment:1 Changed 3 months ago by
comment:3 Changed 3 months ago by
Keywords: | scalability performance added |
---|
comment:4 Changed 3 months ago by
Cc: | arthuredelstein added |
---|
This sounds about the rate of bad exit connectivity that arthuredelstein noticed during his scans a year or so ago. IIRC he noticed it for DNS timeouts though. In this case DNS appears not to be involved/at fault.
There are three ways to deal with this:
- Continue exit scanning constantly and hassle Exits to help diagnose/fix this.
- Test that pre-built Exit circuits can resolve a test domain like example.com (may not handle this case).
- Implement an Adaptive Stream Timeout (similar to/based on Circuit Build Timeout) rather than the fixed 10s value.
Option 2 is the only sure shot. There is a trac ticket where we investigated doing adaptive stream timeouts in addition/instead of CBT as part of a Google Summer of Code + Research project. It did not help the general case, but it would help this long tail. I can't find the ticket at the moment, but it is in the performance R&D kanban under the "Develop" column.
Note the steep incline shortly after the 10-second mark: