Opened 4 months ago

Last modified 3 months ago

#29744 new defect

Streams sometimes stall for up to 1 hour without making any progress

Reported by: karsten Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: 0.3.0.7
Severity: Normal Keywords: needs-insight usability scalability
Cc: karsten, gk, starlight@…, gaba Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

We're measuring Tor performance using our OnionPerf tool by regularly downloading 5 MiB files over Tor. Some of these measurements run longer than 1 hour, after which a timeout in OnionPerf aborts them, or run for up to 30 minutes until they complete. (For comparison, 99% of successful runs complete within roughly two minutes.)

I noticed one particular source of slowness which I think is the reason for the application timeouts after 1 hour and for some of the 1% slowest successful runs: streams stall for seconds or minutes and would even stall for hours if we let them, without making any progress; and suddenly they make progress until they complete or stall again.

I'm attaching four graphs showing this problem. All these graphs show download progress over time with time on x and progress on y. Each gray bar is one measurement. The black line starts at the bottom of its gray bar and goes up to the top of that bar as more data is received. The number on the right is the stream ID.

The first two graphs show application timeouts, the last two show the slowest 1% of successful runs. First and third show downloads from a public server, second and fourth from an onion server.

Note that not all runs have this problem of stalling as described above. Some of the more obvious cases are:

  • Page 3, stream ID 436971: that stream basically does nothing for over half an hour and then completes within seconds.
  • Page 3, stream ID 436986: same as before, just with a shorter stalling period.

Other cases have different issues. For example, stream ID 34117 on page 3 is rather slow for most of the time and then suddenly gets faster at the end. However, it does not stall.

I do have tor logs and tor controller event logs for these cases. Here's a log containing many relevant STREAM and STREAM_BW events: https://people.torproject.org/~karsten/volatile/streams-2019-02-18.log.xz (61.1K)

These measurements have been made using tor versions 0.2.9.11-dev and 0.3.0.7-dev.

I can provide more data. But rather than uploading everything, please let me know what data would be most useful, and I'll provide just that.

Child Tickets

Attachments (3)

onionperf-stalling-downloads.pdf (78.3 KB) - added by karsten 4 months ago.
streams-op-ab.pdf (6.9 KB) - added by karsten 4 months ago.
streams-op-ab.log (44.9 KB) - added by karsten 4 months ago.

Download all attachments as: .zip

Change History (10)

Changed 4 months ago by karsten

comment:1 Changed 4 months ago by gk

Cc: gk added

comment:2 Changed 4 months ago by nickm

It would be useful to know if these results also happen with more recent release series.

comment:3 Changed 4 months ago by nickm

Keywords: needs-insight usability added
Milestone: Tor: unspecified

comment:4 Changed 4 months ago by karsten

I'll also look at op-ab logs which should be running a more recent tor version, and we're going to update the tor versions on the three OnionPerf instances mentioned above. This is going to take a few days if the op-ab logs contain something useful or a few weeks if we first need to update tor versions and make new measurements.

Changed 4 months ago by karsten

Attachment: streams-op-ab.pdf added

Changed 4 months ago by karsten

Attachment: streams-op-ab.log added

comment:5 Changed 4 months ago by karsten

I looked at the op-ab logs, created by tor version 0.3.5.7-dev, and tried to identify stalling runs. The data basis here is much smaller with only 2 months of logs from 1 instance as compared to 3 months of logs from 3 instances in my initial analysis above.

I found at least one measurement where the download stalled for 18 seconds (ID 103995) and a few more measurements with shorter stalls. I'm attaching a new graph and the controller events for the graphed measurements.

My current guess is that it's not an issue with the client but with one of the relays. And if it were "just" 18 seconds, it might even be acceptable for most use cases.

But I believe that most of the application-level timeouts after one hour and some of the really slow downloads in my first analysis suffer from the very same issue. Maybe a scheduling issue or a matter of picking a scheduling algorithm that treats all streams more fairly?

comment:6 Changed 4 months ago by starlight

Cc: starlight@… added

comment:7 Changed 3 months ago by gaba

Cc: gaba added
Keywords: scalability added
Note: See TracTickets for help on using tickets.