Opened 2 months ago

Closed 6 weeks ago

#34024 closed enhancement (implemented)

Reduce timeout and stallout values

Reported by: karsten Owned by: karsten
Priority: Medium Milestone:
Component: Metrics/Onionperf Version:
Severity: Normal Keywords: metrics-team-roadmap-2020
Cc: irl, robgjansen, acute, jnewsome, metrics-team Actual Points: 0.2
Parent ID: #33321 Points: 1
Reviewer: acute Sponsor: Sponsor59-must

Description

On #33974 we discussed a suggestion to reduce timeouts for our three downloads as follows:

  • 50 KiB download with 15 seconds timeout rather than 295 seconds,
  • 1 MiB download with 60 seconds timeout rather than 1795 seconds, and
  • 5 MiB download with 120 seconds timeout rather than 3595 seconds.

Similarly, stallouts would be dropped entirely:

  • 50 KiB download with 0 seconds stallout rather than 300 seconds,
  • 1 MiB download with 0 seconds stallout rather than 1800 seconds, and
  • 5 MiB download with 0 seconds stallout rather than 3600 seconds.

After discussing this with irl, we concluded that we might want to pick values somewhere in the middle. The smaller values above are used by TGen to generate load for Shadow simulations, where it makes sense to use timeouts similar to how users would behave. But in the measurements we're doing with OnionPerf, we can easily record more data even after a human user would have given up, and later filter out measurements taking longer than whatever timeouts we want to use.

In particular, it would be important for us to use timeouts that are higher than the timeouts used internally by the Tor client, so that we can observe what happens in those cases, even if a human user would have given up long before.

How about we use timeouts and stallouts close to 5 minutes, so that we avoid overlapping measurements? Like 270 seconds for all three download sizes? What would we use as stallout value here? 0?

Change History (11)

comment:1 Changed 2 months ago by robgjansen

In TGen, timeouts are absolute times: the timer starts when the download starts, and it counts as an error if the download has not completed within the configured number of seconds. Stallouts are configured as inter-byte-receive time, and only result in a stallout error if the time since the last byte you received is >= the configured stallout time. (So if you set a stallout of 10 seconds, and you get at least 1 byte every second, the stallout error will not get triggered.)

It seems like you always want to give each download attempt the same amount of absolute time (i.e., timeout), and so I think you should disable the stallout by setting it to 0 (otherwise an internal default of 30 seconds is used).

https://github.com/shadow/tgen/blob/master/doc/TGen-Options.md#stream-options
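
The distinction between the two timers can be sketched in a few lines of Python. This is only an illustration of the semantics described in this comment, not TGen's actual code: the function name `transfer_error`, its parameters, and the constant `DEFAULT_STALLOUT` are hypothetical.

```python
from typing import Optional

# Internal default mentioned in the comment above (seconds).
DEFAULT_STALLOUT = 30.0


def transfer_error(now: float, start: float, last_byte_at: float,
                   timeout: float,
                   stallout: float = DEFAULT_STALLOUT) -> Optional[str]:
    """Return "timeout", "stallout", or None for an unfinished transfer.

    Hypothetical helper: all times are in seconds on the same clock.
    """
    # Timeout is absolute: it counts from the start of the download.
    if now - start >= timeout:
        return "timeout"
    # Stallout counts from the last received byte; 0 disables the check.
    if stallout > 0 and now - last_byte_at >= stallout:
        return "stallout"
    return None


# A byte arrived 1 second ago, so a 10-second stallout does not trigger.
print(transfer_error(now=12.0, start=0.0, last_byte_at=11.0,
                     timeout=270.0, stallout=10.0))
# No bytes for 15 seconds: stallout triggers unless it is disabled with 0.
print(transfer_error(now=25.0, start=0.0, last_byte_at=10.0,
                     timeout=270.0, stallout=10.0))
print(transfer_error(now=25.0, start=0.0, last_byte_at=10.0,
                     timeout=270.0, stallout=0.0))
```

With the stallout set to 0, only the absolute timeout can end a download attempt, so every attempt gets the same amount of time.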

comment:2 Changed 2 months ago by karsten

Ah, that makes sense. Thanks for the explanation!

However, I'm going to deploy the three new OnionPerf instances without changing weights or timeouts and leave them running for a while to see whether results are different from currently running instances. If they are, we'll learn that something in the setup was different, and it would be good to keep the number of changes as small as possible. Coming back to this in May. Thanks for all the great input so far!

comment:3 Changed 2 months ago by gaba

Keywords: metrics-team-roadmap-2020 added
Points: 1
Sponsor: Sponsor59

comment:4 Changed 8 weeks ago by karsten

Sponsor: changed from Sponsor59 to Sponsor59-must

Moving to Sponsor59-must, because we should really do these in order to call Sponsor59 done.

comment:5 Changed 8 weeks ago by jnewsome

Cc: jnewsome added

comment:6 Changed 7 weeks ago by karsten

Status: changed from new to needs_review

I looked at measurements made by op-hk2, op-nl2, and op-us2 in the first half of May 2020. I wanted to get an idea of how many of the 5 MiB measurements would have timed out with a timeout of 270 rather than 3600 seconds. The result was: zero. For comparison, there were 1,027 successful 5 MiB measurements in that time, and the slowest four finished in 201 seconds, 186 seconds, 142 seconds, and 128 seconds. And while it's conceivable that we would find a non-zero number in other half months of measurements, that number will very likely be really small.
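
A minimal sketch of this kind of check, in Python. It is not the script used for the analysis above; the function name `count_exceeding` is hypothetical, and the input list contains only the four slowest durations quoted in this comment, not the full set of 1,027 measurements.

```python
from typing import Iterable


def count_exceeding(durations: Iterable[float], timeout: float) -> int:
    """Count measurements that took longer than `timeout` seconds.

    Hypothetical helper for illustration only.
    """
    return sum(1 for seconds in durations if seconds > timeout)


# The four slowest successful 5 MiB measurements quoted above (seconds).
slowest_four = [201, 186, 142, 128]

# None of them would have hit a 270-second timeout ...
print(count_exceeding(slowest_four, timeout=270))
# ... but all four exceed the 120-second value from the description.
print(count_exceeding(slowest_four, timeout=120))
```

Since these are the slowest four, no other measurement in that period can exceed 270 seconds either.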

Similar to #34023, let's discuss how to proceed here. My recommendation is that we:

  1. use a timeout of 270 seconds for all measurements to avoid overlapping measurements and
  2. set the stallout value to 0 to disable the stallout function in TGen.

More on Thursday at the weekly meeting.

comment:7 Changed 7 weeks ago by karsten

Parent ID: #33321

comment:8 Changed 6 weeks ago by karsten

Owner: changed from metrics-team to karsten
Status: changed from needs_review to accepted

We discussed this at the weekly team meeting on Thursday and agreed that this is an improvement and that we're going to do it. I'm working on a patch now.

comment:9 Changed 6 weeks ago by karsten

Cc: metrics-team added
Status: changed from accepted to needs_review

comment:10 Changed 6 weeks ago by acute

Actual Points: 0.1
Reviewer: acute
Status: changed from needs_review to merge_ready

The commit looks good and the transfer behaviour is as expected.
Adding 0.1 actual points and changing to merge_ready.

comment:11 Changed 6 weeks ago by karsten

Actual Points: changed from 0.1 to 0.2
Resolution: implemented
Status: changed from merge_ready to closed

Thanks for the review! Pushed to master, adding my 0.1 actual points, and closing.
