Opened 3 months ago

Last modified 2 months ago

#23816 merge_ready defect

Use exponential backoff with jitter and/or tune its parameters

Reported by: nickm Owned by: nickm
Priority: Medium Milestone: Tor: 0.3.1.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords:
Cc: catalyst Actual Points:
Parent ID: Points:
Reviewer: catalyst Sponsor: SponsorV

Description

Child Tickets

Change History (14)

comment:1 Changed 3 months ago by nickm

Owner: set to nickm
Status: newaccepted

comment:2 Changed 3 months ago by nickm

They recommend one of two approaches. "Full Jitter":

sleep = random_between(0, min(cap, base * 2 ** attempt))

Or "Decorrelated Jitter":

sleep = min(cap, random_between(base, sleep * 3))

comment:3 Changed 3 months ago by nickm

Status: acceptedneeds_review

bug23816_029 has an implementation of the two suggested algorithms, for merge into 0.2.9 or later.

I don't know what the changes file should say here. :)

comment:4 Changed 3 months ago by nickm

Keywords: review-group-24 added

review-group-24 is now open.

comment:5 Changed 3 months ago by nickm

Sponsor: SponsorV

comment:6 Changed 3 months ago by catalyst

Reviewer: catalyst

comment:7 Changed 2 months ago by catalyst

Summary: Use expontial backoff with jitter and/or tune its parametersUse exponetial backoff with jitter and/or tune its parameters

I'm going to start by summarizing my comparison of the algorithms.

The existing algorithm is a weird hybrid of decorrelated jitter and equal jitter. Each delay is at least the size of the preceding one, plus a random value capped at a fixed multiple of the preceding delay. This means in the worst case, di+1 = di + di * 3 + 1 = di * 4 + 1, or approximately, di = 4i.

For full jitter, the worst case is di = base * mi.

For decorrelated jitter, the worst case is also di = base *mi but any small variation in an early delay step magnifies in later steps.

The examples in the blog show m=2 for full jitter and m=3 for decorrelated jitter. I'm not sure why they chose that variation.

The existing algorithm has some of the drawbacks of equal jitter, such as missing some possible time slots because it tries to always include a fixed delay component that's longer than the previous, so either decorrelated jitter or full jitter would be an improvement.

comment:8 Changed 2 months ago by catalyst

Summary: Use exponetial backoff with jitter and/or tune its parametersUse exponential backoff with jitter and/or tune its parameters

comment:9 Changed 2 months ago by catalyst

rough estimate of mean delays:

  • existing: di = base * ((m+1)/2)i [ m = 3 regular; m = 2 testnet ]
  • full jitter: di = base * mi/2 [ m = 2 per AWS blog ]
  • decorrelated jitter: di = base * (m/2)i [ m = 3 per AWS blog ]

It looks like going to decorrelated jitter will cause the mean delays to increase more slowly than they already are, which seems like a helpful thing.

comment:10 Changed 2 months ago by catalyst

Status: needs_reviewneeds_revision

I see that there is no longer a different multiplier between the test network case and the regular case. That's probably OK given that it looks like the mean and worst-case delay profiles become shorter after this change.

The functions would require fewer parameters and be more readable if we only implement one of the jittered backoff algorithms (probably decorrelated jitter). It would also avoid the large #ifdef blocks.

If someone could check that the math on my estimates above look reasonable that would be good too.

comment:11 Changed 2 months ago by nickm

It seems that the test-network variant was added in 38e3f91c6388bc676c0007b00948, for #20597.

Based on the reasoning there, I think that it shouldn't be necessary to have that variant any more: previously, the maximum delay increase was 4x for production, 3x for test. Now, it's 3x for everyone: so it shouldn't make testnets any worse.

I've pushed an updated bug23816_029, with fixup commits to remove the "full jitter" algorithm.

comment:12 Changed 2 months ago by nickm

Status: needs_revisionneeds_review

comment:13 Changed 2 months ago by catalyst

Status: needs_reviewmerge_ready

Thanks! The updated patches look good to me. There are few minor nits with spacing around operators (e.g., INT_MAX/3, base_delay+1, low_bound=0, high_bound=max_delay) that you may wish to fix before merging, but they're probably not worth another round of review.

I wrote some rough simulations in Python and they do show a significant improvement in terms of delays growing less quickly with the new algorithm.

comment:14 Changed 2 months ago by nickm

Keywords: review-group-24 removed
Milestone: Tor: 0.3.2.x-finalTor: 0.3.1.x-final

Okay! I've made a new bug23816_029_squashed with the fixup commits squashed, and I've merged it to maint-0.3.2 and forward. The merge commit was e5a83062ed4e9e6b908efc6b75f13ab269f97377 -- we needed a little hacking to get the tests to pass again. Marking for possible backport.

Note: See TracTickets for help on using tickets.