Opened 2 years ago

Closed 15 months ago

#23816 closed defect (fixed)

Use exponential backoff with jitter and/or tune its parameters

Reported by: nickm Owned by: nickm
Priority: Medium Milestone: Tor: 0.3.2.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: 031-backport-maybe, 029-backport-maybe
Cc: catalyst Actual Points:
Parent ID: Points:
Reviewer: catalyst Sponsor: SponsorV

Description

Child Tickets

Change History (17)

comment:1 Changed 2 years ago by nickm

Owner: set to nickm
Status: newaccepted

comment:2 Changed 2 years ago by nickm

They recommend one of two approaches. "Full Jitter":

sleep = random_between(0, min(cap, base * 2 ** attempt))

Or "Decorrelated Jitter":

sleep = min(cap, random_between(base, sleep * 3))

comment:3 Changed 2 years ago by nickm

Status: acceptedneeds_review

bug23816_029 has an implementation of the two suggested algorithms, for merge into 0.2.9 or later.

I don't know what the changes file should say here. :)

comment:4 Changed 2 years ago by nickm

Keywords: review-group-24 added

review-group-24 is now open.

comment:5 Changed 2 years ago by nickm

Sponsor: SponsorV

comment:6 Changed 23 months ago by catalyst

Reviewer: catalyst

comment:7 Changed 23 months ago by catalyst

Summary: Use expontial backoff with jitter and/or tune its parametersUse exponetial backoff with jitter and/or tune its parameters

I'm going to start by summarizing my comparison of the algorithms.

The existing algorithm is a weird hybrid of decorrelated jitter and equal jitter. Each delay is at least the size of the preceding one, plus a random value capped at a fixed multiple of the preceding delay. This means in the worst case, di+1 = di + di * 3 + 1 = di * 4 + 1, or approximately, di = 4i.

For full jitter, the worst case is di = base * mi.

For decorrelated jitter, the worst case is also di = base *mi but any small variation in an early delay step magnifies in later steps.

The examples in the blog show m=2 for full jitter and m=3 for decorrelated jitter. I'm not sure why they chose that variation.

The existing algorithm has some of the drawbacks of equal jitter, such as missing some possible time slots because it tries to always include a fixed delay component that's longer than the previous, so either decorrelated jitter or full jitter would be an improvement.

comment:8 Changed 23 months ago by catalyst

Summary: Use exponetial backoff with jitter and/or tune its parametersUse exponential backoff with jitter and/or tune its parameters

comment:9 Changed 23 months ago by catalyst

rough estimate of mean delays:

  • existing: di = base * ((m+1)/2)i [ m = 3 regular; m = 2 testnet ]
  • full jitter: di = base * mi/2 [ m = 2 per AWS blog ]
  • decorrelated jitter: di = base * (m/2)i [ m = 3 per AWS blog ]

It looks like going to decorrelated jitter will cause the mean delays to increase more slowly than they already are, which seems like a helpful thing.

comment:10 Changed 23 months ago by catalyst

Status: needs_reviewneeds_revision

I see that there is no longer a different multiplier between the test network case and the regular case. That's probably OK given that it looks like the mean and worst-case delay profiles become shorter after this change.

The functions would require fewer parameters and be more readable if we only implement one of the jittered backoff algorithms (probably decorrelated jitter). It would also avoid the large #ifdef blocks.

If someone could check that the math on my estimates above look reasonable that would be good too.

comment:11 Changed 23 months ago by nickm

It seems that the test-network variant was added in 38e3f91c6388bc676c0007b00948, for #20597.

Based on the reasoning there, I think that it shouldn't be necessary to have that variant any more: previously, the maximum delay increase was 4x for production, 3x for test. Now, it's 3x for everyone: so it shouldn't make testnets any worse.

I've pushed an updated bug23816_029, with fixup commits to remove the "full jitter" algorithm.

comment:12 Changed 23 months ago by nickm

Status: needs_revisionneeds_review

comment:13 Changed 23 months ago by catalyst

Status: needs_reviewmerge_ready

Thanks! The updated patches look good to me. There are few minor nits with spacing around operators (e.g., INT_MAX/3, base_delay+1, low_bound=0, high_bound=max_delay) that you may wish to fix before merging, but they're probably not worth another round of review.

I wrote some rough simulations in Python and they do show a significant improvement in terms of delays growing less quickly with the new algorithm.

comment:14 Changed 23 months ago by nickm

Keywords: review-group-24 removed
Milestone: Tor: 0.3.2.x-finalTor: 0.3.1.x-final

Okay! I've made a new bug23816_029_squashed with the fixup commits squashed, and I've merged it to maint-0.3.2 and forward. The merge commit was e5a83062ed4e9e6b908efc6b75f13ab269f97377 -- we needed a little hacking to get the tests to pass again. Marking for possible backport.

comment:15 Changed 20 months ago by nickm

I am leaning "no backport" on this one because the old behavior, though poor, wasn't actually breaking anything that I'm aware of.... and because the changes are larger than I'd usually like to backport without a compelling bug. Please let me know if I'm wrong there.

comment:16 Changed 16 months ago by teor

Keywords: 031-backport-maybe 029-backport-maybe added

comment:17 Changed 15 months ago by nickm

Milestone: Tor: 0.3.1.x-finalTor: 0.3.2.x-final
Resolution: fixed
Status: merge_readyclosed

No backport, per reasons discussed above.

Note: See TracTickets for help on using tickets.