Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#6647 closed defect (fixed)

pathbias_get_scale_factor() is using the wrong default and also causes rounding errors

Reported by: mikeperry Owned by: mikeperry
Priority: High Milestone: Tor: 0.2.3.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: MikePerry201208 023backport tor-client
Cc: Actual Points: 4
Parent ID: Points: 4
Reviewer: Sponsor:

Description

pathbias_get_scale_factor() returns DFLT_PATH_BIAS_SCALE_THRESHOLD instead of DFLT_PATH_BIAS_SCALE_FACTOR as the default value.

This mixup means that every time we hit DFLT_PATH_BIAS_SCALE_THRESHOLD worth of path bias data, we basically totally reset our counts.. Woops.

Also, my analysis from #6135 makes me think I did this whole scaling thing wrong. It should be fixed-point decimal value that we multiply with instead of a divisor. This will allow us more flexibility, but it will also allow us to easily approximate a moving average, which is I think probably what we really want. See #6135 and links for details.

Child Tickets

Change History (15)

comment:1 Changed 7 years ago by mikeperry

Actual Points: 2
Summary: pathbias_get_scale_factor() needs to be redonepathbias_get_scale_factor() is using the wrong default

mikeperry/bug6647. Based off of latest mikeperry/bug6475-noguards for ease of testing.

comment:2 Changed 7 years ago by nickm

Status: newneeds_revision

Needs a changes file.

Can't merge, because I can't merge bug6475-noguards yet.

For 0.2.3, I'd much prefer a smaller change that only fixed the actual bug. This release is supposed to be getting stable, after all -- like, months ago.

comment:3 in reply to:  2 Changed 7 years ago by mikeperry

Replying to nickm:

Needs a changes file.

Can't merge, because I can't merge bug6475-noguards yet.

For 0.2.3, I'd much prefer a smaller change that only fixed the actual bug. This release is supposed to be getting stable, after all -- like, months ago.

Is changing it to a double that complex of a change? The divisor also has other problems such as rounding error issues... It's also easy to see that the double is working (which it has on my test setup for around 1000 circuits now).

comment:4 Changed 7 years ago by mikeperry

FYI: On my amped-up hidden-service based test setup, the new scaling code from this patch has now been exercised 800 times (over a total of ~7000 circuits). The path bias counts are fairly stable for each guard, and there have been no incidents of any log messages from #6475.

comment:5 Changed 7 years ago by mikeperry

Man, last night I hit a rounding error in both the old divisor code *and* this new code. It triggered the "Unexpected success count" log from #6475 in both cases (but I demoted it to notice in the new code). It looks like some other random user just reported a similar case in #6475 last night as well.

I know you're about to stab me Nick, but I will work to fix these logs, and I'm still thinking it is better to do it now on code people are using rather than waiting 6 months for people to start using 0.2.4.x as a client. Hopefully I will eventually assuage your buyer's remorse before it turns into buyer's murderous rage.

comment:6 Changed 7 years ago by mikeperry

Summary: pathbias_get_scale_factor() is using the wrong defaultpathbias_get_scale_factor() is using the wrong default and also causes rounding errors

Ok, I've thought about this for a while. I see two ways forward on this front:

  1. We can go back to the divisor, and only scale 'if ((first_hop % divisor) == 0 && (success_count % divisor) == 0)'. For this route, a default divisor of 2 makes the most sense, I think.
  1. We store a decimal value for the PathBias rate in the state file and actually do a moving average for real instead of screwing around in integer land.

I think my stabby sense is telling me Nick would prefer the modulus route for 0.2.3.x for now, since it is the least amount of code change from the original implementation. I think we should also use a divisor of 2 instead of 4. That way, if both values of success_count and first_hop are even, we scale. Otherwise, we wait until they are both even.

Again, I'm really sorry my fast-and-loose approach to software development has exposed you to so much noise on this front, Nick... I promise to do the floating point implementation and write a real proposal for the feature for 0.2.4.x.

comment:7 in reply to:  5 Changed 7 years ago by nickm

Nick does not currently expect to merge any fix for this whole mess in 0.2.3 other than "downgrade the warnings."

comment:8 Changed 7 years ago by mikeperry

Mike currently has a fix for this bug and the log message issues in the rebased version of mikeperry/6647. I wiped the logs before running that code, and since then, it has survived 30000 hidden service side circuits, 60000 client side circuits, and 1000 pathbias_get_scale_factor() calls without a single LD_BUG log message from #6475.

comment:9 Changed 7 years ago by nickm

I'll merge this into 0.2.4.x once it has a changes file. Please write changes files!

comment:10 Changed 7 years ago by mikeperry

Pushed a changes file on to mikeperry/bug6647.

Is there a backport plan after it sits on 0.2.4.x for a while? It's pretty clearly an obvious 0.2.3.x bug.

comment:11 Changed 7 years ago by nickm

Resolution: fixed
Status: needs_revisionclosed

Merged bug6647 into master.

The backport plan is to release and stabilize 0.2.4.x as fast as possible so 0.2.3.x is the stable release for not too long.

comment:12 Changed 7 years ago by nickm

Keywords: 023backport added

comment:13 Changed 7 years ago by mikeperry

Actual Points: 24
Points: 4

Current count is 76k client side circuits, 42k hidden service side circuits, 1500 scaling events, and 0 LD_BUG log messages.

I'm going to stop sapping CPU from the Tor network now, since it seems clear my stress testing is not going to hit any log messages.

comment:14 Changed 7 years ago by nickm

Keywords: tor-client added

comment:15 Changed 7 years ago by nickm

Component: Tor ClientTor
Note: See TracTickets for help on using tickets.