Opened 7 years ago

Closed 6 years ago

#8235 closed defect (fixed)

pathbias_count_build_success()

Reported by: cypherpunks
Owned by: mikeperry
Priority: High
Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor
Version: Tor: 0.2.4.10-alpha
Severity:
Keywords: tor-client, MikePerry201304 024-backport
Cc: mikeperry
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

00:02:09 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (203.784282/203.421875) for guard ...
00:01:52 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (202.784282/202.421875) for guard ...
00:01:46 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (201.784282/201.421875) for guard ...
00:01:01 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (200.784282/200.421875) for guard ...
00:00:58 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (199.784282/199.421875) for guard ...
00:00:41 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (198.784282/198.421875) for guard ...

14:54:21 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (179.568564/158.843750) for guard ...
14:54:15 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (178.568564/158.843750) for guard ...
14:54:03 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (177.568564/156.843750) for guard ...
14:54:02 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (176.568564/155.843750) for guard ...
14:54:02 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (175.568564/155.843750) for guard ...
14:53:53 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (174.568564/154.843750) for guard ...

Tor Version: Tor-0.2.4.10-alpha.
This trend repeats with the bias increasing on both sides.
The guard did not change. I have removed its fingerprint from the excerpt for security reasons; I'll provide it if you need it, however.

Child Tickets

Change History (32)

comment:1 Changed 7 years ago by nickm

Cc: mikeperry added
Keywords: tor-client added
Milestone: Tor: 0.2.4.x-final
Priority: normal → major

comment:2 Changed 7 years ago by mikeperry

Owner: set to mikeperry
Status: new → assigned

FTR, I saw something similar while testing the path use bias code in 0.2.4.9-alpha. There was a source of overcounting during the scaling of our counts in versions prior to 0.2.4.10-alpha, which may be what caused your high counts here. It's possible the miscounting began while you were running one of those earlier versions.

If you want to start fresh, stop your tor daemon and run 'grep -v "EntryGuardPath" state > state.new ; mv state.new state' in your Tor's data directory, then start your tor daemon again. This will clear out the old path bias data.

If it happens again after that, we might need more general information about how you're using that Tor daemon (as a hidden service server? with a tor controller other than Vidalia? as a relay? with TBB-alpha? with a custom client app/script?) so we can try to reproduce here. But man, the state machine changes in tor 0.2.4.10 really should make it impossible to count the same circuit construction twice, and should have eliminated the error source during scaling from 0.2.4.9 and earlier...

comment:3 in reply to:  2 Changed 7 years ago by cypherpunks

Replying to mikeperry:

If you want to start fresh, stop your tor daemon and run 'grep -v "EntryGuardPath" state > state.new ; mv state.new state' in your Tor's data directory, then start your tor daemon again. This will clear out the old path bias data.

Hi mikeperry,
The issue has not shown itself again after following those steps. Here's hoping, eh? :)

comment:4 Changed 7 years ago by nickm

Resolution: worksforme
Status: assigned → closed

Here's hoping! Closing this as "worksforme" for now. Please reopen the ticket if it comes back. Thanks!

comment:5 Changed 7 years ago by cypherpunks

Resolution: worksforme
Status: closed → reopened

Entirely different system but OP here:

17:57:21 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (298.962893/220.216796) for guard ...
17:47:38 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (288.734345/199.499510) for guard ...
17:46:03 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (287.734345/198.499510) for guard ...
17:40:13 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (297.962893/219.216796) for guard ...
17:40:07 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (296.962893/219.216796) for guard ...
17:39:55 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (295.962893/217.216796) for guard ...
17:39:54 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (294.962893/217.216796) for guard ...
17:39:54 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (286.734345/197.499510) for guard ...

This continued. The guard DID change this time, alternating between three guards.
The problem has now arisen on two different boxes, so possibly a real issue?

comment:6 Changed 7 years ago by mikeperry

You're sure this is from a clean state file with 0.2.4.10 for this other system, too?

If so, is there anything in common with these three systems that we can use to reproduce? I'm not seeing this on my systems.

comment:7 Changed 7 years ago by mikeperry

Another thing you could try is to see if you are getting any Bug log messages about pathbias. I believe most of them are at info level. You can enable them with a torrc line like:

Log [bug]info notice file /var/lib/tor/tor.log

comment:8 Changed 7 years ago by nickm

Status: reopened → assigned

mike, I think that if this happens at startup, we should just assume we have an old bad state file, clip the success count to the maximum, and not warn. That way if it happens later while Tor is running, we'll know that there's another bug. Sound plausible?
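
The clipping idea nickm describes might look roughly like this (a minimal sketch only, not Tor's actual code; the struct and field names here are hypothetical):

/* Minimal sketch of the "clip at startup" idea above -- not the actual Tor
 * code.  The struct and field names are hypothetical. */
typedef struct {
  double circ_attempts;
  double circ_successes;
} pb_counts_t;

/* Called when pathbias counts are loaded from the state file.  If an older,
 * overcounting Tor wrote the file, quietly clip successes to attempts
 * instead of warning; a warning that appears later at runtime then points
 * to a genuinely new bug rather than a stale state file. */
static void
pb_clip_counts_from_state(pb_counts_t *pb)
{
  if (pb->circ_successes > pb->circ_attempts)
    pb->circ_successes = pb->circ_attempts;
}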

comment:9 Changed 6 years ago by mikeperry

This sounds like a good plan. I will have a branch soon.

comment:10 Changed 6 years ago by mikeperry

Status: assigned → needs_review

Pushed to mikeperry/bug8235-diagnostic (585c3108e6bddcc7a42750129cfc2392adf7e27c). I also added an additional check to the scaling, just in case there are still more issues hiding there.

Still untested; I am about to start testing it.

comment:11 Changed 6 years ago by nickm

This is going to want to get rebased back onto 0.2.4. Also the changes file should probably mention #8235. Also, "Changes File" is not a cool commit msg. Otherwise appears plausible. Let me know once it's tested.

comment:12 Changed 6 years ago by mikeperry

Keywords: MikePerry201304 added

If you are still doing tor_next branches, this might be a good candidate to throw on there just to see if anything obvious breaks. It's going to be a few days before I'll be able to test this thoroughly anyway.

comment:13 Changed 6 years ago by nickm

It's in the latest tor-next. It doesn't merge cleanly into 0.2.4, since it isn't based on 0.2.4. I don't know if there are people running those though. We should dogfood.

comment:14 Changed 6 years ago by andrea

Status: needs_review → needs_revision

Begin code review:

  • 2e1dd2d9d0ab11160ab6c45e6923551b50717027 looks fine to me
  • In c29da58ecb58f6513a4a3051144b285bb3653f3e, you're emitting a log message if the counts become inconsistent after scaling but were consistent before (counts_are_sane && ...); if !counts_are_sane, are we certain to have already warned when that condition first occurred, or can this cause it to silently not log? Either change this or say you're sure this is the case, I think. (See the sketch after this list.)
  • 585c3108e6bddcc7a42750129cfc2392adf7e27c looks okay
  • 1afbe225eca4aaf21be1b60d7fc8204d628c28d6 looks okay
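
The pattern andrea is questioning is roughly the following (an illustrative sketch, not the code from c29da58; the names are hypothetical, and the real scaling also adjusts for currently-open circuits, which is where an inconsistency can slip in):

/* Illustrative sketch of a "warn only if scaling broke consistency" check --
 * not the code from c29da58.  Names are hypothetical. */
static void
pb_scale_counts(double *attempts, double *successes,
                double open_attempts, double open_successes, double factor)
{
  /* Were the counts consistent before scaling? */
  const int counts_are_sane = (*successes <= *attempts);

  /* Scale only the closed portion of each count, then add the circuits
   * that are still open back in unscaled. */
  *attempts = (*attempts - open_attempts) * factor + open_attempts;
  *successes = (*successes - open_successes) * factor + open_successes;

  if (counts_are_sane && *successes > *attempts) {
    /* Scaling itself introduced the inconsistency; a pre-existing
     * inconsistency should already have been logged elsewhere, e.g.:
     * log_notice(LD_BUG, "Scaling has mangled pathbias counts ..."); */
  }
}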

comment:15 in reply to:  14 Changed 6 years ago by mikeperry

Replying to andrea:

Begin code review:

  • 2e1dd2d9d0ab11160ab6c45e6923551b50717027 looks fine to me
  • In c29da58ecb58f6513a4a3051144b285bb3653f3e, you're emitting a log message if the counts become inconsistent after scaling but were consistent before (counts_are_sane && ...); if !counts_are_sane, are we certain to have already warned when that condition first occurred, or can this cause it to silently not log? Either change this or say you're sure this is the case, I think.

I don't believe the checks in this particular commit will silence any log messages. The changes to entrynodes.c in 2e1dd2d9 might silence some, but they are meant to cover the cases where previous tor versions had overcounting bugs that this one does not.

Should I perhaps promote 2e1dd2d9 to notice level? It might be nice to know if that keeps happening... I can add a version check in there for 0.2.4.10-alpha I guess (assuming that is easy).

Otherwise, I think these log messages will at least get us closer to solving this bug. Of course, I don't claim that any of these commits solve it for sure as of yet. (That's why the changes file is so gimpy and doesn't mention any solved bugs or versions).

comment:16 Changed 6 years ago by mikeperry

Status: needs_revision → needs_review

Ok, I pushed a fixup commit that changes 2e1dd2d9 to emit a notice message if the state file was written by 0.2.4.12-alpha or above (on the assumption that this will get merged in time to make 0.2.4.12-alpha).

The idea is that if 0.2.4.12-alpha or above wrote the state file, it should have already clipped the overcounting issue when it started up, and if we still see evidence of overcounting in such newer statefiles, the overcounting bug is still hiding somewhere.

The fixup commit is 2d8f8b2f86aad22e2bbf6d110814feace5c03094 on mikeperry/bug8235-diagnostic.
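
The version gate described here might look roughly like this (a sketch only, not the fixup commit itself; whether the real patch uses tor_version_as_new_as(), Tor's existing version-comparison helper, in this way is an assumption):

/* Sketch of the version-gated notice described above -- illustrative only,
 * not the actual fixup commit.  state_version stands in for the TorVersion
 * string recorded in the state file. */
int tor_version_as_new_as(const char *platform, const char *cutoff);

static void
pb_check_use_counts(const char *state_version,
                    double *use_successes, double use_attempts)
{
  if (*use_successes > use_attempts) {
    if (state_version &&
        tor_version_as_new_as(state_version, "0.2.4.12-alpha")) {
      /* A Tor that already clips at startup wrote this state file, yet the
       * counts are still inconsistent: the overcounting bug is still hiding
       * somewhere, so this is worth a notice. */
      /* log_notice(LD_BUG, "Unexpectedly high use successes counts ..."); */
    } else {
      /* Old state file: clip quietly instead of warning. */
      *use_successes = use_attempts;
    }
  }
}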

comment:17 Changed 6 years ago by nickm

+        log_notice(LD_BUG, "Unexpectedly high use successes counts (%f/%f) "
+                 "for guard %s=%s",

I think this is the format we were moving away from. Other than that, looks okay to me. Will fix, rebase, merge.

comment:18 Changed 6 years ago by nickm

(Actually it looks like there are quite a few nickname=id entries there. I'll fix those in a separate commit.)

comment:19 Changed 6 years ago by nickm

Keywords: 024-backport added
Milestone: Tor: 0.2.4.x-final → Tor: 0.2.5.x-final
Status: needs_review → needs_information

Squashed and rebased in nickm/bug8235-diagnostic-rebased ; merged into maint-0.2.4 and master.

Now that the diagnostic branch is in, I'm putting this in 0.2.5 as needs_information with a backport tag.

comment:20 Changed 6 years ago by mikeperry

Ok. We have to remember to update that entrynodes.c version check for whatever 0.2.4.x release we actually backport this into.

I would also prefer sooner rather than later, of course, because these logs will clarify any further bug reports from people who are potentially using state files created before the bug was fixed in 0.2.4.10-alpha (as we suspect for this bug and #8580). Otherwise we're just going to have to dup those to this ticket on the assumption that an old tor version caused their issue.

comment:21 in reply to:  20 Changed 6 years ago by nickm

Replying to mikeperry:

Ok. We have to remember to update that entrynodes.c version check for whatever 0.2.4.x release we actually backport this into.

It has been merged into 0.2.4; no need to change the version. See above where I said "merged into maint-0.2.4". I had to rebase it, because it was built on master.

It's not the branch I just merged that is getting backported. The "024-backport" here means "If we find out that there is a fixable bug here, then consider backporting the fix".

comment:22 Changed 6 years ago by alphawolf

Was this supposed to be fixed as of 0.2.4.10?  I *think* that's what was said, but I got lost in all the "commits".  Anyway, I experienced this issue with version 0.2.4.11-alpha.  I stopped the tor daemon, ran 'grep -v "EntryGuardPath" state > state.new ; mv state.new state' in the tor data directory, and then started the tor daemon. After approximately 19.5 of uptime, I experienced the same message repeatedly in the logs. I am running as both a relay and a hidden service. I believe the service tends to have *extremely* long lived connections (days), if that makes a difference. In addition, the node accepts socks connections from the local network, and the software that is proxied also tends to have extremely long-lived circuits.

I have just upgraded to 0.2.4.12-alpha via the torproject repo for Debian Wheezy.  I did not reset the state file again before 0.2.4.12 started up.  Let me know if I should do so now, or just report here if the bug recurs regardless.

comment:23 Changed 6 years ago by alphawolf

Issue began again after 43 minutes of uptime with 0.2.4.12-alpha.  Excerpt of the log below.  I left some of the (seemingly) extraneous log items in there in case they turn out to be relevant.

Apr 19 05:36:45.000 [notice] pathbias_scale_close_rates(): Bug: Scaling has mangled pathbias counts to 164.500000/151.000000 (38/1 open) for guard...
Apr 19 05:36:48.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (165.500000/153.000000) for guard...
Apr 19 05:36:52.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (166.500000/155.000000) for guard...
Apr 19 05:36:53.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (167.500000/155.000000) for guard...
Apr 19 05:36:55.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
Apr 19 05:36:56.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (168.500000/156.000000) for guard...
Apr 19 05:36:57.000 [notice] Closing stream for '[scrubbed].onion': hidden service is unavailable (try again later).
Apr 19 05:37:21.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (169.500000/158.000000) for guard...
Apr 19 05:37:21.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (170.500000/158.000000) for guard...
Apr 19 05:37:27.000 [warn] Didn't recognize cell, but circ stops here! Closing circ.
Apr 19 05:37:27.000 [warn] circuit_receive_relay_cell (forward) failed. Closing.
Apr 19 05:37:36.000 [notice] We tried for 15 seconds to connect to '[scrubbed]' using exit ... at .... Retrying on a new circuit.
Apr 19 05:37:37.000 [notice] Closing stream for '[scrubbed].onion': hidden service is unavailable (try again later).
Apr 19 05:37:46.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (171.500000/159.000000) for guard...
Apr 19 05:37:49.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (172.500000/160.000000) for guard...
Apr 19 05:37:52.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (173.500000/161.000000) for guard...

comment:24 Changed 6 years ago by nickm

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.4.x-final
Status: needs_information → new

comment:25 Changed 6 years ago by nickm

This looks like the important line:

Apr 19 05:36:45.000 [notice] pathbias_scale_close_rates(): Bug: Scaling has mangled pathbias counts to 164.500000/151.000000 (38/1 open) for guard...

comment:26 Changed 6 years ago by mikeperry

Status: new → needs_review

Great! I was able to use those values to work through pathbias_scale_close_rates() on paper and spot the flaw. It should be fixed in mikeperry/bug8235-fix, but it would be nice if someone who can reproduce this easily can confirm. Here's how you build that branch:

git clone https://git.torproject.org/tor.git tor.git
cd tor.git
git remote add mikeperry https://git.torproject.org/mikeperry/tor.git
git fetch mikeperry
git branch --track bug8235-fix mikeperry/bug8235-fix
git checkout bug8235-fix
./autogen.sh
./configure
make && make install

comment:27 Changed 6 years ago by mikeperry

I just pushed another commit to bump that version check in entrynodes.c to 0.2.4.13-alpha, since we now know that the bug exists in 0.2.4.12-alpha, too (so no need to issue notices at startup for statefiles from then).

70552787aae864c4d3b69f7bc89428b82265ba03 is the new head of mikeperry/bug8235-fix with this commit.

comment:28 Changed 6 years ago by alphawolf

6 hours up, no error...

Is there some way to confirm this other than "wait"? If I pasted just the EntryGuardPathBias and EntryGuardPathUseBias lines from the state file, would you be able to tell if the scaling had occurred without producing an error?

comment:29 Changed 6 years ago by mikeperry

Well, assuming you stripped those out again, the one thing I would look for in your state file is pathbias counts with decimal values in them (which would indicate scaling).

Otherwise, you can do something like:

 Log [circ]info notice file /var/lib/tor/tor.log

in your torrc to log circuit activity. You can then grep your logs for "Scaled pathbias counts".

However, in both cases this will only tell you if scaling happened. Triggering the overcounting bug itself is a function of how close your success rate is to 100%, and how many open circuits you have at any given time.

comment:30 Changed 6 years ago by nickm

this branch is on master again.

comment:31 Changed 6 years ago by nickm

Other than that, looks okay. Rebasing and merging.

comment:32 Changed 6 years ago by nickm

Resolution: fixed
Status: needs_review → closed