Opened 7 years ago

Closed 6 years ago

#8235 closed defect (fixed)

pathbias_count_build_success()

Reported by: cypherpunks
Owned by: mikeperry
Priority: High
Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor
Version: Tor: 0.2.4.10-alpha
Severity:
Keywords: tor-client, MikePerry201304 024-backport
Cc: mikeperry
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

00:02:09 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (203.784282/203.421875) for guard ...
00:01:52 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (202.784282/202.421875) for guard ...
00:01:46 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (201.784282/201.421875) for guard ...
00:01:01 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (200.784282/200.421875) for guard ...
00:00:58 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (199.784282/199.421875) for guard ...
00:00:41 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (198.784282/198.421875) for guard ...

14:54:21 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (179.568564/158.843750) for guard ...
14:54:15 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (178.568564/158.843750) for guard ...
14:54:03 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (177.568564/156.843750) for guard ...
14:54:02 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (176.568564/155.843750) for guard ...
14:54:02 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (175.568564/155.843750) for guard ...
14:53:53 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (174.568564/154.843750) for guard ...

Tor Version: Tor-0.2.4.10-alpha.
This trend repeats with the bias increasing on both sides.
The guard did not change. I have removed its fingerprint from the excerpt for security reasons; I'll provide it if you need it, however.

Child Tickets

Change History (32)

comment:1 Changed 7 years ago by nickm

Cc: mikeperry added
Keywords: tor-client added
Milestone: Tor: 0.2.4.x-final
Priority: normal → major

comment:2 Changed 7 years ago by mikeperry

Owner: set to mikeperry
Status: new → assigned

FTR, I saw something similar while testing the path use bias code in 0.2.4.9-alpha. There was a source of overcounting during the scaling of our counts in versions prior to 0.2.4.10-alpha, which may be what caused your high counts here. It's possible the miscounting began while you were running one of those earlier versions.

If you want to start fresh, stop your tor daemon and run 'grep -v "EntryGuardPath" state > state.new ; mv state.new state' in your Tor's data directory, then start your tor daemon again. This will clear out the old path bias data.

If it happens again after that, we might need more general information about how you're using that Tor daemon (as a hidden service server? with a tor controller other than Vidalia? as a relay? with TBB-alpha? with a custom client app/script?) so we can try to reproduce here. But man, the state machine changes in tor 0.2.4.10 really should make it impossible to count the same circuit construction twice, and should have eliminated the error source during scaling from 0.2.4.9 and earlier...

comment:3 in reply to:  2 Changed 7 years ago by cypherpunks

Replying to mikeperry:

If you want to start fresh, stop your tor daemon and run 'grep -v "EntryGuardPath" state > state.new ; mv state.new state' in your Tor's data directory, then start your tor daemon again. This will clear out the old path bias data.

Hi mikeperry,
The issue has not shown itself again after following those steps. Here's hoping, eh? :)

comment:4 Changed 7 years ago by nickm

Resolution: worksforme
Status: assigned → closed

Here's hoping! Closing this as "worksforme" for now. Please reopen the ticket if it comes back. Thanks!

comment:5 Changed 7 years ago by cypherpunks

Resolution: worksforme
Status: closed → reopened

Entirely different system but OP here:

17:57:21 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (298.962893/220.216796) for guard ...
17:47:38 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (288.734345/199.499510) for guard ...
17:46:03 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (287.734345/198.499510) for guard ...
17:40:13 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (297.962893/219.216796) for guard ...
17:40:07 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (296.962893/219.216796) for guard ...
17:39:55 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (295.962893/217.216796) for guard ...
17:39:54 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (294.962893/217.216796) for guard ...
17:39:54 [NOTICE] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (286.734345/197.499510) for guard ...

This continued. The guard DID change this time, alternating between three guards.
The problem has now arisen on two different boxes, so possibly a real issue?

comment:6 Changed 7 years ago by mikeperry

You're sure this is from a clean state file with 0.2.4.10 for this other system, too?

If so, is there anything in common with these three systems that we can use to reproduce? I'm not seeing this on my systems.

comment:7 Changed 7 years ago by mikeperry

Another thing you could try is to see if you are getting any Bug log messages about pathbias. I believe most of them are at info level. You can enable them with a torrc line like:

Log [bug]info notice file /var/lib/tor/tor.log

comment:8 Changed 7 years ago by nickm

Status: reopened → assigned

mike, I think that if this happens at startup, we should just assume we have an old bad state file, clip the success count to the maximum, and not warn. That way if it happens later while Tor is running, we'll know that there's another bug. Sound plausible?
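
The clipping idea nickm describes might look roughly like this (a minimal sketch only, not Tor's actual code; the struct and field names here are hypothetical):

/* Minimal sketch of the "clip at startup" idea above -- not the actual Tor
 * code.  The struct and field names are hypothetical. */
typedef struct {
  double circ_attempts;
  double circ_successes;
} pb_counts_t;

/* Called when pathbias counts are loaded from the state file.  If an older,
 * overcounting Tor wrote the file, quietly clip successes to attempts
 * instead of warning; a warning that appears later at runtime then points
 * to a genuinely new bug rather than a stale state file. */
static void
pb_clip_counts_from_state(pb_counts_t *pb)
{
  if (pb->circ_successes > pb->circ_attempts)
    pb->circ_successes = pb->circ_attempts;
}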

comment:9 Changed 6 years ago by mikeperry

This sounds like a good plan. I will have a branch soon.

comment:10 Changed 6 years ago by mikeperry

Status: assigned → needs_review

Pushed to mikeperry/bug8235-diagnostic (585c3108e6bddcc7a42750129cfc2392adf7e27c). I also added an additional check to the scaling, just in case there are still more issues hiding there.

Still untested; I am about to start testing it.

comment:11 Changed 6 years ago by nickm

This is going to want to get rebased back onto 0.2.4. Also the changes file should probably mention #8235. Also, "Changes File" is not a cool commit msg. Otherwise appears plausible. Let me know once it's tested.

comment:12 Changed 6 years ago by mikeperry

Keywords: MikePerry201304 added

If you are still doing tor_next branches, this might be a good candidate to throw on there just to see if anything obvious breaks. It's going to be a few days before I'll be able to test this thoroughly anyway.

comment:13 Changed 6 years ago by nickm

It's in the latest tor-next. It doesn't merge cleanly into 0.2.4, since it isn't based on 0.2.4. I don't know if there are people running those though. We should dogfood.

comment:14 Changed 6 years ago by andrea

Status: needs_review → needs_revision

Begin code review:

  • 2e1dd2d9d0ab11160ab6c45e6923551b50717027 looks fine to me
  • In c29da58ecb58f6513a4a3051144b285bb3653f3e, you're emitting a log message if the counts become inconsistent after scaling but were consistent before (counts_are_sane && ...); if !counts_are_sane, are we certain to have already warned when that condition first occurred, or can this cause it to silently not log? Either change this or say you're sure this is the case, I think. (See the sketch after this list.)
  • 585c3108e6bddcc7a42750129cfc2392adf7e27c looks okay
  • 1afbe225eca4aaf21be1b60d7fc8204d628c28d6 looks okay
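
The pattern andrea is questioning is roughly the following (an illustrative sketch, not the code from c29da58; the names are hypothetical, and the real scaling also adjusts for currently-open circuits, which is where an inconsistency can slip in):

/* Illustrative sketch of a "warn only if scaling broke consistency" check --
 * not the code from c29da58.  Names are hypothetical. */
static void
pb_scale_counts(double *attempts, double *successes,
                double open_attempts, double open_successes, double factor)
{
  /* Were the counts consistent before scaling? */
  const int counts_are_sane = (*successes <= *attempts);

  /* Scale only the closed portion of each count, then add the circuits
   * that are still open back in unscaled. */
  *attempts = (*attempts - open_attempts) * factor + open_attempts;
  *successes = (*successes - open_successes) * factor + open_successes;

  if (counts_are_sane && *successes > *attempts) {
    /* Scaling itself introduced the inconsistency; a pre-existing
     * inconsistency should already have been logged elsewhere, e.g.:
     * log_notice(LD_BUG, "Scaling has mangled pathbias counts ..."); */
  }
}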

comment:15 in reply to:  14 Changed 6 years ago by mikeperry

Replying to andrea:

Begin code review:

  • 2e1dd2d9d0ab11160ab6c45e6923551b50717027 looks fine to me
  • In c29da58ecb58f6513a4a3051144b285bb3653f3e, you're emitting a log message if the counts become inconsistent after scaling but were consistent before (counts_are_sane && ...); if !counts_are_sane, are we certain to have already warned when that condition first occurred, or can this cause it to silently not log? Either change this or say you're sure this is the case, I think.

I don't believe the checks in this particular commit will silence any log messages. The changes to entrynodes.c in 2e1dd2d9 might silence some, but they are meant to cover the cases where previous tor versions had overcounting bugs that this one does not.

Should I perhaps promote 2e1dd2d9 to notice level? It might be nice to know if that keeps happening... I can add a version check in there for 0.2.4.10-alpha I guess (assuming that is easy).

Otherwise, I think these log messages will at least get us closer to solving this bug. Of course, I don't claim that any of these commits solve it for sure as of yet. (That's why the changes file is so gimpy and doesn't mention any solved bugs or versions).

comment:16 Changed 6 years ago by mikeperry

Status: needs_revision → needs_review

Ok, I pushed a fixup commit that changes 2e1dd2d9 to emit a notice message if the state file was written by 0.2.4.12-alpha or above (on the assumption that this will get merged in time to make 0.2.4.12-alpha).

The idea is that if 0.2.4.12-alpha or above wrote the state file, it should have already clipped the overcounting issue when it started up, and if we still see evidence of overcounting in such newer statefiles, the overcounting bug is still hiding somewhere.

The fixup commit is 2d8f8b2f86aad22e2bbf6d110814feace5c03094 on mikeperry/bug8235-diagnostic.
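
The version gate described here might look roughly like this (a sketch only, not the fixup commit itself; whether the real patch uses tor_version_as_new_as(), Tor's existing version-comparison helper, in this way is an assumption):

/* Sketch of the version-gated notice described above -- illustrative only,
 * not the actual fixup commit.  state_version stands in for the TorVersion
 * string recorded in the state file. */
int tor_version_as_new_as(const char *platform, const char *cutoff);

static void
pb_check_use_counts(const char *state_version,
                    double *use_successes, double use_attempts)
{
  if (*use_successes > use_attempts) {
    if (state_version &&
        tor_version_as_new_as(state_version, "0.2.4.12-alpha")) {
      /* A Tor that already clips at startup wrote this state file, yet the
       * counts are still inconsistent: the overcounting bug is still hiding
       * somewhere, so this is worth a notice. */
      /* log_notice(LD_BUG, "Unexpectedly high use successes counts ..."); */
    } else {
      /* Old state file: clip quietly instead of warning. */
      *use_successes = use_attempts;
    }
  }
}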

comment:17 Changed 6 years ago by nickm

+        log_notice(LD_BUG, "Unexpectedly high use successes counts (%f/%f) "
+                 "for guard %s=%s",

I think this is the format we were moving away from. Other than that, looks okay to me. Will fix, rebase, merge.

comment:18 Changed 6 years ago by nickm

(Actually it looks like there are quite a few nickname=id entries there. I'll fix those in a separate commit.)

comment:19 Changed 6 years ago by nickm

Keywords: 024-backport added
Milestone: Tor: 0.2.4.x-final → Tor: 0.2.5.x-final
Status: needs_review → needs_information

Squashed and rebased in nickm/bug8235-diagnostic-rebased ; merged into maint-0.2.4 and master.

Now that the diagnostic branch is in, I'm putting this in 0.2.5 as needs_information with a backport tag.

comment:20 Changed 6 years ago by mikeperry

Ok. We have to remember to update that entrynodes.c version check for whatever 0.2.4.x release we actually backport this into.

I would also prefer sooner rather than later, of course, because these logs will clarify any further bug reports from people who are potentially using state files created before the bug was fixed in 0.2.4.10-alpha (as we suspect for this bug and #8580). Otherwise we're just going to have to dup those to this ticket on the assumption that an old tor version caused their issue.

comment:21 in reply to:  20 Changed 6 years ago by nickm

Replying to mikeperry:

Ok. We have to remember to update that entrynodes.c version check for whatever 0.2.4.x release we actually backport this into.

It has been merged into 0.2.4; no need to change the version. See above where I said "merged into maint-0.2.4". I had to rebase it, because it was built on master.

It's not the branch I just merged that is getting backported. The "024-backport" here means "If we find out that there is a fixable bug here, then consider backporting the fix".

comment:22 Changed 6 years ago by alphawolf

Was this supposed to be fixed as of 0.2.4.10?  I *think* that's what was said, but I got lost in all the "commits".  Anyway, I experienced this issue with version 0.2.4.11-alpha.  I stopped the tor daemon, ran 'grep -v "EntryGuardPath" state > state.new ; mv state.new state' in the tor data directory, and then started the tor daemon. After approximately 19.5 of uptime, I experienced the same message repeatedly in the logs. I am running as both a relay and a hidden service. I believe the service tends to have *extremely* long lived connections (days), if that makes a difference. In addition, the node accepts socks connections from the local network, and the software that is proxied also tends to have extremely long-lived circuits.

I have just upgraded to 0.2.4.12-alpha via the torproject repo for Debian Wheezy.  I did not reset the state file again before 0.2.4.12 started up.  Let me know if I should do so now, or just report here if the bug recurs regardless.

comment:23 Changed 6 years ago by alphawolf

Issue began again after 43 minutes of uptime with 0.2.4.12-alpha.  Excerpt of the log below.  I left some of the (seemingly) extraneous log items in there in case they turn out to be relevant.

Apr 19 05:36:45.000 [notice] pathbias_scale_close_rates(): Bug: Scaling has mangled pathbias counts to 164.500000/151.000000 (38/1 open) for guard...
Apr 19 05:36:48.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (165.500000/153.000000) for guard...
Apr 19 05:36:52.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (166.500000/155.000000) for guard...
Apr 19 05:36:53.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (167.500000/155.000000) for guard...
Apr 19 05:36:55.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
Apr 19 05:36:56.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (168.500000/156.000000) for guard...
Apr 19 05:36:57.000 [notice] Closing stream for '[scrubbed].onion': hidden service is unavailable (try again later).
Apr 19 05:37:21.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (169.500000/158.000000) for guard...
Apr 19 05:37:21.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (170.500000/158.000000) for guard...
Apr 19 05:37:27.000 [warn] Didn't recognize cell, but circ stops here! Closing circ.
Apr 19 05:37:27.000 [warn] circuit_receive_relay_cell (forward) failed. Closing.
Apr 19 05:37:36.000 [notice] We tried for 15 seconds to connect to '[scrubbed]' using exit ... at .... Retrying on a new circuit.
Apr 19 05:37:37.000 [notice] Closing stream for '[scrubbed].onion': hidden service is unavailable (try again later).
Apr 19 05:37:46.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (171.500000/159.000000) for guard...
Apr 19 05:37:49.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (172.500000/160.000000) for guard...
Apr 19 05:37:52.000 [notice] pathbias_count_build_success(): Bug: Unexpectedly high successes counts (173.500000/161.000000) for guard...

comment:24 Changed 6 years ago by nickm

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.4.x-final
Status: needs_information → new

comment:25 Changed 6 years ago by nickm

This looks like the important line:

Apr 19 05:36:45.000 [notice] pathbias_scale_close_rates(): Bug: Scaling has mangled pathbias counts to 164.500000/151.000000 (38/1 open) for guard...

comment:26 Changed 6 years ago by mikeperry

Status: new → needs_review

Great! I was able to use those values to work through pathbias_scale_close_rates() on paper and spot the flaw. It should be fixed in mikeperry/bug8235-fix, but it would be nice if someone who can reproduce this easily can confirm. Here's how you build that branch:

git clone https://git.torproject.org/tor.git tor.git
cd tor.git
git remote add mikeperry https://git.torproject.org/mikeperry/tor.git
git fetch mikeperry
git branch --track bug8235-fix mikeperry/bug8235-fix
git checkout bug8235-fix
./autogen.sh
./configure
make && make install

comment:27 Changed 6 years ago by mikeperry

I just pushed another commit to bump that version check in entrynodes.c to 0.2.4.13-alpha, since we now know that the bug exists in 0.2.4.12-alpha, too (so no need to issue notices at startup for statefiles from then).

70552787aae864c4d3b69f7bc89428b82265ba03 is the new head of mikeperry/bug8235-fix with this commit.

comment:28 Changed 6 years ago by alphawolf

6 hours up, no error...

Is there some way to confirm this other than "wait"? If I pasted just the EntryGuardPathBias and EntryGuardPathUseBias lines from the state file, would you be able to tell if the scaling had occurred without producing an error?

comment:29 Changed 6 years ago by mikeperry

Well, assuming you stripped those out again, the one thing I would look for in your state file is pathbias counts with decimal values in them (which would indicate scaling).

Otherwise, you can do something like:

 Log [circ]info notice file /var/lib/tor/tor.log

in your torrc to log circuit activity. You can then grep your logs for "Scaled pathbias counts".

However, in both cases this will only tell you if scaling happened. Triggering the overcounting bug itself is a function of how close your success rate is to 100%, and how many open circuits you have at any given time.

comment:30 Changed 6 years ago by nickm

this branch is on master again.

comment:31 Changed 6 years ago by nickm

Other than that, looks okay. Rebasing and merging.

comment:32 Changed 6 years ago by nickm

Resolution: fixed
Status: needs_review → closed