I'm seeing this "low circuit success rate" warning on my client sometimes when on a flaky wireless network connection.
I'm guessing that the problem is that we're blaming the guard when we should in fact be blaming the network.
We should a) figure out how to improve the log message so we get more hints about where our bug is, and/or b) just straight-out hunt for it.
Mike suggests raising the number of circuits we have to make before we'll spit out the warning. I expect that will mask the bug, but not find or resolve it.
If there is a bug with how we count circuits, what is it about changing the minimum number of circuits that could possibly change this bug?
Also, were you using weird torrc options, testing hidden services, or doing anything atypical? I seem to recall wacky torrc options of yours causing problems in the past.
One thing we could do is record in the log line what fraction of the failures were timeouts, as well as the current CBT 'right-censored' timeout value (which the path bias code uses instead of the 80% timeout value). These might be useful datapoints from people who only give us the one log line and then disappear.
I have a branch with these log updates in mikeperry/209-path-bias-changes. I also updated the proposal to match in my torspec remote mikeperry/path-bias-tuning.
pathbias_count_timeout() should be declared in circuitbuild.h, not or.h. We shouldn't be declaring functions in or.h any more.
"crited" isn't a word; what do you mean it to mean? If "crit" is the verb, I think "critted" is the past participle, but...?
You can't take something out of or_options_t without adding an OBSOLETE() entry for it, or an alias for it in the aliases table. Otherwise every torrc that used it would suddenly break.
All functions and fields and options need documentation. I shouldn't have to go read the proposal to find out what a "mult factor" is; the comments and the manpage should explain it.
I have a hard time understanding why the code does "%" in "((mult_factor * foo) % scale_factor) == 0". I guess that you're trying to avoid integer truncation, but won't this prevent scaling entirely? Suppose scale_factor is 100. It seems to me that unless first_hops and circuit_successes are both divisible by 100 at the same time, they'll never get scaled. Can't we just reserve scaling for when the values are "big enough" that integer truncation won't matter?
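If I'm reading the branch right, the pattern in question is roughly this (a paraphrase, not the literal code):

    /* Both counters must divide evenly at the same time, or no scaling
     * happens at all.  With scale_factor == 100 that is almost never
     * true. */
    if (((mult_factor * e->first_hops) % scale_factor) == 0 &&
        ((mult_factor * e->circuit_successes) % scale_factor) == 0) {
      e->first_hops = (mult_factor * e->first_hops) / scale_factor;
      e->circuit_successes = (mult_factor * e->circuit_successes) / scale_factor;
    }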
LD_PROTOCOL might not be appropriate here; that's for protocol violations, not for malfunctioning nodes.
The "your guard %s=%s" code should probably get refactored to use the same format as node_describe , and probably to be an independent function.
> I have a hard time understanding why the code does "%" in "((mult_factor * foo) % scale_factor) == 0". I guess that you're trying to avoid integer truncation, but won't this prevent scaling entirely? Suppose scale_factor is 100. It seems to me that unless first_hops and circuit_successes are both divisible by 100 at the same time, they'll never get scaled. Can't we just reserve scaling for when the values are "big enough" that integer truncation won't matter?
In my testing, any amount of integer truncation eventually led to the accumulation of enough roundoff error for us to either overcount or undercount. It was only a matter of time for larger values.
I commented the scale_factor and mult_factor functions to make clear that we only scale when there is no integer truncation, and that this means we should be careful when changing those parameters.
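As a toy standalone demo of that accumulation (not tor code):

    #include <stdio.h>

    int
    main(void)
    {
      /* True success rate is ~0.669; watch it wobble as truncating
       * division drops the remainders on each scaling pass. */
      unsigned attempts = 151, successes = 101;
      int i;
      printf("start: %.4f\n", (double)successes / attempts);
      for (i = 0; i < 3; i++) {
        attempts /= 2;
        successes /= 2;
        printf("after scale %d: %.4f\n", i + 1,
               (double)successes / attempts);
      }
      return 0;
    }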
The "your guard %s=%s" code should probably get refactored to use the same format as node_describe , and probably to be an independent function.
I looked at node_describe() and router_describe(), and they seemed like overkill for this. Both functions output a bunch of info we simply do not have available for an entry_guard_t (since we might not have a node_t or a routerinfo_t for the guard). Effectively I'd be writing a custom utility function just to do "%s=%s" in this case, which seemed silly.
Everything else you mentioned has been pushed to that branch. I'll try to test a build with this code and #3443 soon.
Trac: Status: needs_revision to needs_review; Type: defect to enhancement; Keywords: N/A deleted, MikePerry201210 added; Actualpoints: N/A to 8
Commenting here to checkpoint my progress for tomorrow.
I appear to have found at least one bug in testing so far: there is a discrepancy between the PathBias timeout counting and the CBT right-censored CircuitBuildAbandonedCount values.
I think the error is in circuit_expire_building(), in the difference between the calls to circuit_build_times_count_close() and pathbias_count_timeout(). I still haven't fully conceptualized all of the various cases of circuit_expire_building(), though. Man, is it a hairy mess. :/
Also just hit: src/or/networkstatus.c:2216 get_net_param_from_list: Assertion max_val > min_val failed; aborting.
It looks like get_net_param_from_list() should allow min_val == max_val based on its codepaths, but I'll go ahead and change my code instead. Easier than filing a new bug+changelog entry.
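For what it's worth, min_val == max_val is coherent for a clamp (it just pins the parameter to a single value); a hypothetical sketch of the check I'd expect:

    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical helper: only min_val > max_val is nonsensical. */
    static int32_t
    clamp_param(int32_t val, int32_t min_val, int32_t max_val)
    {
      assert(min_val <= max_val);
      if (val < min_val)
        return min_val;
      if (val > max_val)
        return max_val;
      return val;
    }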
> I appear to have found at least one bug in testing so far: there is a discrepancy between the PathBias timeout counting and the CBT right-censored CircuitBuildAbandonedCount values.
> I think the error is in circuit_expire_building(), in the difference between the calls to circuit_build_times_count_close() and pathbias_count_timeout(). I still haven't fully conceptualized all of the various cases of circuit_expire_building(), though. Man, is it a hairy mess. :/
This discrepancy turns out to be due to the hidden service cases in that function. There is no discrepancy for normal exit circuits. It is also likely the source of potentially deeper issues with proper timeout usage for hidden service circuits.
> In my testing, any amount of integer truncation eventually led to the accumulation of enough roundoff error for us to either overcount or undercount. It was only a matter of time for larger values.
> I commented the scale_factor and mult_factor functions to make clear that we only scale when there is no integer truncation, and that this means we should be careful when changing those parameters.
Hm. I am still not comfortable about this. Can we think about what this implies for possible values for those scaling factors? It seems possible (though unlikely) to wind up with a sequence of events in which we can never scale downwards, even with scale_factor == 2. If you're worried about integer roundoff, mightn't the answer be instead to make sure that the scale threshold is high, and to perform rounding rather than truncating division?
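Concretely, the rounding version is a one-line change (sketch, for unsigned counters):

    /* Truncating: silently drops up to scale_factor - 1 every time. */
    count = count / scale_factor;

    /* Round-to-nearest: halves the worst-case error and doesn't bias
     * consistently downward. */
    count = (count + scale_factor / 2) / scale_factor;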
> This discrepancy turns out to be due to the hidden service cases in that function. There is no discrepancy for normal exit circuits. It is also likely the source of potentially deeper issues with proper timeout usage for hidden service circuits.
Ouch. So I'm assuming this is in "don't merge it right now" state then? Could there be some way to re-think the design so that we can't overcount/undercount? Perhaps to have a flag that's set the first time we count something or render it uncountable, so we can't double-count it? Or is this not an overcounting/undercounting thing?
>> In my testing, any amount of integer truncation eventually led to the accumulation of enough roundoff error for us to either overcount or undercount. It was only a matter of time for larger values.
>> I commented the scale_factor and mult_factor functions to make clear that we only scale when there is no integer truncation, and that this means we should be careful when changing those parameters.
> Hm. I am still not comfortable about this. Can we think about what this implies for possible values for those scaling factors? It seems possible (though unlikely) to wind up with a sequence of events in which we can never scale downwards, even with scale_factor == 2. If you're worried about integer roundoff, mightn't the answer be instead to make sure that the scale threshold is high, and to perform rounding rather than truncating division?
Hrmm. I'm pretty sure rounding will only reduce the error; it can still accumulate. Bleh. I'm not seeing any clear way forward here, save for switching everything to floating point. :/
>> This discrepancy turns out to be due to the hidden service cases in that function. There is no discrepancy for normal exit circuits. It is also likely the source of potentially deeper issues with proper timeout usage for hidden service circuits.
> Ouch. So I'm assuming this is in "don't merge it right now" state then? Could there be some way to re-think the design so that we can't overcount/undercount? Perhaps to have a flag that's set the first time we count something or render it uncountable, so we can't double-count it? Or is this not an overcounting/undercounting thing?
Yeah, so the discrepancy I'm talking about wrt hidden services is not for the defense; it's for the informational "this many of your failures counted as timeouts" part of the log. The issue is in how circuit_expire_building() deals with CBT. It's probably actually a separate bug.
The actual success counts are already enforced with such a state flag.
However, yeah, I'm still testing with this branch. Don't merge it just yet.
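For reference, a minimal sketch of that count-once state flag (hypothetical names; transitions only move forward, so nothing gets counted twice):

    typedef enum {
      PATH_STATE_NEW,       /* not yet counted toward anything */
      PATH_STATE_ATTEMPTED, /* counted as a first-hop attempt */
      PATH_STATE_SUCCEEDED  /* counted as a success; terminal */
    } path_state_t;

    static void
    count_circ_success(path_state_t *state, unsigned int *successes)
    {
      if (*state != PATH_STATE_ATTEMPTED)
        return; /* never attempted, or already counted */
      ++*successes;
      *state = PATH_STATE_SUCCEEDED;
    }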
I created #7341 for the timeout issue. Any magical insight on the truncation problem? Should we just switch to double-precision floating point for the state file counts? Or maybe we should use a real (weighted?) moving average and simply store a count and a double rate value, and publish some kind of weight in the consensus?
I am worried that the pattern Nick mentions is technically exploitable by guards to prevent scaling. I don't know if I should be more worried about that, or about the ability to ensure accuracy.
Trac: Keywords: MikePerry201210 deleted, MikePerry201211 added; Status: needs_review to needs_revision
> I created #7341 for the timeout issue. Any magical insight on the truncation problem? Should we just switch to double-precision floating point for the state file counts? Or maybe we should use a real (weighted?) moving average and simply store a count and a double rate value, and publish some kind of weight in the consensus?
Hm. I think using doubles would be okay. It sounds less error-prone than most of the other approaches. Does it sound less error-prone to you?
>> I created #7341 for the timeout issue. Any magical insight on the truncation problem? Should we just switch to double-precision floating point for the state file counts? Or maybe we should use a real (weighted?) moving average and simply store a count and a double rate value, and publish some kind of weight in the consensus?
> Hm. I think using doubles would be okay. It sounds less error-prone than most of the other approaches. Does it sound less error-prone to you?
Less error-prone to get to from what we've got, yes. However, I am still tempted to change it to an EWMA of some kind, because that's more typical and less crazy than this scaling idea, which I think only makes sense if we're staying in the integers. I can update my simulation to try out various EWMAs and see what I think after that.
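For concreteness, the EWMA variant would be something like this (a sketch; alpha would be a new consensus-tunable weight):

    /* Sketch of an exponentially weighted moving average of circuit
     * success: one double in the state file plus one weight to tune. */
    static double
    pb_ewma_update(double rate, int success, double alpha)
    {
      /* alpha in (0, 1]: larger values let recent circuits dominate. */
      return (1.0 - alpha) * rate + alpha * (success ? 1.0 : 0.0);
    }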
I spent some time looking into EWMA, and I think it's just going to add more crazy things to tune here. Perhaps keeping it simple by just switching to floats is the best way for now, especially given that I've also discovered I need to solve #3443, #7341, and #7440 in order to improve this log message's frequency and information.
Trac: Keywords: MikePerry201211 deleted, MikePerry201212 added; Actualpoints: 8 to 10
I believe all the above issues are fixed in mikeperry/209-path-bias-changes; however, please pay careful attention to the XXX comments. I have a few questions in those comments for code review. See also #7691.
I've tested this extensively over about 10,000 circuits, in combination with fixes for #7440 and #7341. Note that #7341 is a separate branch (mikeperry/bug7341), but #7440 has a ton of fixes mixed into 209-path-bias-changes. The whole thing has been rebased on origin/master after the #3443 merge. I've tested hidden service client+server, IPv4, DNS, and a few other circuit usage types.
See changes file for a summary of the fixes.
Trac: Status: needs_revision to needs_review; Actualpoints: 10 to 19
Since it's one branch, I'll review the branch in one place.
This is part one of the review; I need to take a break now. It covers up to but
not including 721f7e375114abfc, with a little skipping around in the diff
that I've been reading side-by-side to see what stays and what gets fixed.
Part one begins:
pathbias_get_dropguards -- I think it's supposed to give a boolean, but it's
taking its input in range 0..100 and dividing it by 100.0. Assuming I'm
right about it being a boolean, you need to change its type in config.c to
"autobool", so that setting it to "auto" can mean "use the value from the
consensus".
I think the comment on pathbias_get_scale_factor might be wrong; does the
code still refrain from scaling in the event of truncation? That no longer
appears to be the case.
In entry_guards_update_state, you're using e->circ_attempts in a boolean
context, which compares it to 0.0; that's potentially yucky. You shouldn't
do equality checks on doubles.
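I.e., something along these lines instead (field names approximated from the branch):

    /* Fragile: a double in boolean context is an implicit != 0.0 test. */
    if (e->circ_attempts)
      ratio = e->circ_successes / e->circ_attempts;

    /* Clearer: compare against a threshold, not exact (in)equality. */
    if (e->circ_attempts >= 1.0)
      ratio = e->circ_successes / e->circ_attempts;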
Do we verify that scale_factor is strictly less than mult_factor? It looks
like in ef1b830ef8d751172ebe5 we removed the requirement, making it so
scale_factor == mult_factor is allowed.
pathbias_count_successful_close (and others that rely on
entry_guard_get_by_id_digest) is going to hit its LD_BUG whenever we have a
node stop being a guard after we open a circuit through it.
I'm concerned about what happens with circuits that are around when Tor dies.
Do they count as successes, or failures, or not at all? I think right now
they're counted as never having closed successfully; is that so? It seems
kind of odd to compare "successfully closed circuits" to "circuits for which
we've been the first hop."
Don't use memcmp. Use fast_memeq or tor_memeq or fast_memcmp as appropriate.
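E.g., for a digest comparison (sketch):

    /* Instead of:  if (!memcmp(a, b, DIGEST_LEN)) */
    if (tor_memeq(a, b, DIGEST_LEN)) {
      /* tor_memeq() is data-independent (constant-time); fast_memeq()
       * is the right pick where the data isn't timing-sensitive. */
    }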
This business with counting streams successfully sent on a circuit, using end
stream reasons to say which are okay and which aren't, and so on... is it
specified anywhere? I'm not seeing it in proposal 209.
The comment on any_streams_succeeded seems wrong; it's not SOCKS streams,
it's any client streams, right?
Re 7f8cbe389d095f673bfc03437e1f7521abae698b: It is indeed redundant. Looks
like you fixed that in c3028edba6f8474175a59ad22dd0d3ca23c60f85. Those are
two good ones to squash before merge.
"and it shat itself" ?
Re the XXX in connection_ap_process_end_not_open: 'digest' is hard enough to
spoof that if the digest is correct, you can assume you're getting the right
cell with fairly high probability.
Gcc gives me these warnings:

    src/or/circuitbuild.c:1020: warning: no previous prototype for ‘pathbias_get_warn_rate’
    src/or/circuitbuild.c: In function ‘pathbias_get_dropguards’:
    src/or/circuitbuild.c:1058: warning: implicit conversion shortens 64-bit value into a 32-bit value
    src/or/circuitbuild.c: In function ‘entry_guard_inc_circ_attempt_count’:
    src/or/circuitbuild.c:1669: warning: cast from function call of type ‘double’ to non-matching type ‘int’
    src/or/circuitbuild.c:1688: warning: cast from function call of type ‘double’ to non-matching type ‘int’
    src/or/circuitbuild.c:1706: warning: cast from function call of type ‘double’ to non-matching type ‘int’
    src/or/circuitbuild.c:1725: warning: cast from function call of type ‘double’ to non-matching type ‘int’
    make[1]: *** [src/or/circuitbuild.o] Error 1
    cc1: warnings being treated as errors
    src/or/connection_edge.c: In function ‘connection_ap_handshake_socks_reply’:
    src/or/connection_edge.c:2191: warning: format ‘%ld’ expects type ‘long int’, but argument 5 has type ‘uint64_t’
    make[1]: *** [src/or/connection_edge.o] Error 1
    cc1: warnings being treated as errors
    src/or/entrynodes.c: In function ‘entry_guards_update_state’:
    src/or/entrynodes.c:1197: warning: comparing floating point with == or != is unsafe
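The connection_edge.c one is the classic uint64_t format-string portability bug; a sketch of the standard C99 fix (tor's compat layer also has macros for this job):

    #include <inttypes.h>
    #include <stdio.h>

    /* %ld only matches uint64_t where long happens to be 64 bits;
     * PRIu64 expands to the right conversion specifier everywhere. */
    void
    log_stream_count(uint64_t n)
    {
      printf("streams=%" PRIu64 "\n", n);
    }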
Part one ends. Next: review 721f7e375114abfcb1 and on.
Okay, part two (which covers the rest of the branch up to ccaeef22e168af34e9b6a63d65ce17e58dd702e2) is much shorter:
Documentation on pathbias_check_close seems inadequate; it should
really say what the function actually DOES. It happily calls
pathbias_count_successful_close.
Setting timestamp_dirty in circuit_extend_to_new_exit seems wrong.
The circuit isn't dirty! Maybe overloading timestamp_dirty in this
way, rather than grabbing a new bit, is a mistake?
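Grabbing a new bit would be roughly (a hypothetical field name, not the real struct):

    /* Hypothetical sketch: a dedicated flag instead of pretending the
     * circuit has carried streams. */
    typedef struct origin_circuit_t {
      /* ... existing fields ... */
      unsigned int pathbias_no_reuse : 1; /* never attach new streams */
    } origin_circuit_t;

    static void
    mark_extended_to_new_exit(origin_circuit_t *circ)
    {
      circ->pathbias_no_reuse = 1; /* instead of setting timestamp_dirty */
    }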
04866055e8dadc9eb5b09773 freaks me out. This feels like a significant
redesign of the whole concept, and I didn't see it mentioned on
tor-dev anywhere or on the tickets. "As long as end-to-end tagging is
possible, we assume the adversary will use it over hop-to-hop
failure"? Is there any discussion of this anyplace? (The idea here
seems to be that entry nodes are allowed to filter for the second hop
as much as they like, and the pathbias code won't care, but that if
they start tagging/filtering for the third hop or later, they need to
be called out. Is that right?)
In general: use %f to printf a double, but %lf to scanf a double. At
least one compiler we care about complains when it sees %lf in a
printf format string.
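A standalone example of the asymmetry:

    #include <stdio.h>

    int
    main(void)
    {
      double rate;
      /* scanf family: %lf is required to store into a double. */
      if (sscanf("0.75", "%lf", &rate) != 1)
        return 1;
      /* printf family: float arguments promote to double, so %f is
       * correct, and some compilers warn about %lf here. */
      printf("rate=%f\n", rate);
      return 0;
    }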