Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#7157 closed enhancement (implemented)

"Low circuit success rate %u/%u for guard %s=%s."

Reported by: arma
Owned by:
Priority: Medium
Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor
Version:
Severity:
Keywords: tor-client, MikePerry201212
Cc: mikeperry
Actual Points: 19
Parent ID: #5456
Points:
Reviewer:
Sponsor:

Description

I'm seeing this "low circuit success rate" warning on my client sometimes when on a flaky wireless network connection.

I'm guessing that the problem is that we're blaming the guard when we should in fact be blaming the network.

We should a) figure out how to improve the log message so we get more hints about where our bug is, and/or b) just straight-out hunt for it.

Mike suggests raising the number of circuits we have to make before we'll spit out the warning. I expect that will mask the bug, but not find or resolve it.

Bug report originally raised in #7066.

Child Tickets

Change History (34)

comment:1 Changed 7 years ago by mikeperry

If there is a bug with how we count circuits, what is it about changing the minimum number of circuits that could possibly change this bug?

Also, were you using weird torrc options, testing hidden services, or doing anything atypical? I seem to recall wacky torrc options of yours causing problems in the past.

One thing we could do is record what fraction of the failures were timeouts in the log line, as well as the current CBT 'right-censored' timeout value (which the path bias stuff uses instead of the 80% timeout value). These might be useful datapoints from people who only give us the one logline and then disappear..

comment:2 in reply to:  1 Changed 7 years ago by mikeperry

Replying to mikeperry:

One thing we could do is record what fraction of the failures were timeouts in the log line, as well as the current CBT 'right-censored' timeout value (which the path bias stuff uses instead of the 80% timeout value). These might be useful datapoints from people who only give us the one logline and then disappear..

I have a branch with these log updates in mikeperry/209-path-bias-changes. I also updated the proposal to match in my torspec remote mikeperry/path-bias-tuning.

comment:3 Changed 7 years ago by nickm

Status: new → needs_revision

quick review:

  • pathbias_count_timeout() should be declared in circuitbuild.h, not or.h. We shouldn't be declaring functions in or.h any more.
  • "crited" isn't a word; what do you mean it to mean? If "crit" is the verb, I think "critted" is the past participle, but...?
  • You can't take something out of or_options_t without adding an OBSOLETE() entry for it, or an alias for it in the aliases table. Otherwise every torrc that used it would suddenly break.
  • All functions and fields and options need documentation. I shouldn't have to go read the proposal to find out what a "mult factor" is; the comments and the manpage should explain it.
  • I have a hard time understanding why the code does "%" in "((mult_factor * foo) % scale_factor) == 0". I guess that you're trying to avoid integer truncation, but won't this prevent scaling entirely? Suppose scale_factor is 100. It seems to me that unless first_hops and circuit_successes are both divisible by 100 at the same time, they'll never get scaled. Can't we just reserve scaling for when the values are "big enough" that integer truncation won't matter?
  • LD_PROTOCOL might not be appropriate here; that's for protocol violations, not for malfunctioning nodes.
  • The "your guard %s=%s" code should probably get refactored to use the same format as node_describe , and probably to be an independent function.

comment:4 Changed 7 years ago by nickm

Also: "needs a changes file"

comment:5 in reply to:  3 Changed 7 years ago by mikeperry

Actual Points: 8
Keywords: MikePerry201210 added
Status: needs_revision → needs_review
Type: defect → enhancement

Replying to nickm:

quick review:

  • I have a hard time understanding why the code does "%" in "((mult_factor * foo) % scale_factor) == 0". I guess that you're trying to avoid integer truncation, but won't this prevent scaling entirely? Suppose scale_factor is 100. It seems to me that unless first_hops and circuit_successes are both divisible by 100 at the same time, they'll never get scaled. Can't we just reserve scaling for when the values are "big enough" that integer truncation won't matter?

In my testing, any amount of integer truncation eventually led to the accumulation of enough roundoff error for us to either overcount or undercount. It was only a matter of time for larger values.

I commented the scale_factor and mult_factor functions to be more clear about the fact that we only scale if there is no integer truncation, and that this means we should be careful when changing those parameters.

  • The "your guard %s=%s" code should probably get refactored to use the same format as node_describe , and probably to be an independent function.

I looked at node_describe and router_describe, and they seemed overkill for this. Both functions output a bunch of info we simply do not have available for entry_guard_t (since we might not have a node_t or a routerinfo_t for the guard).. Effectively I'd be writing a custom utility function just to do "%s=%s" in this case, which seemed silly.

Everything else you mentioned has been pushed to that branch. I'll try to test a build with this code and #3443 soon.

comment:6 Changed 7 years ago by mikeperry

Commenting here to checkpoint my progress for tomorrow.

I appear to have found at least one bug in testing so far: There appears to be a discrepancy between the PathBias timeout counting and the CBT right-censored CircuitBuildAbandonedCount values..

I think the error is in circuit_expire_building() and the difference between the calls to circuit_build_times_count_close() and pathbias_count_timeout().. I still haven't fully conceptualized all of the various cases of circuit_expire_building() though. Man is it a hairy mess. :/

comment:7 Changed 7 years ago by mikeperry

Also just hit: src/or/networkstatus.c:2216 get_net_param_from_list: Assertion max_val > min_val failed; aborting.

It looks like get_net_param_from_list() should allow min_val == max_val based on its codepaths, but I'll go ahead and change my code instead. Easier than filing a new bug+changelog entry.

comment:8 in reply to:  6 Changed 7 years ago by mikeperry

Replying to mikeperry:

I appear to have found at least one bug in testing so far: There appears to be a discrepancy between the PathBias timeout counting and the CBT right-censored CircuitBuildAbandonedCount values..

I think the error is in circuit_expire_building() and the difference between the calls to circuit_build_times_count_close() and pathbias_count_timeout().. I still haven't fully conceptualized all of the various cases of circuit_expire_building() though. Man is it a hairy mess. :/

This discrepancy turns out to be due to the hidden service cases in that function. There is no discrepancy for normal exit circuits.. It also is likely the source of potentially deeper issues with proper timeout usage for hidden service circuits...

comment:9 Changed 7 years ago by nickm

In my testing, any amount of integer truncation eventually led to the accumulation of enough roundoff error for us to either overcount or undercount. It was only a matter of time for larger values.

I commented the scale_factor and mult_factor functions to be more clear about the fact that we only scale if there is no integer truncation, and that this means we should be careful when changing those parameters.

Hm. I am still not comfortable about this. Can we think about what this implies for possible values for those scaling factors? It seems possible (though unlikely) to wind up with a sequence of events in which we can *never* scale downwards, even with scale_factor == 2. If you're worried about integer roundoff, mightn't the answer be instead to make sure that the scale threshold is high, and to perform rounding rather than truncating division?

This discrepancy turns out to be due to the hidden service cases in that function. There is no discrepancy for normal exit circuits.. It also is likely the source of potentially deeper issues with proper timeout usage for hidden service circuits...

Ouch. So I'm assuming this is in "don't merge it right now" state then? Could there be some way to re-think the design so that we *can't* overcount/undercount? Perhaps to have a flag that's set the first time we count something or render it uncountable, so we can't double-count it? Or is this not an overcounting/undercounting thing?

comment:10 in reply to:  9 Changed 7 years ago by mikeperry

Replying to nickm:

In my testing, any amount of integer truncation eventually led to the accumulation of enough roundoff error for us to either overcount or undercount. It was only a matter of time for larger values.

I commented the scale_factor and mult_factor functions to be more clear about the fact that we only scale if there is no integer truncation, and that this means we should be careful when changing those parameters.

Hm. I am still not comfortable about this. Can we think about what this implies for possible values for those scaling factors? It seems possible (though unlikely) to wind up with a sequence of events in which we can *never* scale downwards, even with scale_factor == 2. If you're worried about integer roundoff, mightn't the answer be instead to make sure that the scale threshold is high, and to perform rounding rather than truncating division?

Hrmm.. I'm pretty sure rounding will only reduce the error.. It can still accumulate.. Bleh. I'm not seeing any clear way forward here, save for switching everything to floating point. :/

This discrepancy turns out to be due to the hidden service cases in that function. There is no discrepancy for normal exit circuits.. It also is likely the source of potentially deeper issues with proper timeout usage for hidden service circuits...

Ouch. So I'm assuming this is in "don't merge it right now" state then? Could there be some way to re-think the design so that we *can't* overcount/undercount? Perhaps to have a flag that's set the first time we count something or render it uncountable, so we can't double-count it? Or is this not an overcounting/undercounting thing?

Yeah, so the discrepancy I'm talking about wrt hidden services is not for the defense. It's for the informational "You've had these many failures count as timeouts" part of the log. The issue is in how circuit_expire_building deals with CBT. It's probably actually a separate bug.

The actual success counts are already enforced with such a state flag.

However, yeah, I'm still testing with this branch. Don't merge it just yet.

comment:11 Changed 7 years ago by mikeperry

Keywords: MikePerry201211 added; MikePerry201210 removed
Status: needs_review → needs_revision

I created #7341 for the timeout issue. Any magical insight on the truncation problem? Should we just switch to double-precision floating point for the state file counts? Or maybe we should use a real (weighted?) moving average and simply store a count and a double rate value, and publish some kind of weight in the consensus?

I am worried that the pattern Nick mentions is technically exploitable by guards to prevent scaling. I don't know if I should be more worried about that, or about the ability to ensure accuracy..

comment:12 in reply to:  11 ; Changed 7 years ago by nickm

Replying to mikeperry:

I created #7341 for the timeout issue. Any magical insight on the truncation problem? Should we just switch to double-precision floating point for the state file counts? Or maybe we should use a real (weighted?) moving average and simply store a count and a double rate value, and publish some kind of weight in the consensus?

Hm. I think that using doubles would be okay IMO. It sounds less error-prone than most of the other approaches. Does it sound less error-prone to you?

comment:13 in reply to:  12 Changed 7 years ago by mikeperry

Replying to nickm:

Replying to mikeperry:

I created #7341 for the timeout issue. Any magical insight on the truncation problem? Should we just switch to double-precision floating point for the state file counts? Or maybe we should use a real (weighted?) moving average and simply store a count and a double rate value, and publish some kind of weight in the consensus?

Hm. I think that using doubles would be okay IMO. It sounds less error-prone than most of the other approaches. Does it sound less error-prone to you?

Less error-prone to get to from what we've got, yes. However, I am still tempted to change it to an EWMA of some kind, because that's more typical and less crazy than this scaling idea, which I think only makes sense if we're staying in the integers. I can update my simulation to try out various EWMA and see what I think after that.

comment:14 Changed 7 years ago by mikeperry

Parent ID: #5456

comment:15 Changed 7 years ago by mikeperry

Actual Points: 8 → 10
Keywords: MikePerry201212 added; MikePerry201211 removed

I spent some time looking into EWMA, and I think it's just going to add more crazy things to tune here. Perhaps keeping it simple just by switching to floats is the best way for now, esp given that I've also discovered I need to solve #3443, #7341, and #7440 in order to improve this log message's frequency and information.

comment:16 Changed 7 years ago by mikeperry

Actual Points: 10 → 19
Status: needs_revision → needs_review

I believe all the above issues are fixed in mikeperry/209-path-bias-changes, however please pay careful attention to the XXX comments. I have a few questions in those comments for code review. See also #7691.

I've tested this extensively over about 10,000 circuits in combination with a fix for #7440, and #7341. Note that #7341 is a separate branch (mikeperry/bug7341), but #7440 has a ton of fixes mixed in 209-path-bias-changes. The whole thing has been rebased on origin/master after the #3443 merge. I've tested hidden service client+server, ipv4, DNS, and a few other circuit usage types.

See changes file for a summary of the fixes.

comment:17 Changed 7 years ago by nickm

Since it's one branch, I'll review the branch in one place.

This is part one of the review: I need to take a break now. It covers up to but
not including 721f7e375114abfc, with a little skipping around in the diff
that I've been reading side-by-side to see what stays and what gets fixed.

Part one begins:

pathbias_get_dropguards -- I think it's supposed to give a boolean, but it's
taking its input in range 0..100 and dividing it by 100.0. Assuming I'm
right about it being a boolean, you need to change its type in config.c to
"autobool", so that setting it to "auto" can mean "use the value from the
consensus".

I think the comment on pathbias_get_scale_factor might be wrong; does the
code still refrain from scaling in the event of truncation? It doesn't
appear to be the case.

In entry_guards_update_state, you're using e->circ_attempts in a boolean
context, which compares it to 0.0; that's potentially yucky. You shouldn't
do equality checks on doubles.

Do we verify that scale_factor is strictly less than mult_factor? It looks like
in ef1b830ef8d751172ebe5 we removed the requirement, making it so
scale_factor == mult_factor is allowed.

pathbias_count_successful_close (and others that rely on
entry_guard_get_by_id_digest) is going to hit its LD_BUG whenever we have a
node stop being a guard after we open a circuit through it.

I'm concerned about what happens with circuits that are around when Tor dies.
Do they count as successes, or failures, or not at all? I think right now
they're counted as never having closed successfully; is that so? It seems
kind of odd to compare "successfully closed circuits" to "circuits for which
we've been the first hop."

Don't use memcmp. Use fast_memeq or tor_memeq or fast_memcmp as appropriate.

This business with counting streams successfully sent on a circuit, using end
stream reasons to say which are okay and which aren't, and so on... is it
specified anywhere? I'm not seeing it in proposal 209.

The comment on any_streams_succeeded seems wrong; it's not SOCKS streams,
it's any client streams, right?

Re 7f8cbe389d095f673bfc03437e1f7521abae698b: It is indeed redundant. Looks
like you fixed that in c3028edba6f8474175a59ad22dd0d3ca23c60f85. Those are
two good ones to squash before merge.

"and it shat itself" ?

Re the XXX in connection_ap_process_end_not_open: 'digest' is hard enough to
spoof that if the digest is correct, you can assume you're getting the right
cell with fairly high probability.

Gcc gives me these warnings:

src/or/circuitbuild.c:1020: warning: no previous prototype for ‘pathbias_get_warn_rate’
src/or/circuitbuild.c: In function ‘pathbias_get_dropguards’:
src/or/circuitbuild.c:1058: warning: implicit conversion shortens 64-bit value into a 32-bit value
src/or/circuitbuild.c: In function ‘entry_guard_inc_circ_attempt_count’:
src/or/circuitbuild.c:1669: warning: cast from function call of type ‘double’ to non-matching type ‘int’
src/or/circuitbuild.c:1688: warning: cast from function call of type ‘double’ to non-matching type ‘int’
src/or/circuitbuild.c:1706: warning: cast from function call of type ‘double’ to non-matching type ‘int’
src/or/circuitbuild.c:1725: warning: cast from function call of type ‘double’ to non-matching type ‘int’
make[1]: *** [src/or/circuitbuild.o] Error 1
cc1: warnings being treated as errors
src/or/connection_edge.c: In function ‘connection_ap_handshake_socks_reply’:
src/or/connection_edge.c:2191: warning: format ‘%ld’ expects type ‘long int’, but argument 5 has type ‘uint64_t’
make[1]: *** [src/or/connection_edge.o] Error 1
cc1: warnings being treated as errors
src/or/entrynodes.c: In function ‘entry_guards_update_state’:
src/or/entrynodes.c:1197: warning: comparing floating point with == or != is unsafe

Part one ends. Next: review 721f7e375114abfcb1 and on.

comment:18 Changed 7 years ago by nickm

Status: needs_review → needs_revision

Okay, part two (which covers the rest of the branch up to ccaeef22e168af34e9b6a63d65ce17e58dd702e2) is much shorter:

Documentation on pathbias_check_close seems inadequate; it should
really say what the function actually DOES. It happily calls
pathbias_count_successful_close.

Setting timestamp_dirty in circuit_extend_to_new_exit seems wrong.
The circuit *isn't* dirty! Maybe overloading timestamp_dirty in this
way, rather than grabbing a new bit, is a mistake?

04866055e8dadc9eb5b09773 freaks me out. This feels like a significant
redesign of the whole concept, and I didn't see it mentioned on
tor-dev anywhere or on the tickets. "As long as end-to-end tagging is
possible, we assume the adversary will use it over hop-to-hop
failure"? Is there any discussion of this anyplace? (The idea here
seems to be that entry nodes are allowed to filter for the second hop
as much as they like, and the pathbias code won't care, but that if
they start tagging/filtering for the third hop or later, they need to
be called out. Is that right?)

In general: use %f to printf a double, but %lf to scanf a double. At
least one compiler we care about complains when it sees %lf in a
printf format string.

comment:19 in reply to:  17 ; Changed 7 years ago by mikeperry

Replying to nickm:

Since it's one branch, I'll review the branch in one place.

This is part one of the review: I need to take a break now. It covers up to but
not including 721f7e375114abfc, with a little skipping around in the diff
that I've been reading side-by-side to see what stays and what gets fixed.

Ok. If I don't reply to a comment, assume it is simply fixed.

pathbias_get_dropguards -- I think it's supposed to give a boolean, but it's
taking its input in range 0..100 and dividing it by 100.0. Assuming I'm
right about it being a boolean, you need to change its type in config.c to
"autobool", so that setting it to "auto" can mean "use the value from the
consensus".

Ok. I fixed this, but do we want the AUTO prefix for any of the other path bias configs? We all want them to default to consensus values, I think?

Do we verify that scale_factor is strictly less than mult_factor? It looks like
in ef1b830ef8d751172ebe5 we removed the requirement, making it so
scale_factor == mult_factor is allowed.

I think this is fine. It doesn't cause damage if this happens, it merely allows us to disable scaling entirely if we want, right?

pathbias_count_successful_close (and others that rely on
entry_guard_get_by_id_digest) is going to hit its LD_BUG whenever we have a
node stop being a guard after we open a circuit through it.

Hrmm. Suggestions here? The log is at info, but if we know it can happen, I guess you mean to say it should be a different log domain?

Is there anything better we can do to handle that case?

I'm concerned about what happens with circuits that are around when Tor dies.
Do they count as successes, or failures, or not at all? I think right now
they're counted as never having closed successfully; is that so? It seems
kind of odd to compare "successfully closed circuits" to "circuits for which
we've been the first hop."

No, we give currently open circuits the benefit of the doubt. See pathbias_get_closed_count(). We count currently opened circuits as successfully closed for that guard whenever we evaluate our counts or write the state file.

This business with counting streams successfully sent on a circuit, using end
stream reasons to say which are okay and which aren't, and so on... is it
specified anywhere? I'm not seeing it in proposal 209.

No, it is based on discussions with Rob Jansen after he reviewed Prop209. I haven't documented it yet. It's closely related to #7691.. Should I document both of those in the proposal, or try to condense the whole thing into a spec update?

The comment on any_streams_succeeded seems wrong; it's not SOCKS streams,
it's any client streams, right?

This got cleaned up and refactored into a path_state_t flag in later commits.

"and it shat itself" ? [in connection_ap_process_end_not_open()]

Past tense of "to shit [oneself]". ;)

While writing #7691, I actually discovered that exits will send back END_STREAM_REASON_TORPROTOCOL if you send them utter garbage (I used the wrong hop's cpath to send them the cell at first). However, after I looked closer at the code, it seems quite possible for the other two reason codes to come back for more subtle precision tagging.

Gcc gives me these warnings:

src/or/circuitbuild.c:1020: warning: no previous prototype for ‘pathbias_get_warn_rate’
src/or/circuitbuild.c: In function ‘pathbias_get_dropguards’:
src/or/circuitbuild.c:1058: warning: implicit conversion shortens 64-bit value into a 32-bit value
src/or/circuitbuild.c: In function ‘entry_guard_inc_circ_attempt_count’:
src/or/circuitbuild.c:1669: warning: cast from function call of type ‘double’ to non-matching type ‘int’
src/or/circuitbuild.c:1688: warning: cast from function call of type ‘double’ to non-matching type ‘int’
src/or/circuitbuild.c:1706: warning: cast from function call of type ‘double’ to non-matching type ‘int’
src/or/circuitbuild.c:1725: warning: cast from function call of type ‘double’ to non-matching type ‘int’
make[1]: *** [src/or/circuitbuild.o] Error 1
cc1: warnings being treated as errors
src/or/connection_edge.c: In function ‘connection_ap_handshake_socks_reply’:
src/or/connection_edge.c:2191: warning: format ‘%ld’ expects type ‘long int’, but argument 5 has type ‘uint64_t’
make[1]: *** [src/or/connection_edge.o] Error 1
cc1: warnings being treated as errors
src/or/entrynodes.c: In function ‘entry_guards_update_state’:
src/or/entrynodes.c:1197: warning: comparing floating point with == or != is unsafe

Yikes. Didn't realize we made the default build warningless.

All of the above fixes should be pushed to their own 'part 1' review commit shortly.

comment:20 in reply to:  19 ; Changed 7 years ago by nickm

Replying to mikeperry:

pathbias_get_dropguards -- I think it's supposed to give a boolean, but it's
taking its input in range 0..100 and dividing it by 100.0. Assuming I'm
right about it being a boolean, you need to change its type in config.c to
"autobool", so that setting it to "auto" can mean "use the value from the
consensus".

Ok. I fixed this, but do we want the AUTO prefix for any of the other path bias configs? We all want them to default to consensus values, I think?

The AUTOBOOL thing is special. For numeric values, you can do what we're doing and have a special "out of bounds" value that indicates "use the consensus". But regular booleans only take two values, with none left over for "auto".

Do we verify that scale_factor is strictly less than mult_factor? It looks like
in ef1b830ef8d751172ebe5 we removed the requirement, making it so
scale_factor == mult_factor is allowed.

I think this is fine. It doesn't cause damage if this happens, it merely allows us to disable scaling entirely if we want, right?

I think that's okay so long as we definitely prevent scale_factor > mult_factor.

pathbias_count_successful_close (and others that rely on
entry_guard_get_by_id_digest) is going to hit its LD_BUG whenever we have a
node stop being a guard after we open a circuit through it.

Hrmm. Suggestions here? The log is at info, but if we know it can happen, I guess you mean to say it should be a different log domain?

Is there anything better we can do to handle that case?

Well, it's not a bug. A different log domain seems in order. I'm not sure that other fixes are needed; will the code still work in that case?

I'm concerned about what happens with circuits that are around when Tor dies.
Do they count as successes, or failures, or not at all? I think right now
they're counted as never having closed successfully; is that so? It seems
kind of odd to compare "successfully closed circuits" to "circuits for which
we've been the first hop."

No, we give currently open circuits the benefit of the doubt. See pathbias_get_closed_count(). We count currently opened circuits as successfully closed for that guard whenever we evaluate our counts or write the state file.

This business with counting streams successfully sent on a circuit, using end
stream reasons to say which are okay and which aren't, and so on... is it
specified anywhere? I'm not seeing it in proposal 209.

No, it is based on discussions with Rob Jansen after he reviewed Prop209. I haven't documented it yet. It's closely related to #7691.. Should I document both of those in the proposal, or try to condense the whole thing into a spec update?

A short little email to tor-dev to follow-up on the proposal, or whatever would be fastest for you, would be cool. Just so there's some record of what you're doing and why.

The comment on any_streams_succeeded seems wrong; it's not SOCKS streams,
it's any client streams, right?

This got cleaned up and refactored into a path_state_t flag in later commits.

Okay.

"and it shat itself" ? [in connection_ap_process_end_not_open()]

Past tense of "to shit [oneself]". ;)

Yeah, but the exit didn't shit itself. It just gave an error.

I'm not against cussing in software, but this would be the first of George Carlin's Dirty Seven in Tor, and I'd like to save that for a more special occasion.

While writing #7691, I actually discovered that exits will send back END_STREAM_REASON_TORPROTOCOL if you send them utter garbage (I used the wrong hop's cpath to send them the cell at first). However, after I looked closer at the code, it seems quite possible for the other two reason codes to come back for more subtle precision tagging.

Gcc gives me these warnings:

src/or/circuitbuild.c:1020: warning: no previous prototype for ‘pathbias_get_warn_rate’
src/or/circuitbuild.c: In function ‘pathbias_get_dropguards’:
src/or/circuitbuild.c:1058: warning: implicit conversion shortens 64-bit value into a 32-bit value
src/or/circuitbuild.c: In function ‘entry_guard_inc_circ_attempt_count’:
src/or/circuitbuild.c:1669: warning: cast from function call of type ‘double’ to non-matching type ‘int’
src/or/circuitbuild.c:1688: warning: cast from function call of type ‘double’ to non-matching type ‘int’
src/or/circuitbuild.c:1706: warning: cast from function call of type ‘double’ to non-matching type ‘int’
src/or/circuitbuild.c:1725: warning: cast from function call of type ‘double’ to non-matching type ‘int’
make[1]: *** [src/or/circuitbuild.o] Error 1
cc1: warnings being treated as errors
src/or/connection_edge.c: In function ‘connection_ap_handshake_socks_reply’:
src/or/connection_edge.c:2191: warning: format ‘%ld’ expects type ‘long int’, but argument 5 has type ‘uint64_t’
make[1]: *** [src/or/connection_edge.o] Error 1
cc1: warnings being treated as errors
src/or/entrynodes.c: In function ‘entry_guards_update_state’:
src/or/entrynodes.c:1197: warning: comparing floating point with == or != is unsafe

Yikes. Didn't realize we made the default build warningless.

-Wall, not warningless. It's been that way for a while, I thought.

Maybe --enable-gcc-warnings-advisory (or whatever it's called) should be default.

All of the above fixes should be pushed to their own 'part 1' review commit shortly.

comment:21 in reply to:  20 Changed 7 years ago by mikeperry

Replying to nickm:

Replying to mikeperry:

Do we verify that scale_factor is strictly less than mult_factor? It looks like
in ef1b830ef8d751172ebe5 we removed the requirement, making it so
scale_factor == mult_factor is allowed.

I think this is fine. It doesn't cause damage if this happens, it merely allows us to disable scaling entirely if we want, right?

I think that's okay so long as we definitely prevent scale_factor > mult_factor.

We do when the value comes from the consensus through the bounds on networkstatus_get_param(). We currently let the user set whatever insane values they like for all of these parameters, though...

The part 1 commit is now pushed, btw.

comment:22 Changed 7 years ago by mikeperry

Replying to nickm:

Okay, part two (which covers the rest of the branch up to ccaeef22e168af34e9b6a63d65ce17e58dd702e2) is much shorter:

Setting timestamp_dirty in circuit_extend_to_new_exit seems wrong.
The circuit *isn't* dirty! Maybe overloading timestamp_dirty in this
way, rather than grabbing a new bit, is a mistake?

Hrmm. Both codepaths that cause circuit_extend_to_new_exit to fire seem to indicate it's dirty to me: We're either re-extending an intro circ, or cannibalizing an existing circuit to send it to a specific new exit. In neither of these cases do we really want to re-use this circuit for some random new traffic, right?

You'll note that I also added a dirty timestamp in rend_service_rendezvous_has_opened() for similar reasons..

However, if this bit causes the code that did the re-extending/opening to suddenly decide "Eww, a dirty circuit. Better make a fresh one!" once the extend completes, that could be bad. It didn't *seem* to do this, but some of those cannibalization corner cases are certainly screwy...

04866055e8dadc9eb5b09773 freaks me out. This feels like a significant
redesign of the whole concept, and I didn't see it mentioned on
tor-dev anywhere or on the tickets. "As long as end-to-end tagging is
possible, we assume the adversary will use it over hop-to-hop
failure"? Is there any discussion of this anyplace? (The idea here
seems to be that entry nodes are allowed to filter for the second hop
as much as they like, and the pathbias code won't care, but that if
they start tagging/filtering for the third hop or later, they need to
be called out. Is that right?)

Yes, this is the idea, but there has been no discussion yet. I added it after watching the end-to-end circuit success rate repeatedly fluctuate between 90% and 50%, due almost entirely to per-hop onionskin failure (CPU overload). In fact, what was more worrying was my own aggressive testing seemed to be what was driving the network into these overload conditions, so there is seemingly not very much load margin between "fine-and-dandy" and "overload" right now.

Waiting until the second hop completes removes a lot of the effect of this without impacting what we're looking for (guard-to-exit bias). In fact, this is the part of the branch most likely to make the most impact on Roger's original complaint about the frequency of the logline in the ticket title. To see why, imagine that each node occasionally experiences as much as 25% circuit failure. If we count after the first hop in such a network, that's a success rate of only (1-.25)*(1-.25) = 0.56, which triggers our notice alarm (set at 70%). However, if we wait until two hops, the alarm should not trigger in this same scenario.

Further, because of this squaring of the success rates, per-hop failure is way less appealing than end-to-end tagging for the same amount of network compromise. An adversary that controls 30% of the network would have to drive the end-to-end circuit success rate down to 9% for per-hop failure, but only 30% for end-to-end tagging-based failure. In either case, we'd still catch the per-hop adversary as they failed their last hop, just as the end-to-end tagger would.

In general: use %f to printf a double, but %lf to scanf a double. At
least one compiler we care about complains when it sees %lf in a
printf format string.

Bleh, this is a painfully messy one to find+clean up. I'll do my best.

comment:23 Changed 7 years ago by mikeperry

Actually, "no discussion" about the per-hop ambient failure rate problem isn't true. It's also something I discussed with Rob Jansen (and also Anupam Das).

comment:24 in reply to:  22 Changed 7 years ago by nickm

Replying to mikeperry:

Replying to nickm:

Okay, part two (which covers the rest of the branch up to ccaeef22e168af34e9b6a63d65ce17e58dd702e2) is much shorter:

Setting timestamp_dirty in circuit_extend_to_new_exit seems wrong.
The circuit *isn't* dirty! Maybe overloading timestamp_dirty in this
way, rather than grabbing a new bit, is a mistake?

Hrmm. Both codepaths that cause circuit_extend_to_new_exit to fire seem to indicate it's dirty to me: We're either re-extending an intro circ, or cannibalizing an existing circuit to send it to a specific new exit. In neither of these cases do we really want to re-use this circuit for some random new traffic, right?

That's an abstraction abuse. Just because in every case where we currently cannibalize a circuit, we want to mark it dirty, that doesn't mean that cannibalizing a circuit should mark it dirty, right?

You'll note that I also added a dirty timestamp in rend_service_rendezvous_has_opened() for similar reasons..

But that one really is a dirtying thing: it means that the circuit has been used.

However, if this bit causes the code that did the re-extending/opening to suddenly decide "Eww, a dirty circuit. Better make a fresh one!" once the extend completes, that could be bad. It didn't *seem* to do this, but some of those cannibalization corner cases are certainly screwy...

04866055e8dadc9eb5b09773 freaks me out. This feels like a significant
redesign of the whole concept, and I didn't see it mentioned on
tor-dev anywhere or on the tickets. "As long as end-to-end tagging is
possible, we assume the adversary will use it over hop-to-hop
failure"? Is there any discussion of this anyplace? (The idea here
seems to be that entry nodes are allowed to filter for the second hop
as much as they like, and the pathbias code won't care, but that if
they start tagging/filtering for the third hop or later, they need to
be called out. Is that right?)

Yes, this is the idea, but there has been no discussion yet. I added it after watching the end-to-end circuit success rate repeatedly fluctuate between 90% and 50%, due almost entirely to per-hop onionskin failure (CPU overload).

[...]

What you say is plausible, but it really needs to go on tor-dev or in the proposal or somewhere so we have a record of our design and rationale. We need to get better at making sure that development discussion of this kind goes on in tor-dev, or at least gets copied there once there's a consensus.

comment:25 Changed 7 years ago by mikeperry

Ok, I pushed the changes for review 2. I'll also write up the three major changes from the proposal to tor-dev (the two you mentioned + #7691).

I did not do anything with timestamp_dirty yet. I'm pretty sure we want to keep at least the hidserv timestamp_dirty additions.. Do you just want me to move the timestamp_dirty change to the cannibalization code? It does appear like that might make us decide against re-cannibalizing it, and may impact our use of cannibalized circuits under predictive building conditions + new identity.. hrmm...

If you want me to switch to another path_state_t flag, I can do that, but it's a bit more changes+refactoring.

comment:26 in reply to:  25 ; Changed 7 years ago by nickm

Replying to mikeperry:

Ok, I pushed the changes for review 2. I'll also write up the three major changes from the proposal to tor-dev (the two you mentioned + #7691).

Great; thank you.

I did not do anything with timestamp_dirty yet. I'm pretty sure we want to keep at least the hidserv timestamp_dirty additions.. Do you just want me to move the timestamp_dirty change to the cannibalization code? It does appear like that might make us decide against re-cannibalizing it, and may impact our use of cannibalized circuits under predictive building conditions + new identity.. hrmm...

At this point, I think whichever option is simplest would be best here. This is tricky stuff, and at least your existing code is tested... but if it breaks functionality we care about, we need to think of the simplest fix there. Regressions are teh suxx0r.

If you want me to switch to another path_state_t flag, I can do that, but it's a bit more changes+refactoring.

comment:27 in reply to:  26 ; Changed 7 years ago by mikeperry

Replying to nickm:

Replying to mikeperry:

I did not do anything with timestamp_dirty yet. I'm pretty sure we want to keep at least the hidserv timestamp_dirty additions.. Do you just want me to move the timestamp_dirty change to the cannibalization code? It does appear like that might make us decide against re-cannibalizing it, and may impact our use of cannibalized circuits under predictive building conditions + new identity.. hrmm...

At this point, I think whichever option is simplest would be best here. This is tricky stuff, and at least your existing code is tested... but if it breaks functionality we care about, we need to think of the simplest fix there. Regressions are teh suxx0r.

If you want me to switch to another path_state_t flag, I can do that, but it's a bit more changes+refactoring.

I do like the idea of having symmetry in the path_state_t transitions.. Right now we have BUILD_ATTEMPTED, BUILD_SUCCEEDED, and USE_SUCCEEDED, but no USE_ATTEMPTED. USE_ATTEMPTED would allow us to stop abusing timestamp_dirty as part of our path_state_t state machine, and *might* stave off future bugs due to additional future use of timestamp_dirty..

If you're willing to hold off on merging this to review #7341 and #7691, it probably makes sense to make this change now. That way, I can retest all three bugs with the same tests once I fix any changes you want me to make to those two bugs (and I'm sure there will be at least a few).

Or, we can file a separate ticket for this path_state_t change, and do the retesting for it and those two after you review them.

comment:28 in reply to:  27 Changed 7 years ago by nickm

Replying to mikeperry:

Replying to nickm:

Replying to mikeperry:

I did not do anything with timestamp_dirty yet. I'm pretty sure we want to keep at least the hidserv timestamp_dirty additions.. Do you just want me to move the timestamp_dirty change to the cannibalization code? It does appear like that might make us decide against re-cannibalizing it, and may impact our use of cannibalized circuits under predictive building conditions + new identity.. hrmm...

At this point, I think whichever option is simplest would be best here. This is tricky stuff, and at least your existing code is tested... but if it breaks functionality we care about, we need to think of the simplest fix there. Regressions are teh suxx0r.

If you want me to switch to another path_state_t flag, I can do that, but it's a bit more changes+refactoring.

I do like the idea of having symmetry in the path_state_t transitions..

I do too, but I *also* like the idea of having this stuff merged RSN.

Right now we have BUILD_ATTEMPTED, BUILD_SUCCEEDED, and USE_SUCCEEDED, but no USE_ATTEMPTED. USE_ATTEMPTED would allow us to stop abusing timestamp_dirty as part of our path_state_t state machine, and *might* stave off future bugs due to additional future use of timestamp_dirty..

If you're willing to hold off on merging this to review #7341 and #7691, it probably makes sense to make this change now. That way, I can retest all three bugs with the same tests once I fix any changes you want me to make to those two bugs (and I'm sure there will be at least a few).

Or, we can file a separate ticket for this path_state_t change, and do the retesting for it and those two after you review them.

I think that's the better idea.

So, modulo the path_state_t thing, are the issues we've talked about all fixed in the branch now?

comment:29 in reply to:  18 Changed 7 years ago by mikeperry

Yes, 406d59a9c93e46bcb5be0e0a5c087f4860522d56 should cover everything.

comment:30 Changed 7 years ago by nickm

Resolution: implemented
Status: needs_revision → closed

Okay, merged it. Merry Christmas! Does proposal 209 close now?

comment:31 Changed 7 years ago by mikeperry

We still need to do something to describe the changes I noted on tor-dev. We could just fold everything into path-spec.txt, but the final form of #7341 will influence what we actually need to say in the documentation.

I also created #7802 for removing the timestamp_dirty usage.

comment:32 in reply to:  31 Changed 7 years ago by mikeperry

Replying to mikeperry:

We still need to do something to describe the changes I noted on tor-dev. We could just fold everything into path-spec.txt, but the final form of #7341 will influence what we actually need to say in the documentation.

And of course by #7341, I meant #7691 here.

comment:33 in reply to:  31 ; Changed 7 years ago by nickm

Replying to mikeperry:

We still need to do something to describe the changes I noted on tor-dev. We could just fold everything into path-spec.txt, but the final form of #7341 will influence what we actually need to say in the documentation.

Sounds like an okay idea. Want to open a ticket for that, unless we already have one?

comment:34 in reply to:  33 Changed 7 years ago by mikeperry

Replying to nickm:

Replying to mikeperry:

We still need to do something to describe the changes I noted on tor-dev. We could just fold everything into path-spec.txt, but the final form of #7341 will influence what we actually need to say in the documentation.

Sounds like an okay idea. Want to open a ticket for that, unless we already have one?

#7804.

Note: See TracTickets for help on using tickets.