Opened 6 months ago

Closed 6 months ago

#28458 closed defect (fixed)

Stop resolving domains locally and check same flags for the 2nd hop

Reported by: juga Owned by: juga
Priority: Medium Milestone: sbws: 1.0.x-final
Component: Core Tor/sbws Version: sbws: 1.0.0
Severity: Normal Keywords: sbws-1.0-must-moved-20181128
Cc: pastly, juga Actual Points:
Parent ID: Points:
Reviewer: pastly Sponsor:

Description (last modified by juga)

To check if an exit can exit to the destination, sbws tries to resolve the domain locally, which fails in many cases.
Even if it does not fail, the IP obtained won't be the same IP to which the exit will make the HTTP request when a CDN is in use.
When domain resolution fails, sbws tries to choose another exit without restricting the choice to relays that do not have the BadExit flag and without checking that the relay can exit to an HTTP port.
sbws will also be faster and produce fewer errors by not resolving domains locally.

Child Tickets

Ticket    Status    Summary                                                         Component
#28461    closed    Encourage sbws operators to install a local caching resolver    Core Tor/sbws
#28471    closed    Refactor sbws exit checking so it is consistent                 Core Tor/sbws

Change History (28)

comment:1 Changed 6 months ago by juga

Status: assigned → needs_review

comment:2 Changed 6 months ago by pastly

sbws is trying to resolve the domain locally, which fails in many cases.

Really? That seems like a problem. Is the system's resolver having issues? Should sbws cache good results it gets back?

Even if it does not fail, the IP obtained won't be the same IP to which the exit will make the HTTP request.

This could be the case, but isn't necessarily the case. For simple destinations (simple like a single webserver as opposed to complex like a CDN) it's most likely not the case that the IPs will be different.

When the domain resolution is failing, sbws tries to choose another relay that does not have the exit flag. If it is not an exit, it will fail to make an HTTP request.

If DNS fails at line 180, then we go to the else block at line 186, where the "second hop" we pick is an exit.

So if we are trying to measure an exit and DNS fails, we treat the exit as a non-exit and find an exit to help measure it. This may not be ideal, but it works.
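The fallback described above can be sketched roughly as follows. This is a hypothetical illustration with made-up relay structures, not the actual sbws code:

```python
import random

def pick_circuit(target, candidates, dns_ok):
    # Sketch of the fallback described above (hypothetical structures,
    # not sbws's real data model): if the target is an exit and local
    # DNS resolution worked, measure it in the exit (2nd hop) position
    # with a non-exit helper; otherwise treat it like a non-exit and
    # pick an exit helper for the 2nd hop.
    if target["exit"] and dns_ok:
        helper = random.choice([r for r in candidates if not r["exit"]])
        return (helper["nick"], target["nick"])
    helper = random.choice([r for r in candidates if r["exit"]])
    return (target["nick"], helper["nick"])
```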

I don't see what you see: I don't see where sbws chooses a non-exit and then tries to use it as the last hop in a circuit.

Right now I don't like these changes and don't agree with merging them. I will also leave comments on the PR, but acknowledging/fixing just those does not mean I think the code is ready to be merged.

comment:3 in reply to:  2 ; Changed 6 months ago by teor

Replying to pastly:

sbws is trying to resolve the domain locally, which fails in many cases.

Really? That seems like a problem. Is the system's resolver having issues? Should sbws cache good results it gets back?

If the system resolver can't answer DNS queries, then the operator should install a caching / recursive local resolver. See #28461.

sbws *could* cache results, but a proper DNS resolver is going to be better at DNS than any sbws code we can write.

Even if it does not fail, the IP obtained won't be the same IP to which the exit will make the HTTP request.

This could be the case, but isn't necessarily the case. For simple destinations (simple like a single webserver as opposed to complex like a CDN) it's most likely not the case that the IPs will be different.

Even if the IP address is different, the content should be the same: only the latency would vary.

So if we are trying to measure an exit and DNS fails, we treat the exit as a non-exit and find an exit to help measure it. This may not be ideal, but it works.

What happens if sbws gets END_REASON_EXITPOLICY from the exit?
(END_REASON_EXITPOLICY is the response when an exit's policy won't allow the IP address that the exit resolved.)

This can happen because:

  • the exit has just changed its policy, and sbws has an old version
  • the exit resolves a different IP address from sbws

I think we should measure the relay as a non-exit in this case.

What happens if the exit can't connect because the connection is refused, or times out?

This can happen because:

  • the exit is busy
  • the exit is censored
  • the remote site is down

I think we should measure the relay as a non-exit in this case.

We could cover all these cases by choosing at random to measure exits as non-exit relays; doing so about half the time would work.
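The random-position idea could look roughly like this. It is a sketch with hypothetical names, not an sbws API:

```python
import random

def measurement_position(relay_is_exit, rng=random):
    # Sketch of the idea above (hypothetical function, not sbws code):
    # measure an exit in the exit (2nd hop) position only about half
    # the time, so relays whose DNS results or exit policies disagree
    # with sbws still get measured as non-exits the rest of the time.
    if relay_is_exit and rng.random() < 0.5:
        return "exit"      # relay is the 2nd hop and exits itself
    return "non-exit"      # relay is the 1st hop; an exit helper exits
```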

comment:4 in reply to:  3 Changed 6 months ago by pastly

Replying to teor:

What happens if sbws gets END_REASON_EXITPOLICY from the exit?
(END_REASON_EXITPOLICY is the response when an exit's policy won't allow the IP address that the exit resolved.)

This can happen because:

  • the exit has just changed its policy, and sbws has an old version
  • the exit resolves a different IP address from sbws

I think we should measure the relay as a non-exit in this case.

What happens if the exit can't connect because the connection is refused, or times out?

This can happen because:

  • the exit is busy
  • the exit is censored
  • the remote site is down

I think we should measure the relay as a non-exit in this case.

We could cover all these cases by choosing at random to measure exits as non-exit relays; doing so about half the time would work.

I like the idea of recovering from END_REASON_EXITPOLICY better, but the 50% probability idea is way easier to implement. The former is harder in part because, IIRC, we currently never look at circuit/stream close reasons, so we would need to implement doing that in a clean way.

comment:5 Changed 6 months ago by juga

Status: needs_review → needs_revision

Sorry, I didn't put this ticket in needs_review before because:

  1. I thought it was just Travis, then I realized it needed more work
  2. I was not expecting any of you to look at this now (there have been tickets in the last month that nobody reviewed within a week, which I then merged myself), so I thought I'd solve it fast

comment:6 Changed 6 months ago by juga

We were checking bad exit, exit policy and exit in different ways in different parts of the code.
The local domain resolver is only needed to check the IP against the policy, which I don't think is needed: if a measurement fails, the relay will be measured later with a probably different exit.
With the changes I've made, I check only the port, but check it in all attempts to get an exit.
I also found that the port was not being correctly parsed.
The changes solve the problem of a local domain resolver that fails.
I'm running a testing instance and it is faster, failing fewer circuits.

comment:7 Changed 6 months ago by juga

Status: needs_revision → needs_review

comment:8 Changed 6 months ago by juga

Status: needs_review → needs_revision

I want to check something before reviewing.

comment:9 in reply to:  6 ; Changed 6 months ago by teor

Replying to juga:

We were checking bad exit, exit policy and exit in different ways in different parts of the code.

Sounds like you need to refactor the code so you check the flags and exit policy in a single function.

The local domain resolver is only needed to check the IP against the policy, which I don't think is needed: if a measurement fails, the relay will be measured later with a probably different exit.

But what happens when the exit that's being measured fails?
Do we need to give it a 50% chance of being measured like a non-exit?

With the changes I've made, I check only the port, but check it in all attempts to get an exit.
I also found that the port was not being correctly parsed.
The changes solve the problem of a local domain resolver that fails.
I'm running a testing instance and it is faster, failing fewer circuits.

Ok, sounds good.

comment:10 in reply to:  9 Changed 6 months ago by juga

Replying to teor:

But what happens when the exit that's being measured fails?

If the exit is the relay being measured (1st hop), it is treated like any other relay:
the result is stored as an error, and it will be measured again in the next prioritization loop, which gives lower priority to relays that failed.
A new exit is chosen for every relay that gets measured, so the next time the relay that failed is measured, it will get a random exit with similar bandwidth.
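The prioritization loop described here can be sketched as follows; the "recent_errors" field is hypothetical, not sbws's real result store:

```python
def prioritize(relays):
    # Sketch of the loop described above (hypothetical "recent_errors"
    # field, not sbws's actual API): relays whose recent attempts
    # failed are still measured, but after relays with fewer recent
    # failures.  sorted() is stable, so ties keep their original order.
    return sorted(relays, key=lambda r: r["recent_errors"])
```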

Do we need to give it a 50% chance of being measured like a non-exit?

I don't understand what you mean by "non-exit": all relays are measured as non-exits (1st hop), and the exits are chosen independently for the 2nd hop.

comment:11 Changed 6 months ago by juga

Description: modified (diff)
Summary: Stop resolving domains locally and stop using non-exits as 2nd hop → Stop resolving domains locally and check same flags for the 2nd hop

Edited the title and description to match what is happening.

comment:12 in reply to:  2 Changed 6 months ago by juga

Replying to pastly:

When the domain resolution is failing, sbws tries to choose another relay that does not have the exit flag. If it is not an exit, it will fail to make an HTTP request.

If DNS fails line 180 then we go to the else block on line 186 where the "second hop" we pick there is an exit.

I've changed the title to describe what was really happening.

comment:13 Changed 6 months ago by juga

Status: needs_revision → needs_review

comment:14 Changed 6 months ago by pastly

Status: needs_review → needs_revision

Changes requested in the PR. And now I'm going to repeat myself a bit.

Replying to juga

I'm running a testing instance and it is faster, failing fewer circuits.

Okay. Let's do away with resolving domain names locally.

All we need to do to accomplish this is give stem a None in place of a host when asking it to check exit policies.
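A simplified model of what a port-only policy check then does, assuming every rule applies to all addresses. This is NOT stem's implementation, just an illustration of the behaviour when no address is supplied:

```python
def can_exit_to_port(rules, port):
    # Simplified model of a port-only exit-policy check (not stem's
    # implementation): each rule is (action, low_port, high_port) and
    # is assumed to apply to every address, so with no address to
    # match, the first rule whose port range contains the port decides.
    for action, low, high in rules:
        if low <= port <= high:
            return action == "accept"
    return False  # no matching rule: treated as reject in this sketch
```

For example, with a policy like `reject *:25, accept *:1-65535`, the check passes for port 443 and fails for port 25.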

Replying to juga

Do we need to give it a 50% chance of being measured like a non-exit?

I don't understand what you mean by "non-exit": all relays are measured as non-exits (1st hop), and the exits are chosen independently for the 2nd hop.

Right now, when sbws measures an exit, it tries to measure it in the 2nd hop position (ignoring DNS failures, etc.).

With your changes, all relays (exits and non-exits) are measured in the 1st hop position. Why?


I think there are two changes that should be made as part of this ticket, and somehow that's exploded into a lot of little buggy revisions.

  1. Give stem a None for a host
  2. When picking an exit helper (for measuring a non-exit or for measuring an exit that can't exit to our destination port, *it doesn't matter*), be sure to check that its exit policy allows exiting to our destination port.
Last edited 6 months ago by pastly (previous) (diff)

comment:15 in reply to:  14 ; Changed 6 months ago by juga

Replying to pastly:
[...]

Replying to juga

Do we need to give it a 50% chance of being measured like a non-exit?

I don't understand what you mean by "non-exit": all relays are measured as non-exits (1st hop), and the exits are chosen independently for the 2nd hop.

Right now when sbws measures an exit, it tries to measure it in the 2nd hop position (ignoring DNS failures etc. etc)

With your changes, all relays (exits and non-exits) are measured in the 1st hop position. Why?

It was confusing because of:

  • what _pick_ideal_second_hop does: it is actually not a second hop that it picks, but a helper.
  • is_exit: it can be interpreted as describing what the relay to measure is, rather than what the candidates must be.
  • the criteria used to pick the 2nd hop (see below).

I'll change those names in another ticket, so I don't get confused again.


I think there are two changes that should be made as part of this ticket, and somehow that's exploded into a lot of little buggy revisions.

Solution:
When you feel like reviewing a ticket, put yourself as reviewer; otherwise I'll assume nobody is going to review the ticket.
I won't put any ticket in needs_review unless it has a reviewer.
If it doesn't have a reviewer, I might do the refactors that help me solve the ticket in the same branch (in different commits).
If you'd like to continue reviewing this ticket, please put yourself as reviewer. If not, there's no need to read further.

  1. When picking an exit helper (for measuring a non-exit or for measuring an exit that can't exit to our destination port, *it doesn't matter*), be sure to check that its exit policy allows exiting to our destination port.

So, you're saying:

  1. when the helper is in the 2nd hop, check the exit policy.

But currently, you're only checking that it has the exit flag (https://gitweb.torproject.org/sbws.git/tree/sbws/core/scanner.py#n140)

  2. it doesn't matter that the relay does *not* allow exiting to our destination port or is not an exit

I agree, but currently, as a result of the else, it has to *not* have the BadExit flag and *not* allow exiting to our destination port (https://gitweb.torproject.org/sbws.git/tree/sbws/core/scanner.py#n186)

In the current code, why would you check the policy of an exit and the BadExit flag when you use the relay to measure in the 2nd hop, but only check the Exit flag when the helper is in the 2nd position?

I think we should check that the relay allows exiting to our destination port in all cases (and there is no need to check the Exit flag, because if it can exit, it's an exit).

Would you stop checking that it does *not* have the BadExit flag?
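The unified check proposed here could be sketched like this; the relay attributes are hypothetical, not sbws's API, and whether to keep the BadExit filter is the open question above:

```python
def exit_helper_candidates(relays, dest_port):
    # Sketch of the unified check proposed above (hypothetical relay
    # attributes, not sbws's API): pick 2nd-hop helpers purely by
    # whether their exit policy allows the destination port.  No Exit
    # flag check is needed: if it can exit, it's an exit.  The BadExit
    # filter is kept here, but the thread leaves it as an open question.
    return [r for r in relays
            if not r["badexit"] and r["policy_allows"](dest_port)]
```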

comment:16 in reply to:  15 Changed 6 months ago by teor

Replying to juga:

Replying to pastly:

I think there are two changes that should be made as part of this ticket, and somehow that's exploded into a lot of little buggy revisions.

Solution:
When you feel like reviewing a ticket, put yourself as reviewer; otherwise I'll assume nobody is going to review the ticket.
I won't put any ticket in needs_review unless it has a reviewer.
If it doesn't have a reviewer, I might do the refactors that help me solve the ticket in the same branch (in different commits).
If you'd like to continue reviewing this ticket, please put yourself as reviewer. If not, there's no need to read further.

In Core Tor/Tor, we set a ticket as needs_review when it is ready for review, then assign a reviewer.

You can use a different process for sbws if you like.

But if you want more reviews from people who usually work on Tor, please use the same review process.

comment:17 Changed 6 months ago by teor

Reviewer: teor

comment:18 Changed 6 months ago by juga

We could also send a RESOLVE command, but I think ADDRMAP can only resolve IPv4, and
I guess there could be a case in which a policy rejects IPv4 but accepts IPv6.

comment:19 Changed 6 months ago by teor

Milestone: sbws 1.0 (MVP must)

comment:20 Changed 6 months ago by juga

Status: needs_revision → needs_review

OK, trying to solve it in a simpler way: https://github.com/torproject/sbws/pull/294

comment:21 Changed 6 months ago by teor

pastly, can you do the initial review here?
I think you understand this code better.

comment:22 Changed 6 months ago by teor

Keywords: sbws-1.0-must-moved-20181128 added
Milestone: sbws 1.0 (MVP must) → sbws 1.0.3

Moving all sbws 1.0 must bugs to 1.0.3.

comment:23 Changed 6 months ago by teor

Milestone: sbws 1.0.3 → sbws 1.0.x

Milestone renamed

comment:24 Changed 6 months ago by teor

Milestone: sbws 1.0.x → sbws: 1.0.x

Milestone renamed

comment:25 Changed 6 months ago by teor

Milestone: sbws: 1.0.x → sbws: 1.0.x-final

Milestone renamed

comment:26 Changed 6 months ago by pastly

Reviewer: teor → pastly

As requested. Doing it now.

comment:27 Changed 6 months ago by pastly

Status: needs_review → merge_ready

LGTM.

Recommendations:

  • remove Relay.can_exit_to(...) after verifying it will be unused once this patch is applied.
  • remove resolve(...) after verifying it will be unused once this patch is applied.

Your choice whether to act on those recommendations. This stays merge_ready if you decide not to act.

comment:28 Changed 6 months ago by juga

Resolution: fixed
Status: merge_ready → closed

I removed what you proposed and created #28706 to implement resolving IPs using Tor itself at a later stage.
Merged.

Note: See TracTickets for help on using tickets.