Opened 6 months ago

Closed 6 months ago

#28458 closed defect (fixed)

Stop resolving domains locally and check same flags for the 2nd hop

Reported by: juga Owned by: juga
Priority: Medium Milestone: sbws: 1.0.x-final
Component: Core Tor/sbws Version: sbws: 1.0.0
Severity: Normal Keywords: sbws-1.0-must-moved-20181128
Cc: pastly, juga Actual Points:
Parent ID: Points:
Reviewer: pastly Sponsor:

Description (last modified by juga)

To check if an exit can exit to the destination, sbws tries to resolve the domain locally, which fails in many cases.
Even if it does not fail, the IP obtained won't be the same IP to which the exit will make the HTTP request when a CDN is in use.
When domain resolution fails, sbws tries to choose another exit without restricting the choice to relays that do not have the BadExit flag and without checking that the relay can exit to an HTTP port.
sbws will also be faster and produce fewer errors by not resolving domains locally.

Child Tickets

Ticket    Status    Summary                                                         Component
#28461    closed    Encourage sbws operators to install a local caching resolver    Core Tor/sbws
#28471    closed    Refactor sbws exit checking so it is consistent                 Core Tor/sbws

Change History (28)

comment:1 Changed 6 months ago by juga

Status: assigned → needs_review

comment:2 Changed 6 months ago by pastly

sbws is trying to resolve the domain locally, which fails in many cases.

Really? That seems like a problem. Is the system's resolver having issues? Should sbws cache good results it gets back?

Even if it does not fail, the IP obtained won't be the same IP to which the exit will make the HTTP request.

This could be the case, but isn't necessarily the case. For simple destinations (simple like a single webserver as opposed to complex like a CDN) it's most likely not the case that the IPs will be different.

When the domain resolution is failing, sbws tries to choose another relay that does not have the exit flag. If it is not an exit, it will fail to make an HTTP request.

If DNS fails at line 180, then we go to the else block at line 186, where the "second hop" we pick is an exit.

So if we are trying to measure an exit and DNS fails, we treat the exit as a non-exit and find an exit to help measure it. This may not be ideal, but it works.
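The fallback described above can be sketched roughly as follows. This is a hypothetical illustration with made-up relay structures, not the actual sbws code:

```python
import random

def pick_circuit(target, candidates, dns_ok):
    # Sketch of the fallback described above (hypothetical structures,
    # not sbws's real data model): if the target is an exit and local
    # DNS resolution worked, measure it in the exit (2nd hop) position
    # with a non-exit helper; otherwise treat it like a non-exit and
    # pick an exit helper for the 2nd hop.
    if target["exit"] and dns_ok:
        helper = random.choice([r for r in candidates if not r["exit"]])
        return (helper["nick"], target["nick"])
    helper = random.choice([r for r in candidates if r["exit"]])
    return (target["nick"], helper["nick"])
```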

I don't see what you see: I don't see where sbws chooses a non-exit and then tries to use it as the last hop in a circuit.

Right now I don't like these changes and don't agree with merging them. I will also leave comments on the PR, but acknowledging/fixing just those does not mean I think the code is ready to be merged.

comment:3 in reply to:  2 ; Changed 6 months ago by teor

Replying to pastly:

sbws is trying to resolve the domain locally, which fails in many cases.

Really? That seems like a problem. Is the system's resolver having issues? Should sbws cache good results it gets back?

If the system resolver can't answer DNS queries, then the operator should install a caching / recursive local resolver. See #28461.

sbws *could* cache results, but a proper DNS resolver is going to be better at DNS than any sbws code we can write.

Even if it does not fail, the IP obtained won't be the same IP to which the exit will make the HTTP request.

This could be the case, but isn't necessarily the case. For simple destinations (simple like a single webserver as opposed to complex like a CDN) it's most likely not the case that the IPs will be different.

Even if the IP address is different, the content should be the same: only the latency would vary.

So if we are trying to measure an exit and DNS fails, we treat the exit as a non-exit and find an exit to help measure it. This may not be ideal, but it works.

What happens if sbws gets END_REASON_EXITPOLICY from the exit?
(END_REASON_EXITPOLICY is the response when an exit's policy won't allow the IP address that the exit resolved.)

This can happen because:

  • the exit has just changed its policy, and sbws has an old version
  • the exit resolves a different IP address from sbws

I think we should measure the relay as a non-exit in this case.

What happens if the exit can't connect because the connection is refused, or times out?

This can happen because:

  • the exit is busy
  • the exit is censored
  • the remote site is down

I think we should measure the relay as a non-exit in this case.

We could cover all these cases by choosing at random to measure exits as non-exit relays; doing so about half the time would work.
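The random-position idea could look roughly like this. It is a sketch with hypothetical names, not an sbws API:

```python
import random

def measurement_position(relay_is_exit, rng=random):
    # Sketch of the idea above (hypothetical function, not sbws code):
    # measure an exit in the exit (2nd hop) position only about half
    # the time, so relays whose DNS results or exit policies disagree
    # with sbws still get measured as non-exits the rest of the time.
    if relay_is_exit and rng.random() < 0.5:
        return "exit"      # relay is the 2nd hop and exits itself
    return "non-exit"      # relay is the 1st hop; an exit helper exits
```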

comment:4 in reply to:  3 Changed 6 months ago by pastly

Replying to teor:

What happens if sbws gets END_REASON_EXITPOLICY from the exit?
(END_REASON_EXITPOLICY is the response when an exit's policy won't allow the IP address that the exit resolved.)

This can happen because:

  • the exit has just changed its policy, and sbws has an old version
  • the exit resolves a different IP address from sbws

I think we should measure the relay as a non-exit in this case.

What happens if the exit can't connect because the connection is refused, or times out?

This can happen because:

  • the exit is busy
  • the exit is censored
  • the remote site is down

I think we should measure the relay as a non-exit in this case.

We could cover all these cases by choosing at random to measure exits as non-exit relays; doing so about half the time would work.

I like the idea of recovering from END_REASON_EXITPOLICY better, but the 50% probability idea is way easier to implement. The former is harder in part because, IIRC, we currently never look at circuit/stream close reasons, so we would need to implement doing that in a clean way.

comment:5 Changed 6 months ago by juga

Status: needs_review → needs_revision

Sorry, I didn't put this ticket in needs_review before because:

  1. I thought it was just Travis, then I realized it needed more work
  2. I was not expecting any of you to look at this now (there have been tickets in the last month that nobody reviewed within a week, which I then merged myself), so I thought I'd solve it fast

comment:6 Changed 6 months ago by juga

We were checking bad exit, exit policy and exit in different ways in different parts of the code.
The local domain resolver is only needed to check the IP against the policy, which I don't think is needed: if a measurement fails, the relay will be measured later with a probably different exit.
With the changes I've made, I check only the port, but check it in all attempts to get an exit.
I also found that the port was not being correctly parsed.
The changes solve the problem of a local domain resolver that fails.
I'm running a testing instance and it is faster, failing fewer circuits.

comment:7 Changed 6 months ago by juga

Status: needs_revision → needs_review

comment:8 Changed 6 months ago by juga

Status: needs_review → needs_revision

I want to check something before reviewing.

comment:9 in reply to:  6 ; Changed 6 months ago by teor

Replying to juga:

We were checking bad exit, exit policy and exit in different ways in different parts of the code.

Sounds like you need to refactor the code so you check the flags and exit policy in a single function.

The local domain resolver is only needed to check the IP against the policy, which I don't think is needed: if a measurement fails, the relay will be measured later with a probably different exit.

But what happens when the exit that's being measured fails?
Do we need to give it a 50% chance of being measured like a non-exit?

With the changes I've made, I check only the port, but check it in all attempts to get an exit.
I also found that the port was not being correctly parsed.
The changes solve the problem of a local domain resolver that fails.
I'm running a testing instance and it is faster, failing fewer circuits.

Ok, sounds good.

comment:10 in reply to:  9 Changed 6 months ago by juga

Replying to teor:

But what happens when the exit that's being measured fails?

If the exit is the relay being measured (1st hop), it is treated like any other relay:
the result is stored as an error, and it will be measured again in the next prioritization loop, which gives lower priority to relays that failed.
A new exit is chosen for every relay that gets measured, so the next time the relay that failed is measured, it will get a random exit with similar bandwidth.
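The prioritization loop described here can be sketched as follows; the "recent_errors" field is hypothetical, not sbws's real result store:

```python
def prioritize(relays):
    # Sketch of the loop described above (hypothetical "recent_errors"
    # field, not sbws's actual API): relays whose recent attempts
    # failed are still measured, but after relays with fewer recent
    # failures.  sorted() is stable, so ties keep their original order.
    return sorted(relays, key=lambda r: r["recent_errors"])
```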

Do we need to give it a 50% chance of being measured like a non-exit?

I don't understand what you mean by "non-exit": all relays are measured as non-exits (1st hop), and the exits are chosen independently for the 2nd hop.

comment:11 Changed 6 months ago by juga

Description: modified (diff)
Summary: Stop resolving domains locally and stop using non-exits as 2nd hop → Stop resolving domains locally and check same flags for the 2nd hop

Edited the title and description to match what is happening.

comment:12 in reply to:  2 Changed 6 months ago by juga

Replying to pastly:

When the domain resolution is failing, sbws tries to choose another relay that does not have the exit flag. If it is not an exit, it will fail to make an HTTP request.

If DNS fails line 180 then we go to the else block on line 186 where the "second hop" we pick there is an exit.

I've changed the title to describe what was really happening.

comment:13 Changed 6 months ago by juga

Status: needs_revision → needs_review

comment:14 Changed 6 months ago by pastly

Status: needs_review → needs_revision

Changes requested in the PR. And now I'm going to repeat myself a bit.

Replying to juga

I'm running a testing instance and it is faster, failing fewer circuits.

Okay. Let's do away with resolving domain names locally.

All we need to do to accomplish this is give stem a None in place of a host when asking it to check exit policies.
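A simplified model of what a port-only policy check then does, assuming every rule applies to all addresses. This is NOT stem's implementation, just an illustration of the behaviour when no address is supplied:

```python
def can_exit_to_port(rules, port):
    # Simplified model of a port-only exit-policy check (not stem's
    # implementation): each rule is (action, low_port, high_port) and
    # is assumed to apply to every address, so with no address to
    # match, the first rule whose port range contains the port decides.
    for action, low, high in rules:
        if low <= port <= high:
            return action == "accept"
    return False  # no matching rule: treated as reject in this sketch
```

For example, with a policy like `reject *:25, accept *:1-65535`, the check passes for port 443 and fails for port 25.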

Replying to juga

Do we need to give it a 50% chance of being measured like a non-exit?

I don't understand what you mean by "non-exit": all relays are measured as non-exits (1st hop), and the exits are chosen independently for the 2nd hop.

Right now, when sbws measures an exit, it tries to measure it in the 2nd hop position (ignoring DNS failures, etc.).

With your changes, all relays (exits and non-exits) are measured in the 1st hop position. Why?


I think there are two changes that should be made as part of this ticket, and somehow that's exploded into a lot of little buggy revisions.

  1. Give stem a None for a host
  2. When picking an exit helper (for measuring a non-exit or for measuring an exit that can't exit to our destination port, *it doesn't matter*), be sure to check that its exit policy allows exiting to our destination port.
Last edited 6 months ago by pastly (previous) (diff)

comment:15 in reply to:  14 ; Changed 6 months ago by juga

Replying to pastly:
[...]

Replying to juga

Do we need to give it a 50% chance of being measured like a non-exit?

I don't understand what you mean by "non-exit": all relays are measured as non-exits (1st hop), and the exits are chosen independently for the 2nd hop.

Right now when sbws measures an exit, it tries to measure it in the 2nd hop position (ignoring DNS failures etc. etc)

With your changes, all relays (exits and non-exits) are measured in the 1st hop position. Why?

It was confusing because of:

  • what _pick_ideal_second_hop does: it is actually not a second hop that it picks, but a helper.
  • is_exit: it can be interpreted as describing what the relay to measure is, rather than what the candidates must be.
  • the criteria used to pick the 2nd hop (see below).

I'll change those names in another ticket, so I don't get confused again.


I think there are two changes that should be made as part of this ticket, and somehow that's exploded into a lot of little buggy revisions.

Solution:
When you feel like reviewing a ticket, put yourself as reviewer; otherwise I'll assume nobody is going to review the ticket.
I won't put any ticket in needs_review unless it has a reviewer.
If it doesn't have a reviewer, I might do the refactors that help me solve the ticket in the same branch (in different commits).
If you'd like to continue reviewing this ticket, please put yourself as reviewer. If not, there's no need to read further.

  1. When picking an exit helper (for measuring a non-exit or for measuring an exit that can't exit to our destination port, *it doesn't matter*), be sure to check that its exit policy allows exiting to our destination port.

So, you're saying:

  1. when the helper is in the 2nd hop, check the exit policy.

But currently, you're only checking that it has the exit flag (https://gitweb.torproject.org/sbws.git/tree/sbws/core/scanner.py#n140)

  2. it doesn't matter that the relay does *not* allow exiting to our destination port or is not an exit

I agree, but currently, as a result of the else, it has to *not* have the BadExit flag and *not* allow exiting to our destination port (https://gitweb.torproject.org/sbws.git/tree/sbws/core/scanner.py#n186)

In the current code, why would you check the policy of an exit and the BadExit flag when you use the relay to measure in the 2nd hop, but only check the Exit flag when the helper is in the 2nd position?

I think we should check that the relay allows exiting to our destination port in all cases (and there is no need to check the Exit flag, because if it can exit, it's an exit).

Would you stop checking that it does *not* have the BadExit flag?
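The unified check proposed here could be sketched like this; the relay attributes are hypothetical, not sbws's API, and whether to keep the BadExit filter is the open question above:

```python
def exit_helper_candidates(relays, dest_port):
    # Sketch of the unified check proposed above (hypothetical relay
    # attributes, not sbws's API): pick 2nd-hop helpers purely by
    # whether their exit policy allows the destination port.  No Exit
    # flag check is needed: if it can exit, it's an exit.  The BadExit
    # filter is kept here, but the thread leaves it as an open question.
    return [r for r in relays
            if not r["badexit"] and r["policy_allows"](dest_port)]
```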

comment:16 in reply to:  15 Changed 6 months ago by teor

Replying to juga:

Replying to pastly:

I think there are two changes that should be made as part of this ticket, and somehow that's exploded into a lot of little buggy revisions.

Solution:
When you feel like reviewing a ticket, put yourself as reviewer; otherwise I'll assume nobody is going to review the ticket.
I won't put any ticket in needs_review unless it has a reviewer.
If it doesn't have a reviewer, I might do the refactors that help me solve the ticket in the same branch (in different commits).
If you'd like to continue reviewing this ticket, please put yourself as reviewer. If not, there's no need to read further.

In Core Tor/Tor, we set a ticket as needs_review when it is ready for review, then assign a reviewer.

You can use a different process for sbws if you like.

But if you want more reviews from people who usually work on Tor, please use the same review process.

comment:17 Changed 6 months ago by teor

Reviewer: teor

comment:18 Changed 6 months ago by juga

We could also send a RESOLVE command, but I think ADDRMAP can only resolve IPv4, and
I guess there could be a case in which a policy rejects IPv4 but accepts IPv6.

comment:19 Changed 6 months ago by teor

Milestone: sbws 1.0 (MVP must)

comment:20 Changed 6 months ago by juga

Status: needs_revision → needs_review

OK, trying to solve it in a simpler way: https://github.com/torproject/sbws/pull/294

comment:21 Changed 6 months ago by teor

pastly, can you do the initial review here?
I think you understand this code better.

comment:22 Changed 6 months ago by teor

Keywords: sbws-1.0-must-moved-20181128 added
Milestone: sbws 1.0 (MVP must) → sbws 1.0.3

Moving all sbws 1.0 must bugs to 1.0.3.

comment:23 Changed 6 months ago by teor

Milestone: sbws 1.0.3 → sbws 1.0.x

Milestone renamed

comment:24 Changed 6 months ago by teor

Milestone: sbws 1.0.x → sbws: 1.0.x

Milestone renamed

comment:25 Changed 6 months ago by teor

Milestone: sbws: 1.0.x → sbws: 1.0.x-final

Milestone renamed

comment:26 Changed 6 months ago by pastly

Reviewer: teor → pastly

As requested. Doing it now.

comment:27 Changed 6 months ago by pastly

Status: needs_review → merge_ready

LGTM.

Recommendations:

  • remove Relay.can_exit_to(...) after verifying it will be unused once this patch is applied.
  • remove resolve(...) after verifying it will be unused once this patch is applied.

Your choice whether to act on those recommendations. This stays merge_ready if you decide not to act.

comment:28 Changed 6 months ago by juga

Resolution: fixed
Status: merge_ready → closed

I removed what you proposed and created #28706 to implement resolving IPs using Tor itself at a later stage.
Merged.

Note: See TracTickets for help on using tickets.