#32265 closed task (fixed)

MS: Format an exit list from a previous exit list and exitmap output

Reported by: irl
Owned by: irl
Priority: Medium
Milestone:
Component: Metrics/Exit Scanner
Version:
Severity: Normal
Keywords:
Cc: metrics-team
Actual Points:
Parent ID: #29654
Points:
Reviewer: karsten
Sponsor:

Description

From ndjson formatted PATHspider results, rewrite them as an exit list that can be consumed by check.tpo.
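For reference, an entry in the exit list format that check.tpo consumes (per the TorDNSEL exit list spec) looks like the following; the fingerprint, address, and timestamps here are placeholder values:

ExitNode 63BA28370F543D175173E414D5450590D73E22DC
Published 2019-11-21 12:00:00
LastStatus 2019-11-21 13:02:00
ExitAddress 192.0.2.1 2019-11-21 13:28:10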

Child Tickets

Attachments (3)

output (185.7 KB) - added by irl 13 months ago.
example output
merge.py (2.3 KB) - added by irl 13 months ago.
merged (152.6 KB) - added by irl 13 months ago.
formatted exit list

Change History (17)

comment:1 Changed 13 months ago by irl

Summary: MS: Format an exit list from PATHspider results → MS: Format an exit list from a previous exit list and exitmap output

exitmap can do this

Changed 13 months ago by irl

Attachment: output added

example output

Changed 13 months ago by irl

Attachment: merge.py added

Changed 13 months ago by irl

Attachment: merged added

formatted exit list

comment:2 Changed 13 months ago by irl

Cc: metrics-team added
Owner: changed from metrics-team to irl
Reviewer: karsten
Status: new → accepted

This desperately needs a local cache, but that's most of the logic. You currently have to hand-crank the system if you want to run it (see the command sketch after this list):

  • run exitmap with the plugin from #32264 and save the log file
  • run merge.py and save the output
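Roughly like this (the exitmap module name and the merge.py invocation are assumptions):

./bin/exitmap exitlists 2> exitmap.log    # hypothetical module name from #32264
python3 merge.py exitmap.log > exit-list  # argument handling is a guess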

comment:3 Changed 13 months ago by irl

Status: accepted → needs_review

comment:4 Changed 13 months ago by karsten

Reviewing now...

comment:5 Changed 13 months ago by karsten

Status: needs_review → needs_revision

Glad to see that the rewrite is progressing so quickly!

Couple remarks/questions:

  • Why 48 hours and not 24 hours? Doesn't the current exit scanner keep scan results for 24 hours? I might be wrong, though. Let's use whatever the current scanner does.
  • Rather than downloading exit lists from CollecTor, wouldn't it be sufficient to just read the latest exit list previously written by this scanner? And if there's none, just assume that no previous scans have happened. In theory, this should be all we need to learn.
  • It seems that LastStatus is only taken from exit lists downloaded from CollecTor but never set by new measurements. We should make a plan for what to do with this field. Take it out? Populate it with consensus valid-after times?
  • Does exitmap with the plugin use previous scans as input to decide which relays to scan? I believe that it uses some logic to avoid scanning relays too frequently. This has two effects: it doesn't generate more load on the network and on single relays than necessary, and it ensures that new relays are scanned sooner. As a result, the new scanner could be run once or twice per hour, rather than every 2 or 3 hours (at a 45-minute runtime).

comment:6 in reply to: 5 Changed 12 months ago by irl

Replying to karsten:

Glad to see that the rewrite is progressing so quickly!

Couple remarks/questions:

  • Why 48 hours and not 24 hours? Doesn't the current exit scanner keep scan results for 24 hours? I might be wrong, though. Let's use whatever the current scanner does.

https://2019.www.torproject.org/tordnsel/exitlist-spec.txt

It discards relays that were not seen in a consensus in the last 48 hours.
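For illustration, that pruning rule is tiny (a sketch, assuming each entry carries a parsed LastStatus datetime):

from datetime import datetime, timedelta

def prune_stale(entries, now=None):
    # Per the spec above: drop relays not seen in a consensus,
    # i.e. whose LastStatus is older than 48 hours.
    now = now or datetime.utcnow()
    cutoff = now - timedelta(hours=48)
    return [e for e in entries if e["LastStatus"] >= cutoff]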

  • Rather than downloading exit lists from CollecTor, wouldn't it be sufficient to just read the latest exit list previously written by this scanner? And if there's none, just assume that no previous scans have happened. In theory, this should be all we need to learn.

Probably, but this was a handy way to get test data and I wanted to try out the new Stem functionality. It would be nice to have a method to bootstrap a new scanner, but this could just mean manually downloading the latest exit list and putting it in the right place.
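As a sketch, bootstrapping could fetch the newest file from CollecTor's recent exit-lists directory; the timestamped filename pattern is an assumption here:

import re
import urllib.request

RECENT = "https://collector.torproject.org/recent/exit-lists/"

def bootstrap(dest):
    # Scrape the directory index for timestamped filenames and
    # download the most recent one as the starting exit list.
    index = urllib.request.urlopen(RECENT).read().decode()
    names = sorted(set(re.findall(r"\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}", index)))
    urllib.request.urlretrieve(RECENT + names[-1], dest)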

  • It seems that LastStatus is only taken from exit lists downloaded from CollecTor but never set by new measurements. We should make a plan for what to do with this field. Take it out? Populate it with consensus valid-after times?

Right, this is the tricky bit. Do you know if anything consumes the LastStatus or Published timestamps? Ideally we could just drop these, but for now I'm synthesizing them from the timestamp of the last measurement, which could be close enough for the consumers.
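For now that synthesis amounts to this (a sketch; the entry layout is an assumption):

def synthesize_timestamps(entry, measured_at):
    # Fall back to the last measurement time for both fields when no
    # descriptor or consensus information is available.
    entry.setdefault("Published", measured_at)
    entry.setdefault("LastStatus", measured_at)
    return entry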

  • Does exitmap with the plugin use previous scans as input to decide which relays to scan? I believe that it uses some logic to avoid scanning relays too frequently. This has two effects: it doesn't generate more load on the network and on single relays than necessary, and it ensures that new relays are scanned sooner. As a result, the new scanner could be run once or twice per hour, rather than every 2 or 3 hours (at a 45-minute runtime).

No. It scans the entire network every time. It does this asynchronously, and doesn't try to prioritize anything. Just whichever circuits are built first will be tested first. I was even thinking it could run continuously. If exit relays cannot cope with two HTTP requests an hour, perhaps they shouldn't be exit relays.

comment:7 Changed 12 months ago by irl

Status: needs_revision → needs_review

comment:8 in reply to: 6 Changed 12 months ago by karsten

Replying to irl:

Replying to karsten:

Glad to see that the rewrite is progressing so quickly!

Couple remarks/questions:

  • Why 48 hours and not 24 hours? Doesn't the current exit scanner keep scan results for 24 hours? I might be wrong, though. Let's use whatever the current scanner does.

https://2019.www.torproject.org/tordnsel/exitlist-spec.txt

It discards relays that were not seen in a consensus in the last 48 hours.

Okay, let's use 48 hours then.

  • Rather than downloading exit lists from CollecTor, wouldn't it be sufficient to just read the latest exit list previously written by this scanner? And if there's none, just assume that no previous scans have happened. In theory, this should be all we need to learn.

Probably, but this was a handy way to get test data and I wanted to try out the new Stem functionality. It would be nice to have a method to bootstrap a new scanner, but this could just mean manually downloading the latest exit list and putting it in the right place.

Actually, I think it's harmful to download exit lists from CollecTor and merge them with the scanner's own measurements. We should instead merge new scan results with previous local results. It's also yet another dependency to download something from CollecTor that is not really needed. I'd say kill this code.

  • It seems that LastStatus is only taken from exit lists downloaded from CollecTor but never set by new measurements. We should make a plan for what to do with this field. Take it out? Populate it with consensus valid-after times?

Right, this is the tricky bit. Do you know if anything consumes the LastStatus or Published timestamps? Ideally we could just drop these, but for now I'm synthesizing them from the timestamp of the last measurement, which could be close enough for the consumers.

Well, the spec says what these fields are being used for: Published is used to skip relays that haven't published a new descriptor since the one in the current consensus, and LastStatus is used to know when to throw out relays from the list. This is all under the assumption that the scanner reads its previous exit list from disk before making measurements.

My suggestion would be to use the consensus valid-after time as LastStatus time. It's pretty much the same as the published time in a version 2 status, and it would work for this purpose.
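Something along these lines, sketched against a consensus parsed with stem (the entry layout is an assumption):

def update_last_status(entries, consensus):
    # consensus: a stem NetworkStatusDocumentV3. Relays present in it
    # get its valid-after time as LastStatus; absent relays keep their
    # old LastStatus and eventually age out of the list.
    present = set(consensus.routers.keys())
    for entry in entries:
        if entry["ExitNode"] in present:
            entry["LastStatus"] = consensus.valid_after
    return entries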

  • Does exitmap with the plugin use previous scans as input to decide which relays to scan? I believe that it uses some logic to avoid scanning relays too frequently. This has two effects: it doesn't generate more load on the network and on single relays than necessary, and it ensures that new relays are scanned sooner. As a result, the new scanner could be run once or twice per hour, rather than every 2 or 3 hours (at a 45-minute runtime).

No. It scans the entire network every time. It does this asynchronously, and doesn't try to prioritize anything. Just whichever circuits are built first will be tested first. I was even thinking it could run continuously. If exit relays cannot cope with two HTTP requests an hour, perhaps they shouldn't be exit relays.

Ideally, we would change as few variables at a time as possible, in order to compare the new results with the old ones. Changing the scheduling from "only scan relays with changed descriptors" to "scan all relays once per hour" seems like a major design change that we could make at a later time.

comment:9 in reply to: 8 Changed 12 months ago by irl

Replying to karsten:

Actually, I think it's harmful to download exit lists from CollecTor and merge them with the scanner's own measurements. We should instead merge new scan results with previous local results. It's also yet another dependency to download something from CollecTor that is not really needed. I'd say kill this code.

Ok, it's gone.

Well, the spec says what these fields are being used for: Published is used to skip relays that haven't published a new descriptor since the one in the current consensus, and LastStatus is used to know when to throw out relays from the list. This is all under the assumption that the scanner reads its previous exit list from disk before making measurements.

My suggestion would be to use the consensus valid-after time as LastStatus time. It's pretty much the same as the published time in a version 2 status, and it would work for this purpose.

I saw what TorDNSEL is using it for, but I wonder if people use exit lists in ways we haven't anticipated. I guess we can synthesize the valid-after time from the measurement time, but our plugin is not directly handling consensuses or server descriptors. It would take changes to exitmap internals to get this data out.

No. It scans the entire network every time. It does this asynchronously, and doesn't try to prioritize anything. Just whichever circuits are built first will be tested first. I was even thinking it could run continuously. If exit relays cannot cope with two HTTP requests an hour, perhaps they shouldn't be exit relays.

Ideally, we would change as few variables at a time as possible, in order to compare the new results with the old ones. Changing the scheduling from "only scan relays with changed descriptors" to "scan all relays once per hour" seems like a major design change that we could make at a later time.

This could add a lot of time to the project. The exitmap architecture doesn't really have a way to do this, so it would take changes to the internals there. I guess we can perform the measurements and then throw them away as a shortcut option, but once we've done the measurements anyway, throwing them away seems wasteful.

comment:10 in reply to: 9 Changed 12 months ago by karsten

Replying to irl:

Replying to karsten:

Actually, I think it's harmful to download exit lists from CollecTor and merge them with the scanner's own measurements. We should instead merge new scan results with previous local results. It's also yet another dependency to download something from CollecTor that is not really needed. I'd say kill this code.

Ok, it's gone.

But it's still merging with the last-written local exit list?

Well, the spec says what these fields are being used for: Published is used to skip relays that haven't published a new descriptor since the one in the current consensus, and LastStatus is used to know when to throw out relays from the list. This is all under the assumption that the scanner reads its previous exit list from disk before making measurements.

My suggestion would be to use the consensus valid-after time as LastStatus time. It's pretty much the same as the published time in a version 2 status, and it would work for this purpose.

I saw what TorDNSEL is using it for, but I wonder if people use exit lists in ways we haven't anticipated. I guess we can synthesize the valid-after time from the measurement time, but our plugin is not directly handling consensuses or server descriptors. It would take changes to exitmap internals to get this data out.

I don't think we're using it (I'd have to check), nor do I know about others using it. But I'd be careful removing it or filling it with approximately correct data.

Can we somehow access the consensus used for scanning and fill in these fields as part of the merge script? Maybe we can extend exitmap to dump that consensus to disk at the time of making a list of relays to scan?
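For example, since exitmap already holds a stem controller, dumping the consensus it is using could be as simple as this sketch (where exactly to hook it into exitmap is an open question):

def dump_consensus(controller, path):
    # "dir/status-vote/current/consensus" is the standard GETINFO key
    # for the consensus the tor client is currently using.
    with open(path, "w") as f:
        f.write(controller.get_info("dir/status-vote/current/consensus"))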

No. It scans the entire network every time. It does this asynchronously, and doesn't try to prioritize anything. Just whichever circuits are built first will be tested first. I was even thinking it could run continuously. If exit relays cannot cope with two HTTP requests an hour, perhaps they shouldn't be exit relays.

Ideally, we would change as few variables at a time as possible, in order to compare the new results with the old ones. Changing the scheduling from "only scan relays with changed descriptors" to "scan all relays once per hour" seems like a major design change that we could make at a later time.

This could add a lot of time to the project. The exitmap architecture doesn't really have a way to do this, so it would take changes to the internals there. I guess we can perform the measurements and then throw them away as a shortcut option, but once we've done the measurements anyway, throwing them away seems wasteful.

I see. Then let's keep this in mind when comparing results. (This is mostly a note to myself. ;))

One question though: If scanning takes 45 minutes right now, can we schedule scans in a way that they will still work when scanning takes 75 minutes (larger network) or 15 minutes (fewer/faster exits)? For example, we should avoid concurrent runs, and if we do scans continuously, we should avoid too frequent scans.

comment:11 in reply to: 10 Changed 12 months ago by irl

Replying to karsten:

Replying to irl:

Replying to karsten:

Actually, I think it's harmful to download exit lists from CollecTor and merge them with the scanner's own measurements. We should instead merge new scan results with previous local results. It's also yet another dependency to download something from CollecTor that is not really needed. I'd say kill this code.

Ok, it's gone.

But it's still merging with the last-written local exit list?

Yes, it keeps a few on disk like OnionPerf does, but only reads the last one.
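Reading back only the newest one is simple (a sketch, assuming timestamp-sortable filenames):

import glob
import os

def latest_exit_list(directory):
    # Timestamped filenames sort chronologically, so the last path in
    # sorted order is the most recent list; None means a fresh install.
    paths = sorted(glob.glob(os.path.join(directory, "*")))
    return paths[-1] if paths else None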

I don't think we're using it (I'd have to check), nor do I know about others using it. But I'd be careful removing it or filling it with approximately correct data.

Can we somehow access the consensus used for scanning and fill in these fields as part of the merge script? Maybe we can extend exitmap to dump that consensus to disk at the time of making a list of relays to scan?

We can make that change, but I'd say it is not a priority until we're further along. We still have to fix up check and the DNS server, and if all the time is spent on the scanner we still end up with a broken service.

One question though: If scanning takes 45 minutes right now, can we schedule scans in a way that they will still work when scanning takes 75 minutes (larger network) or 15 minutes (fewer/faster exits)? For example, we should avoid concurrent runs, and if we do scans continuously, we should avoid too frequent scans.

import time

# Start at most one scan every 40 minutes and never run scans
# concurrently: a scan that takes longer than 40 minutes simply
# delays the next one until it finishes.
while True:
    start = time.monotonic()
    run_scanner()
    time.sleep(max(0, start + 40 * 60 - time.monotonic()))

comment:12 in reply to: 11 Changed 12 months ago by karsten

Replying to irl:

Replying to karsten:

Replying to irl:

Replying to karsten:

Actually, I think it's harmful to download exit lists from CollecTor and merge them with the scanner's own measurements. We should instead merge new scan results with previous local results. It's also yet another dependency to download something from CollecTor that is not really needed. I'd say kill this code.

Ok, it's gone.

But it's still merging with the last-written local exit list?

Yes, it keeps a few on disk like OnionPerf does, but only reads the last one.

Great!

I don't think we're using it (I'd have to check), nor do I know about others using it. But I'd be careful removing it or filling it with approximately correct data.

Can we somehow access the consensus used for scanning and fill in these fields as part of the merge script? Maybe we can extend exitmap to dump that consensus to disk at the time of making a list of relays to scan?

We can make that change, but I'd say it is not a priority until we're further along. We still have to fix up check and the DNS server, and if all the time is spent on the scanner we still end up with a broken service.

Agreed. Please put this on the list somewhere, so that we don't forget.

One question though: If scanning takes 45 minutes right now, can we schedule scans in a way that they will still work when scanning takes 75 minutes (larger network) or 15 minutes (fewer/faster exits)? For example, we should avoid concurrent runs, and if we do scans continuously, we should avoid too frequent scans.

import time

# Start at most one scan every 40 minutes and never run scans
# concurrently: a scan that takes longer than 40 minutes simply
# delays the next one until it finishes.
while True:
    start = time.monotonic()
    run_scanner()
    time.sleep(max(0, start + 40 * 60 - time.monotonic()))

Cool!

comment:13 Changed 11 months ago by irl

Status: needs_review → accepted

comment:14 Changed 10 months ago by irl

Resolution: fixed
Status: accepted → closed

This is completed.
