Opened 10 months ago

Closed 6 months ago

Last modified 5 months ago

#28547 closed defect (fixed)

Monitor relays that are not measured by each sbws instance

Reported by: teor
Owned by: juga
Priority: Medium
Milestone: sbws: 1.1.x-final
Component: Core Tor/sbws
Version:
Severity: Normal
Keywords: tor-bwauth, sbws-1.0-must-moved-20181128, sbws-11x-final-removed-20190312, sbws-110-proposed, changes-version-minor
Cc:
Actual Points:
Parent ID: #25925
Points:
Reviewer:
Sponsor:

Description

If sbws isn't measuring a relay, either:

  • the relay is misconfigured, or
  • there is a bug in sbws (like #28519)

We should decide how long it should take sbws to measure a relay. Then we should create a list of the relays that aren't measured by sbws.

When we work out why a relay isn't being measured by sbws, we should contact the operator (or fix the bug).
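
The two steps above (pick a deadline, then list the relays that missed it) can be sketched as follows. This is only an illustration, not sbws code: the function name `unmeasured_relays`, its parameters, and the idea of keeping a fingerprint-to-timestamp mapping are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def unmeasured_relays(consensus_fps, last_measured, max_age, now=None):
    """Return consensus fingerprints with no measurement newer than max_age.

    consensus_fps: iterable of relay fingerprints seen in the consensus
    last_measured: dict mapping fingerprint -> datetime of last measurement
    max_age:       timedelta; the agreed deadline for measuring a relay
    (All three inputs are hypothetical; sbws may store this differently.)
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    missing = []
    for fp in sorted(consensus_fps):
        measured_at = last_measured.get(fp)
        # Never measured, or measured too long ago: report it.
        if measured_at is None or measured_at < cutoff:
            missing.append(fp)
    return missing
```

The resulting list is what an operator would then go through to decide whether a relay is misconfigured or whether sbws has a bug.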

Child Tickets

Ticket | Status | Owner | Summary | Component
#28563 | closed | juga | Work out how sbws can report excluded relays in the bandwidth file | Core Tor/sbws
#28564 | closed |  | When a relay is excluded, add an exclusion reason to the relay's bandwidth file line | Core Tor/sbws
#28565 | closed | juga | Report excluded results in a relay's bandwidth line | Core Tor/sbws
#28566 | closed | juga | Report relays that sbws sees in the consensus, but never chooses to test | Core Tor/sbws
#28567 | closed | juga | Report relays that sbws wants to test, but the test doesn't work | Core Tor/sbws
#28652 | closed | juga | When sbws stops making progress, log a warning | Core Tor/sbws
#29591 | closed |  | Add KeyValues to monitor relays that are not measured | Core Tor/sbws
#29754 | closed | teor | Include new monitoring KeyValues in the bandwidth-file-spec | Core Tor/Tor
#29775 | closed | juga | Document the new bandwidth file keys added in sbws 1.1.0 | Core Tor/sbws
#29854 | closed | juga | Missing diagnostic keys in relay lines, but the data is in the header | Core Tor/sbws

Change History (16)

comment:1 Changed 10 months ago by teor

Parent ID: #22453 → #25925

comment:2 Changed 10 months ago by juga

I'm using the same pad you created for #28563 to also think about the KeyValues we would need and where they could be obtained from, which is related to the child tickets.

comment:3 in reply to:  2 Changed 10 months ago by teor

Replying to juga:

> I'm using the same pad you created for #28563 to also think about the KeyValues we would need and where they could be obtained from, which is related to the child tickets.

I want to do an initial release with a few extra keys. When we look at the results, we can decide which failures we need to focus on.

Here are the questions I want to answer in the initial release:

  1. Is sbws getting enough consensuses? Is this relay missing from consensuses?
  2. Is sbws putting enough relays in priority lists? Is this relay in enough priority lists?
  3. Is sbws measuring enough relays? Is this relay getting measured enough?
  4. Is sbws including enough results? Does this relay have enough included results?
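
The four questions follow a relay through one pipeline: consensus, priority list, measurement, included result. A minimal sketch of that idea, assuming one counter per stage, is below. The `RelayCounts` fields and `first_failing_stage` are hypothetical names for this example and are not the keys that sbws ended up using.

```python
from dataclasses import dataclass


@dataclass
class RelayCounts:
    """Hypothetical per-relay counters, one per question above."""
    consensuses: int       # consensuses that listed the relay
    priority_lists: int    # priority lists that contained the relay
    measurements: int      # measurement attempts for the relay
    included_results: int  # results that were kept for the relay


# Stages in pipeline order: a relay cannot pass a later stage
# without first passing the earlier ones.
STAGES = ("consensuses", "priority_lists", "measurements", "included_results")


def first_failing_stage(counts, minimum=1):
    """Return the first stage whose counter is below minimum, or None."""
    for stage in STAGES:
        if getattr(counts, stage) < minimum:
            return stage
    return None
```

Reporting only the first failing stage keeps the good case and the bad case easy to tell apart: `None` means the relay made it all the way through.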

Here are my draft key rules:

  1. Each new key should add new information that can't be calculated from other keys.
  2. Keys that contain repeated information should be in the header.
  3. The good case and the bad case should be easy for humans to see.
  4. Add a small number of keys to identify problems. Then add more keys to diagnose the problems.

I also have another rule just for this ticket:

We should ignore everything that is older than 5 days, because anything older than 5 days can't affect the result. It will be easier to see recent issues if we exclude old data.
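
As a rough illustration of the 5-day rule, the sketch below drops timestamped events that fall outside the window before anything is counted. The `(timestamp, payload)` tuple shape and the name `recent_only` are assumptions for this example only.

```python
from datetime import datetime, timedelta, timezone

# The 5-day window comes from the rule above.
MAX_AGE = timedelta(days=5)


def recent_only(events, now, max_age=MAX_AGE):
    """Keep (timestamp, payload) events that are at most max_age old."""
    cutoff = now - max_age
    return [(ts, payload) for ts, payload in events if ts >= cutoff]
```

Applying the filter once, before computing any counters, means every key reports on the same window.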

comment:4 Changed 10 months ago by juga

The KeyValues that depend on the scanner (not the generator) would only be written once a day, so we would need to look at the bandwidth files generated at 00:35 UTC. Any potential problem with that?

comment:5 in reply to:  4 ; Changed 10 months ago by teor

Replying to juga:

> The KeyValues that depend on the scanner (not the generator) would only be written once a day, so we would need to look at the bandwidth files generated at 00:35 UTC. Any potential problem with that?

Why can't we report those values every hour, for the last hour?

comment:6 Changed 10 months ago by teor

Keywords: sbws-1.0-must-moved-20181128 added
Milestone: sbws 1.0 (MVP must) → sbws 1.0.4

Moving all sbws 1.0 must planning and feature tickets to 1.0.4.

comment:7 Changed 10 months ago by teor

Milestone: sbws 1.0.4 → sbws 1.1

Milestone renamed

comment:8 Changed 10 months ago by teor

Milestone: sbws 1.1 → sbws: 1.1.x

Milestone renamed

comment:9 Changed 10 months ago by teor

Milestone: sbws: 1.1.x → sbws: 1.1.x-final

Milestone renamed

comment:10 in reply to:  5 ; Changed 10 months ago by juga

Replying to teor:

> Replying to juga:
>
> > The KeyValues that depend on the scanner (not the generator) would only be written once a day, so we would need to look at the bandwidth files generated at 00:35 UTC. Any potential problem with that?
>
> Why can't we report those values every hour, for the last hour?

Because if we want to report them when running generate, generate reads the files that the scanner produces, which are created only once a day.
There are two solutions to this:

  1. Change the callback that dumps the result files so that it runs every hour.
    • pro: easy change
    • con: the new KeyValues we want to add would first need to go into the results files (which means creating new error types for them), and then be read by generate
  2. generate could be another thread that runs every hour, instead of a separate process, so that it can access the results without reading them back from the results files.
    • pro: eliminates the need to run an external command, and to write the results files first and then read them back
    • con: bigger change

I'm a bit more inclined toward 2, because that would ease further refactorings, such as:

  • not having all bandwidth values triplicated in v3bwfile, relaylist and resultdump. I can explain more about this
  • not having to create new ResultError classes to monitor the relays

comment:11 in reply to:  10 ; Changed 10 months ago by teor

Replying to juga:

> Replying to teor:
>
> > Replying to juga:
> >
> > > The KeyValues that depend on the scanner (not the generator) would only be written once a day, so we would need to look at the bandwidth files generated at 00:35 UTC. Any potential problem with that?
> >
> > Why can't we report those values every hour, for the last hour?
>
> Because if we want to report them when running generate, generate reads the files that the scanner produces, which are created only once a day.

How does generate create a new bandwidth file every hour, if the scanner only dumps results once a day?
Is there a document that explains this design?

> There are two solutions to this:
>
>   1. Change the callback that dumps the result files so that it runs every hour.
>     • pro: easy change
>     • con: the new KeyValues we want to add would first need to go into the results files (which means creating new error types for them), and then be read by generate

  • pro: generate can produce accurate results, even if the scanner crashes or is restarted during the day

>   2. generate could be another thread that runs every hour, instead of a separate process, so that it can access the results without reading them back from the results files.
>     • pro: eliminates the need to run an external command, and to write the results files first and then read them back
>     • con: bigger change

  • con: if sbws restarts, some results for the day are lost
  • con: is this a breaking change? Does the command-line interface to generate change?

> I'm a bit more inclined toward 2, because that would ease further refactorings, such as:
>
>   • not having all bandwidth values triplicated in v3bwfile, relaylist and resultdump. I can explain more about this

Why is this a problem?

>   • not having to create new ResultError classes to monitor the relays

sbws should make it easy to add new keys. If it's not easy, we should re-design the code so it is easier.

Here's what I think:

  • sbws needs to persist the results every hour, so that it can read them after a restart or crash. Otherwise, we lose a day of data every time sbws restarts.
  • as long as sbws can resume from the last hour's results, the implementation doesn't matter. Do the easy, simple thing.

comment:12 in reply to:  11 Changed 10 months ago by juga

Replying to teor:

> >   2. generate could be another thread that runs every hour, instead of a separate process, so that it can access the results without reading them back from the results files.
> >     • pro: eliminates the need to run an external command, and to write the results files first and then read them back
> >     • con: bigger change
>
>   • con: if sbws restarts, some results for the day are lost

Results can also be written as they are obtained.

>   • con: is this a breaking change? Does the command-line interface to generate change?

The generate command could still exist; there would just be no need for it if the scanner command also does the same work.

> > I'm a bit more inclined toward 2, because that would ease further refactorings, such as:
> >
> >   • not having all bandwidth values triplicated in v3bwfile, relaylist and resultdump. I can explain more about this
>
> Why is this a problem?

It's currently just a bit hard to modify and maintain.

> >   • not having to create new ResultError classes to monitor the relays
>
> sbws should make it easy to add new keys. If it's not easy, we should re-design the code so it is easier.

OK, I will see what the easier way is.

> Here's what I think:
>
>   • sbws needs to persist the results every hour, so that it can read them after a restart or crash. Otherwise, we lose a day of data every time sbws restarts.

My mistake here: the results are written as they are obtained: https://gitweb.torproject.org/sbws.git/tree/sbws/core/scanner.py#n353.

So there's no need for 1, and 2 could be an improvement, but I think it's not needed so far.
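
For readers unfamiliar with the pattern, "written as they are obtained" can be sketched as an append-only file with one JSON object per line, which a separate process can read back at any time. This is a generic illustration and is not copied from the linked scanner.py; the function names and the result fields are assumptions.

```python
import json
import os


def append_result(path, result):
    """Append one result as a JSON line as soon as it is obtained."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(result) + "\n")
        f.flush()
        # Ask the OS to write the line to disk, so a crash loses little.
        os.fsync(f.fileno())


def load_results(path):
    """Read back all complete results; tolerate a truncated last line."""
    if not os.path.exists(path):
        return []
    results = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                results.append(json.loads(line))
            except json.JSONDecodeError:
                # A line cut short by a crash is skipped, not fatal.
                continue
    return results
```

With this layout, a restart costs at most the result that was in flight, which is why hourly dumps (option 1) are not needed for crash recovery.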

comment:13 Changed 7 months ago by gaba

Owner: set to juga
Status: new → assigned

comment:14 Changed 7 months ago by juga

Keywords: sbws-11x-final-removed-20190312 sbws-110-proposed changes-version-minor added
Milestone: sbws: 1.1.x-final → sbws: 1.1.0

Move tickets that imply a minor version change to 1.1.0.

comment:15 Changed 6 months ago by juga

Cc: pastly juga removed
Resolution: fixed
Status: assigned → closed

All children are closed now.

comment:16 Changed 5 months ago by teor

Milestone: sbws: 1.1.0 → sbws: 1.1.x-final

Move sbws 1.1.0 tickets into 1.1.x-final, to match Tor's milestone scheme.
