Opened 7 months ago

Last modified 3 months ago

#29710 new defect

sbws reports fewer relays than torflow

Reported by: starlight
Owned by:
Priority: Medium
Milestone: sbws: unspecified
Component: Core Tor/sbws
Version: sbws: unspecified
Severity: Normal
Keywords:
Cc: juga, pastly
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description (last modified by teor)

Is it acceptable that SBWS consistently reports 6200 relays, 1000 fewer than Torflow's 7200 universe?

This question should be answered. #28547 is the closest ticket, but it does not directly address the issue.

Edit: sbws reports running relays, but torflow reports measured relays

Child Tickets

Ticket | Status | Owner | Summary | Component
#30226 | new | | Work out why 4% of sbws measurements are excluded due to errors | Core Tor/sbws
#30227 | new | | Work out why 3% of sbws measurements are excluded because the relay only has 1 measurement | Core Tor/sbws
#30228 | new | | Work out why 1% of sbws measurements are excluded because they are all in the same day | Core Tor/sbws
#30230 | new | | Work out what sbws is doing in its measurement threads | Core Tor/sbws
#30719 | new | | Work out why 90% of sbws measurements fail | Core Tor/sbws
#30723 | new | | Work out why no measurements are excluded because they are too old | Core Tor/sbws
#30724 | new | | Work out why relay_in_recent_consensus_count is 13 days for some relays | Core Tor/sbws
#30725 | new | | Does "success" record relay measurements older than 5 days? | Core Tor/sbws
#30727 | new | | Make sbws vote for all measured relays, even if they are not Running / not in the consensus | Core Tor/sbws

Attachments (5)

sbws_not_reported_vs_torflow_20190309.txt (47.7 KB) - added by starlight 7 months ago.
sbws_bastet_not_longclaw.txt (12.3 KB) - added by starlight 7 months ago.
sbws_longclaw_not_bastet.txt (16.6 KB) - added by starlight 7 months ago.
sbws_bastet_not_reported_vs_torflow_maatuska.txt (47.9 KB) - added by starlight 7 months ago.
sbws_bastet_not_reported_vs_torflow_maatuska_and_in_consensus.txt (30.0 KB) - added by starlight 7 months ago.

Change History (24)

comment:2 Changed 7 months ago by starlight

Adding further: SBWS bastet reports 300 relays not reported by SBWS longclaw, which in turn reports 400 relays not reported by SBWS bastet. The longclaw and bastet votes are from the same consensus:

https://collector.torproject.org/recent/relay-descriptors/votes/2019-03-09-17-00-00-vote-23D15D965BC35114467363C165C4F724B64B4F66-31CFF7525A8A437FFD171E5AF2447C5FC959FD00
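
A comparison like this one can be reproduced with a small script. The following is only a sketch, assuming each vote has been downloaded as plain text and that a relay counts as "reported" when its "w" line carries a Measured= value; the function names are hypothetical and this is not the script used for the attachments.

```python
def measured_relays(vote_text):
    """Return identities of relays whose "w" line includes Measured=.

    Assumes the vote layout from dir-spec: an "r" line starts each relay
    entry (third field is the identity) and a later "w" line holds the
    bandwidth values for that entry.
    """
    measured = set()
    current = None
    for line in vote_text.splitlines():
        if line.startswith("r "):
            parts = line.split()
            current = parts[2] if len(parts) > 2 else None
        elif line.startswith("w ") and current is not None:
            if any(item.startswith("Measured=") for item in line.split()[1:]):
                measured.add(current)
    return measured


def compare_votes(vote_a, vote_b):
    """Return (relays only in A, relays only in B)."""
    a = measured_relays(vote_a)
    b = measured_relays(vote_b)
    return a - b, b - a
```

Calling compare_votes(bastet_text, longclaw_text) would give the two one-sided differences; their sizes correspond to the "300" and "400" figures above only if the same definition of "reported" was used.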

comment:3 Changed 7 months ago by starlight

Looking at the maatuska vote in the same 17:00 UTC consensus, SBWS bastet lists 1165 fewer relays. After filtering to relays in the live consensus at the time, the deficit is 723 relays, comprising 2.5% of overall consensus weight.
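
The relay count and weight share of such a deficit can be computed as follows. This is a sketch under stated assumptions: consensus_weights is a hypothetical mapping from relay identity to consensus weight, and reported is the set of identities a scanner lists; neither name comes from sbws or torflow.

```python
def deficit(consensus_weights, reported):
    """Return (number of missing relays, their share of consensus weight).

    Only relays present in the consensus are considered, so identities in
    `reported` that are not in the consensus are ignored.
    """
    missing = {
        fp: weight
        for fp, weight in consensus_weights.items()
        if fp not in reported
    }
    total = sum(consensus_weights.values())
    share = sum(missing.values()) / total if total else 0.0
    return len(missing), share
```

A share of 0.025 would correspond to the 2.5% quoted above.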

comment:4 Changed 7 months ago by juga

Correct version

comment:5 Changed 7 months ago by juga

We are aware of that; this is more or less a duplicate of #28355, though that ticket's subject might not be clear.
I think the decision was to wait until #28547 is implemented, to confirm the reasons why sbws is reporting fewer relays.

comment:6 Changed 7 months ago by starlight

It is aggressive to deploy more than one scanner in production before this issue is decisively resolved. Certainly the state of a second, non-publishing scanner can be quality assured without pushing results to the authorities.

comment:7 Changed 6 months ago by teor

Milestone: sbws: unspecified

Moving sbws tickets without a milestone to sbws: unspecified.

comment:8 Changed 6 months ago by teor

Milestone: sbws: unspecified
Status: new → needs_information

Hi, we deployed sbws 1.1.0 to bastet and longclaw.
sbws 1.1.0 fixes some bugs, and adds extra error reporting to the bandwidth file.
(It generates file format version 1.4.0.)

Can you please re-do your analysis with the latest votes and bandwidth files from longclaw?

The bandwidth file for longclaw is available at:
http://199.58.81.140/tor/status-vote/next/bandwidth

sbws now reports all of the relays in the bandwidth file.
Some relays will be excluded from the vote; their bandwidth lines contain "vote=0".
https://gitweb.torproject.org/torspec.git/tree/bandwidth-file-spec.txt#n886
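
To redo the analysis against the new file format, the relay lines can be split by their "vote" value. The sketch below assumes relay lines are space-separated key=value pairs that include node_id=, and treats every other line as header; the function name is hypothetical.

```python
def split_by_vote(bandwidth_text):
    """Return (node_ids to be voted on, node_ids marked vote=0)."""
    voted, excluded = [], []
    for line in bandwidth_text.splitlines():
        fields = dict(kv.split("=", 1) for kv in line.split() if "=" in kv)
        if "node_id" not in fields:
            continue  # timestamp, header KeyValue line, or terminator
        if fields.get("vote") == "0":
            excluded.append(fields["node_id"])
        else:
            voted.append(fields["node_id"])
    return voted, excluded
```

Comparing len(voted) with the relay count in a torflow-based vote gives the same kind of deficit as in the original report, while excluded shows which relays sbws measured but withheld.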

comment:9 Changed 6 months ago by nickm

Milestone: sbws: unspecified

comment:10 Changed 6 months ago by juga

What I've been observing for some months, and is now public:

recent_measurements_excluded_error_count=763
recent_measurements_excluded_few_count=733
recent_measurements_excluded_near_count=259

The last two account for the ~1000 missing relays.
The scanner takes around 48h (I was wrong with my 24h estimate) to measure every unique relay in the consensus, so it takes 4 days for each relay to have at least 2 measurements (and not be excluded as "few"), and we only consider the last 5 days.
Fewer relays would be excluded if we accepted a single measurement as valid or considered more days of measurements.
I don't have an explanation yet for the "near" relays, which get measured again in less than 24h.
A different thing, for which I'll open a ticket as soon as I confirm it, is that the number of consensuses in which a relay has been seen seems to be only 1, which doesn't make sense.
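
The arithmetic behind this comment can be written out. This is only a worked example using the figures quoted above; the variable and function names are made up for illustration and are not sbws settings.

```python
# Counts quoted from the bandwidth file header in this comment.
excluded_few = 733   # relays with too few measurements
excluded_near = 259  # relays whose measurements are too close together

# Timing figures from this comment.
hours_per_pass = 48    # time to measure every relay once
min_measurements = 2   # measurements needed to avoid the "few" exclusion
window_days = 5        # days of measurements that are considered


def days_to_collect(hours_per_pass, min_measurements):
    """Days needed before a relay can have the minimum number of measurements."""
    return hours_per_pass * min_measurements / 24


if __name__ == "__main__":
    needed = days_to_collect(hours_per_pass, min_measurements)
    print("few + near:", excluded_few + excluded_near)
    print("days needed:", needed)
    print("slack in window (days):", window_days - needed)
```

The sum is 992, which matches the roughly 1000 missing relays, and the 4 days needed leave only 1 day of slack in the 5-day window.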

comment:11 in reply to:  10 ; Changed 6 months ago by teor

It looks like there might be a few bugs here.
Let's discuss each bug in a separate ticket.

Replying to juga:

> What I've been observing for some months, and is now public:
>
> recent_measurements_excluded_error_count=763
> recent_measurements_excluded_few_count=733
> recent_measurements_excluded_near_count=259

Here's a ticket for the errors: #30226.

> The last two account for the ~1000 missing relays.
> The scanner takes around 48h (I was wrong with my 24h estimate) to measure every unique relay in the consensus, so it takes 4 days for each relay to have at least 2 measurements (and not be excluded as "few"), and we only consider the last 5 days.
> Fewer relays would be excluded if we accepted a single measurement as valid or considered more days of measurements.

Here's a ticket for the few: #30227.
I will write more on that ticket.

> I don't have an explanation yet for the "near" relays, which get measured again in less than 24h.

Here's a ticket for the near: #30228.

> A different thing, for which I'll open a ticket as soon as I confirm it, is that the number of consensuses in which a relay has been seen seems to be only 1, which doesn't make sense.

What's the ticket number?

comment:12 Changed 6 months ago by teor

Summary: Is it acceptable that SBWS consistently reports 6200 relays, 1000 fewer than Torflow's 7200 universe? → sbws reports 6200 relays, 1000 fewer than Torflow's 7200

Changing title for readability

comment:13 Changed 5 months ago by teor

Hi, the answer to this question is:

sbws only reports bandwidths for Running relays, but torflow reports bandwidths for all relays it has recently measured.

Here is the evidence:

longclaw's bandwidth file says:

1559468088
number_consensus_relays=6552
number_eligible_relays=6328
percent_eligible_relays=97

http://199.58.81.140/tor/status-vote/next/bandwidth

moria1's bandwidth file says:

$ curl http://128.31.0.34:9131/tor/status-vote/next/bandwidth | wc -l
8955

consensus health says:

faravahar 	7490 total 	6576 Running
consensus 			6603 Running

https://consensus-health.torproject.org/#numberofrelays
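
The longclaw header figures quoted above can be cross-checked with a few lines of Python. This is a sketch that assumes the header is a timestamp line followed by key=value lines, and that the percentage is the rounded ratio of eligible to consensus relays; the rounding rule is a guess, not taken from the sbws source, and the function names are hypothetical.

```python
def parse_header(text):
    """Collect key=value lines into a dict, skipping lines without a key."""
    header = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key:
            header[key] = value
    return header


def recomputed_percent(header):
    """Recompute the eligible percentage from the two relay counts."""
    eligible = int(header["number_eligible_relays"])
    in_consensus = int(header["number_consensus_relays"])
    return round(100 * eligible / in_consensus)


LONGCLAW_HEADER = """1559468088
number_consensus_relays=6552
number_eligible_relays=6328
percent_eligible_relays=97
"""

if __name__ == "__main__":
    header = parse_header(LONGCLAW_HEADER)
    print(recomputed_percent(header), header["percent_eligible_relays"])
```

With these numbers the recomputed value is 97, agreeing with percent_eligible_relays. Note that the moria1 line count above includes header lines, so it is not directly comparable to number_eligible_relays.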

Our next step is to fix the 90% failure rate in #30719. That might improve longclaw's measurement rate above 97%.

comment:14 Changed 5 months ago by teor

Description: modified (diff)
Status: needs_information → new
Summary: sbws reports 6200 relays, 1000 fewer than Torflow's 7200 → sbws reports running relays, but torflow reports measured relays

comment:15 Changed 5 months ago by starlight

Torflow's approach on this is correct. Relays may go up and down due to maintenance, network outages and other normal occurrences. Authorities detect the state of relays and include them in the consensus as appropriate. Bandwidth scanners should report what they know regardless of current run state. The negative consequence of omitting down relays is a delay in proper rating when they come back online.

comment:16 in reply to:  15 Changed 5 months ago by teor

Replying to starlight:

> Torflow's approach on this is correct. Relays may go up and down due to maintenance, network outages and other normal occurrences. Authorities detect the state of relays and include them in the consensus as appropriate. Bandwidth scanners should report what they know regardless of current run state. The negative consequence of omitting down relays is a delay in proper rating when they come back online.

Yes, I agree. We'll fix this issue in #30727. We need to keep at least 3 torflow instances until this is fixed.

comment:17 Changed 5 months ago by teor

Description: modified (diff)
Summary: sbws reports running relays, but torflow reports measured relays → sbws reports fewer relays than torflow

comment:18 in reply to:  11 Changed 4 months ago by juga

Replying to teor:

> A different thing, for which I'll open a ticket as soon as I confirm it, is that the number of consensuses in which a relay has been seen seems to be only 1, which doesn't make sense.
>
> What's the ticket number?

I don't know what I was thinking when I wrote this, but I didn't open any ticket. Relays seen in only 1 consensus make sense, for example relays that just joined the network or are up again after sbws lost their previous data.

comment:19 Changed 3 months ago by pastly

Cc: pastly added