Opened 3 weeks ago

Last modified 7 days ago

#30733 needs_revision defect

sbws does not detect changes in descriptor bandwidth values

Reported by: starlight Owned by: juga
Priority: Very High Milestone: sbws: 1.1.x-final
Component: Core Tor/sbws Version: sbws: 1.1.0
Severity: Critical Keywords: sbws-majority-blocker
Cc: juga Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by teor)

See attached files. Relay powerlay lowered max advertised bandwidth late 5/23. SBWS longclaw never recognized the change. SBWS bastet saw the value but was started after the change. Torflow recognized the change instantly.

Have observed that SBWS, when it makes note of max advertised bandwidth, averages the value. This value should not be averaged, the current value whatever it is apples.

Child Tickets

TicketStatusOwnerSummaryComponent
#30747needs_reviewUnclear check for skipping scaling due to missing bandwidths or missing descriptorsCore Tor/sbws
#30908closedteorrace condition: new_relays_dict is modified during iterationCore Tor/sbws
#30909needs_reviewteorsbws consensus timestamp updates incorrectly use the current timeCore Tor/sbws

Attachments (5)

powerlay_longclaw_series.txt (26.2 KB) - added by starlight 3 weeks ago.
powerlay_bastet_series.txt (14.6 KB) - added by starlight 3 weeks ago.
powerlay_maatuska_series.txt (25.8 KB) - added by starlight 3 weeks ago.
sbws_longclaw_30733_fail.txt (206.3 KB) - added by starlight 2 weeks ago.
list of relays where longclaw descriptor differs more than 7% from consensus
longclaw_sbws_unchanged_bw_descriptors.txt (276.9 KB) - added by starlight 12 days ago.

Download all attachments as: .zip

Change History (37)

Changed 3 weeks ago by starlight

Changed 3 weeks ago by starlight

Attachment: powerlay_bastet_series.txt added

Changed 3 weeks ago by starlight

comment:2 Changed 3 weeks ago by starlight

Have observed that SBWS, when it makes note of max advertised bandwidth, averages the value. This value should not be averaged, the current value whatever it is apples.

Possibly strike this. What I may be seeing are stale max advertised values so old they seem like averages related to recent history.

comment:3 Changed 3 weeks ago by starlight

Version: sbws: unspecifiedsbws: 1.1.0

comment:4 Changed 3 weeks ago by teor

Description: modified (diff)
Keywords: sbws-majority-blocker added
Milestone: sbws: 1.1.x-final
Priority: MediumVery High
Severity: NormalCritical
Summary: SBWS logic related to max advertised bandwidth is brokenSBWS is using max advertised bandwidth from 5 day old descriptors

Hi, when you log bugs like this, please show us:

  • the full torflow and sbws bandwidth file lines,
  • the exact lines that show the bug, and
  • the relevant values in your torrc.

Explain these bugs like we don't know what you're talking about.

If you don't provide good explanations, we don't know how serious the bug is. So we leave it for days or weeks until we have time to look at the details.

Here's a detailed explanation for this bug:

The MaxAdvertisedBandwidth was changed around 2019-05-23 00:30? to a lower value.

longclaw is running sbws. It shows no bandwidth change on 05-23, but does show a bandwidth change on 05-28, 5 days after the change. (sbws results expire after 5 days.)

bw=56000 desc_bw_avg=1073741824 time=2019-05-22T18:46:34
bw=62000 desc_bw_avg=1073741824 time=2019-05-24T00:58:14
bw=62000 desc_bw_avg=1073741824 time=2019-05-27T20:17:40
bw=47000 desc_bw_avg=1073741824 time=2019-05-29T02:18:34

bastet and maatuska were running torflow. They change immediately, or after a few hours:

bastet:

bw=131000 updated_at=2019-05-23T00:42:57 pid_bw=86361445
bw=38300 updated_at=2019-05-23T17:05:28 pid_bw=86361445

maatuska:

bw=98300 updated_at=2019-05-21T02:33:18 pid_bw=91584752
bw=76700 updated_at=2019-05-23T06:07:45 pid_bw=91584752

Or possibly

bw=79000 updated_at=2019-05-23T06:07:45 pid_bw=91584752
bw=65900 updated_at=2019-05-25T00:35:22 pid_bw=91584752

bastet then switched to sbws, which appears to have picked up the MaxAdvertisedBandwidth value straight away:

bw=48700 updated_at=2019-05-25T09:47:17 pid_bw=86361445
bw=58000 desc_bw_avg=36700160 time=2019-05-27T23:51:27

This bug blocks deployment of sbws to more than half the bandwidth authorities.

We might be able to diagnose this bug better when we know the answers to these questions:

  • What was the MaxAdvertisedBandwidth value before and after the change?
  • What was the exact time of the change?
  • Do torflow and sbws report times in UTC?
  • What are the full sbws bandwidth file lines around the time of the change, and 5 days after the change?
    • Do we need to add more diagnostics to sbws relay lines?

comment:5 Changed 3 weeks ago by starlight

Stand by it; still true:

"SBWS longclaw never recognized the change"

desc_bw_avg=1073741824
nick=powerlay
node_id=$C381468CDA4A90664F0889A6AE334547AEE4D9B4
time=2019-06-03T05:50:32

What the scanner did was--after the protracted delay of five days--detect lower available bandwidth which resulted as a consequence of the reduced configured maximum, because BandwidthRate was used to establish it rather than MaxAdvertisedBandwidth.

The raw data is all here https://collector.torproject.org/archive/relay-descriptors/bandwidths/

A one-line perl script extracted the attachment data such that it is comprehensible in the context of the issue.

Last edited 3 weeks ago by starlight (previous) (diff)

comment:6 Changed 3 weeks ago by teor

Summary: SBWS is using max advertised bandwidth from 5 day old descriptorsSBWS does not detect some changes in max advertised bandwidth

comment:7 Changed 3 weeks ago by teor

Description: modified (diff)

comment:8 Changed 3 weeks ago by teor

We have already tried to implement this feature twice, in #28588 and GitHub ticket 155.

We need to create better tests for this feature, so we can be sure it is working correctly.

Last edited 2 weeks ago by teor (previous) (diff)

comment:9 Changed 2 weeks ago by starlight

Defect is extremely severe. SBWS should be removed from production until this is fixed. Longclaw has thirteen percent of listed relays with incorrect descriptor information more than 7% away from the published consensus.

Changed 2 weeks ago by starlight

list of relays where longclaw descriptor differs more than 7% from consensus

comment:10 Changed 2 weeks ago by starlight

correction: neglected to take absolute value of the difference, longclaw has 33% of relays more than 7% away from the consensus.

comment:11 Changed 2 weeks ago by teor

Summary: SBWS does not detect some changes in max advertised bandwidthSBWS does not detect changes in max advertised bandwidth

comment:12 in reply to:  9 Changed 2 weeks ago by teor

Our acceptance criteria for sbws is a 50% variance from the consensus, because that's what torflow's variance was before we started deploying sbws. See #27339.

Replying to starlight:

Defect is extremely severe. SBWS should be removed from production until this is fixed.

This defect has a limited impact. So we won't remove sbws from production.

Longclaw has thirteen percent of listed relays with incorrect descriptor information more than 7% away from the published consensus.

Why pick 7%?

The total consensus weight of the bandwidth authorities now varies by up to +43%, -14% from the consensus.
https://metrics.torproject.org/totalcw.html

comment:13 Changed 12 days ago by starlight

Summary: SBWS does not detect changes in max advertised bandwidthSBWS does not detect changes in descriptor bandwidth values

between 2019-05-25 16:00 and 2019-06-12 03:00 longclaw sbws shows descriptor bandwidth max/burst/observed for 4138 relays with values that match perfectly, out of 5593 relays present at both times

74% of relays with no update whatsoever to descriptor information for 17 days

Changed 12 days ago by starlight

comment:14 Changed 9 days ago by juga

I realized we don't have these options: https://stem.torproject.org/tutorials/mirror_mirror_on_the_wall.html#can-i-get-descriptors-from-the-tor-process

Looking at the log, found it was removed in version 0.1.1 and never added again: https://github.com/torproject/sbws/commit/816591fa67593d74d4facc63ed202a9cd1cf0d99.

I was quite surprised since we talked about the options several times, so searched for tickets and i didn't find anywhere were we said to remove them: https://github.com/torproject/sbws/issues/146#issuecomment-385974474, https://github.com/torproject/sbws/pull/147#discussion_r185492495, https://github.com/torproject/sbws/issues/144, https://trac.torproject.org/projects/tor/timeline?from=2018-05-02T01%3A56%3A25Z&precision=second.

The way i have work so far is:

  • implement the patch and the tests, but this is hard to test with the current test network we have
  • create PR and set to review
  • wait to get it review
  • release a new version
  • update debian package
  • install the new package in production when it's available (and announce we'll use new version)
  • then finally we can see if the patch is working

So, because it's a long process, i'm just going to add those options to longclaw's sbws, wait ~3 days (the time it takes to measure the whole network) and check if some descriptors' bandwidth changed.

If the descriptors bandwidth change, cool, it worked and i'll create the PR. If not, then i'll comment it and try to figure out what else is wrong.

comment:16 Changed 9 days ago by juga

Summary: SBWS does not detect changes in descriptor bandwidth valuessbws does not detect changes in descriptor bandwidth values

We always referred to the scanner with lower letters.

comment:17 Changed 9 days ago by starlight

perhaps it would be best to not rely on scanning client instances for descriptor information and either a) run a low-limit relay with DirCache=1 along with Fetch* settings expressly for providing descriptors or b) connect directly to authorities or fallback directories when obtaining descriptors

comment:18 in reply to:  14 ; Changed 8 days ago by juga

Status: newneeds_review

Replying to juga:
..

So, because it's a long process, i'm just going to add those options to longclaw's sbws, wait ~3 days (the time it takes to measure the whole network) and check if some descriptors' bandwidth changed.

Because when restarting it, it takes correctly the descriptor bandwidth of powerlay, what i did was to change the advertised bandwidth of a relay i've access to after i restarted sbws (at ), and sbws detected the change at XX

If the descriptors bandwidth change, cool, it worked and i'll create the PR. If not, then i'll comment it and try to figure out what else is wrong.

PR: https://github.com/torproject/sbws/pull/358

I'm not sure whether i should change sbws bugfix version now instead of waiting for more bugfixes and whether i should have change
the one in the longclaw's sbws, even if not released yet.

comment:19 in reply to:  17 ; Changed 8 days ago by juga

Replying to starlight:

perhaps it would be best to not rely on scanning client instances for descriptor information and either a) run a low-limit relay with DirCache=1

This is the default

along with Fetch* settings expressly for providing descriptors

This is what i said in https://trac.torproject.org/projects/tor/ticket/30733?replyto=17#comment:14.
However i don't understand what you mean by "client instances", this configuration affects the sbws' "client instance" to which communicates via socket.

or b) connect directly to authorities or fallback directories when obtaining descriptors

I considered this time ago (https://github.com/torproject/sbws/blob/master/sbws/lib/relaylist.py#L400), but i'm avoiding to do any request not via the Tor network, and stem's remote do that.

comment:20 Changed 8 days ago by juga

Owner: set to juga
Status: needs_reviewassigned

i need to check a couple of things before this get reviewed.

comment:21 in reply to:  18 ; Changed 7 days ago by teor

Replying to juga:

Replying to juga:
..

If the descriptors bandwidth change, cool, it worked and i'll create the PR. If not, then i'll comment it and try to figure out what else is wrong.

PR: https://github.com/torproject/sbws/pull/358

I'm not sure whether i should change sbws bugfix version now instead of waiting for more bugfixes and whether i should have change
the one in the longclaw's sbws, even if not released yet.

In tor, we call development versions "alpha-dev", and include the commit hash in the version.
Let's do the same thing with sbws?
So for example, longclaw should now be 1.1.1-alpha-dev.
See #30899 for a ticket, it is not urgent or important.

comment:22 in reply to:  19 Changed 7 days ago by teor

Replying to juga:

Replying to starlight:

perhaps it would be best to not rely on scanning client instances for descriptor information and either a) run a low-limit relay with DirCache=1

This is the default

juga, sbws runs a client (no ORPort), so the DirCache setting is ignored.

starlight, no, we don't have time to make major changes to the sbws design at this point.

along with Fetch* settings expressly for providing descriptors

This is what i said in https://trac.torproject.org/projects/tor/ticket/30733?replyto=17#comment:14.
However i don't understand what you mean by "client instances", this configuration affects the sbws' "client instance" to which communicates via socket.

This fix is the best way to get all the current relay descriptors on a tor client.

If it doesn't work, then sbws has a bug where it is not storing descriptor information for long enough, and we should fix that bug in sbws.

or b) connect directly to authorities or fallback directories when obtaining descriptors

I considered this time ago (https://github.com/torproject/sbws/blob/master/sbws/lib/relaylist.py#L400), but i'm avoiding to do any request not via the Tor network, and stem's remote do that.

We don't have time to make major changes to the sbws design at this point.
Major changes are high risk, and they may introduce other bugs.
Let's just fix the bugs that we are seeing, with the smallest possible changes.

comment:23 Changed 7 days ago by teor

Status: assignedneeds_revision

I did a review of this pull request: the code looks good, but one of the comments is wrong.

See my comment on the pull request:
https://github.com/torproject/sbws/pull/358/commits/2b3860678595dc8a1f26ab93bc6cef54fe796a7d#r294117174

comment:24 in reply to:  21 ; Changed 7 days ago by juga

Replying to teor:

Replying to juga:

Replying to juga:
..

If the descriptors bandwidth change, cool, it worked and i'll create the PR. If not, then i'll comment it and try to figure out what else is wrong.

PR: https://github.com/torproject/sbws/pull/358

I'm not sure whether i should change sbws bugfix version now instead of waiting for more bugfixes and whether i should have change
the one in the longclaw's sbws, even if not released yet.

In tor, we call development versions "alpha-dev", and include the commit hash in the version.
Let's do the same thing with sbws?
So for example, longclaw should now be 1.1.1-alpha-dev.

Right now sbws is 1.1.1-dev0, since you say that it should include commit hash, should it be 1.1.1-alpha-dev-commithash?

See #30899 for a ticket, it is not urgent or important.

In this case where i also changed the longclaw's sbws, should not i add 1 more commit to this PR to change the version rather than doing so in a different ticket?. And then change also the version in longclaw's sbws to the same in this PR?.

comment:25 in reply to:  24 Changed 7 days ago by teor

Replying to juga:

Replying to teor:

Replying to juga:

Replying to juga:
..

If the descriptors bandwidth change, cool, it worked and i'll create the PR. If not, then i'll comment it and try to figure out what else is wrong.

PR: https://github.com/torproject/sbws/pull/358

I'm not sure whether i should change sbws bugfix version now instead of waiting for more bugfixes and whether i should have change
the one in the longclaw's sbws, even if not released yet.

In tor, we call development versions "alpha-dev", and include the commit hash in the version.
Let's do the same thing with sbws?
So for example, longclaw should now be 1.1.1-alpha-dev.

Right now sbws is 1.1.1-dev0, since you say that it should include commit hash, should it be 1.1.1-alpha-dev-commithash?

I'm not sure if there are python or Debian conventions. If there are, you should follow python or Debian, rather than the conventions that tor made up.
I think 1.1.1-dev0 is fine, because it says "dev".
But that's not what longclaw is showing in its bandwidth file: it says "software_version=1.1.0".

I think the commit hash should be dynamic, and be a different field in the bandwidth file.
I'll add details to #30899.

Just so you know, the tor version looks like:
"Tor 0.2.9.16 (git-9ef571339967c1e5)"

See #30899 for a ticket, it is not urgent or important.

In this case where i also changed the longclaw's sbws, should not i add 1 more commit to this PR to change the version rather than doing so in a different ticket?. And then change also the version in longclaw's sbws to the same in this PR?.

I think longclaw should say "software_version=1.1.1-dev0".

comment:26 in reply to:  20 Changed 7 days ago by juga

Replying to juga:

i need to check a couple of things before this get reviewed.

I removed this in needs_review, because:

i've access to after i restarted sbws (at ), and sbws detected the change at XX

Because i was waiting for the next bandwidth file. Then saw that the descriptor average didn't change, so i need to still check whether i'm doing a mistake configuring the relay bandwidth values or the relay didn't publish the descriptor or sbws didn't not receive it yet.

However, thanks teor for the review.

comment:27 Changed 7 days ago by starlight

in longclaw data between 20190615-05 and 20190617-05 only 11% of relay descriptor tuples match exactly--dramatic improvement, the change appears to correct the issue


my suggestion regarding sourcing descriptors from a relay tor instance rather than client-only instance was intended for longer term (unless the parameter change didn't work) -- the thought is dir-serve relay mode must always maintain a complete set of current descriptors and is less likely to regress than client modes; or sourcing from authorities has the advantage of obtaining descriptors with no propagation delay

documentation on DirCache is confusing, seems to imply relation to dir-server readiness where it really relates to minimizing client memory utilization on very small hardware; what matters is FetchDirInfoEarly=1 FetchUselessDescriptors=1 are in effect activated by relay mode operation

comment:28 in reply to:  27 ; Changed 7 days ago by juga

Replying to starlight:

in longclaw data between 20190615-05 and 20190617-05 only 11% of relay descriptor tuples match exactly--dramatic improvement, the change appears to correct the issue

This improvement might be caused by restarting sbws, not just the Fetch* options i added.

I think also that when changing relaylist to keep the number of consensuses (417ebfa90429d3531e8a5310cf58dff4fc2ff158), the relays objects stop to be replaced in each consensus, but was forgotten to update also the descriptors attributes.

I've restarted longclaw's sbws with this change too (e2ee16e1c42a243dec419b9dcc7555b0621c3e8b) and the current sbws version (1.1.0-dev0). Again need to wait until the relay i changed gets measured, so not putting this in needs_review yet.

So maybe the descriptors where not being fetch with lot of delay, but some Fetch* options will ensure that they are actually recent. After a conversation with arma in irc, i'm just not sure that which Fetch* options we really need.

comment:29 Changed 7 days ago by teor

This code has a typo:
descriptor = c.get_server_descriptor(r.pringerprint,
should be:
descriptor = c.get_server_descriptor(r.fingerprint,

comment:30 in reply to:  28 Changed 7 days ago by teor

Replying to juga:

So maybe the descriptors where not being fetch with lot of delay, but some Fetch* options will ensure that they are actually recent. After a conversation with arma in irc, i'm just not sure that which Fetch* options we really need.

  • we use FetchUselessDescriptors in sbws to make sure tor keeps on downloading descriptors, even if sbws is idle
  • sbws does not need FetchDirInfoExtraEarly, but if we want it to respond to MaxcAdvertisedBandwidth changes as soon as possible, we should use it
  • FetchDirInfoEarly is redundant when FetchDirInfoExtraEarly is set, so we can remove it

Now is a good time to change the comments in #30733, so we know why sbws is using each option, next time someone asks these questions.

comment:31 Changed 7 days ago by teor

Hi Juga,

chutney runs travis tests with all supported python versions between 2.7 and nightly.
Here's a list of their names in Travis:
https://github.com/torproject/chutney/blob/master/.travis.yml#L106

comment:32 Changed 7 days ago by teor

Also, python 3.5's end of life was last month, I think.

Note: See TracTickets for help on using tickets.