Opened 4 years ago

Closed 19 months ago

#17810 closed defect (wontfix)

TorFlow should ignore self-reported bandwidths when measuring relays

Reported by: robgjansen Owned by:
Priority: Medium Milestone:
Component: Core Tor/Torflow Version:
Severity: Normal Keywords:
Cc: dgoulet, aagbsn, starlight@…, s7r, tyseom Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

A relay that self-reports high bandwidth values will get an inflated consensus weight. I believe that TorFlow somehow uses the self-reported values when producing a measurement result for a relay. We should fix TorFlow so that it better handles self-reported values in order to prevent a relay from accidentally or maliciously getting uncharacteristically high consensus weights.

Child Tickets

Attachments (1)

bwscan_cnsns.20180429-1345-plus.xlsx (1.9 MB) - added by starlight 20 months ago.
Torflow + consensus what-if spreadsheet as-of 20180429-1345 -- rev 3


Change History (23)

comment:1 Changed 4 years ago by robgjansen

Component: - Select a component → Torflow
Owner: set to aagbsn

comment:2 Changed 4 years ago by arma

Cc: aagbsn added

A) I wonder how Aaron (cc'ed) is doing at his OTF fellowship project on exactly this topic? Aaron?

B) A helpful workaround in the short-term might be for the bwauths to never increase their weight for a relay by more than some multiple. Then it would take a while for the weights to crank themselves up to crazy numbers, giving folks more of a chance to notice that something is weird.
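
A minimal sketch of the workaround in (B), assuming a bwauth keeps each relay's previous weight around. The function name and the multiple of 2.0 are illustrative assumptions, not anything TorFlow implements:

```python
def limit_increase(previous_weight, proposed_weight, max_multiple=2.0):
    """Clamp a new weight to at most max_multiple times the previous weight."""
    if previous_weight <= 0:
        # No history to compare against; accept the proposed value as-is.
        return proposed_weight
    return min(proposed_weight, int(previous_weight * max_multiple))

# A relay whose proposed weight jumps from 5000 to 2150000 only doubles per round.
weight = 5000
history = []
for _ in range(5):
    weight = limit_increase(weight, 2150000)
    history.append(weight)
print(history)
```

With a cap like this, reaching a very large weight takes many measurement rounds, which is the window for someone to notice.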

comment:3 Changed 4 years ago by starlight

May be related to #16696. I noticed that fallback to self-measure might be applied per-relay rather than globally. This is a casual past observation that I did not re-check for this post.

comment:4 Changed 4 years ago by starlight

Cc: starlight@… added

comment:5 Changed 4 years ago by s7r

Cc: s7r added

How much do advertised weights count for bwauths? What is the algorithm? Obviously bwauths don't (100%) trust/use the self-advertised weights otherwise they wouldn't be called bwauths.

IIRC the bwauths can measure any relay, regardless of their own speed. We host one on a 250 Mbit/s line and it has no problems measuring the fast relays. Maybe we can come up with an algorithm that handles the self-advertised weights smarter.

comment:6 Changed 4 years ago by starlight

If it was self-measure fallback, the question is what happened to the 10,000 maximum?

If not, must be a corner-case Torflow bug since BW usually takes days to ramp.

comment:7 Changed 4 years ago by starlight

Checked archive, definitely a consensus of Torflow results:

PhantomTrain6 2015-12-10

          desc_bw   longclaw  maatuska  moria1    consensus
05:00               -         -         -         Unmeasured=1
06:00     1280M     2030000   -         -         Unmeasured=1
07:00     1280M     2030000   -         -         Unmeasured=1
08:00               1170000   -         -         Unmeasured=1
09:00     1280M     1170000   -         -         Unmeasured=1
10:00               2150000   -         -         Unmeasured=1
11:00     1280M     2150000   -         -         Unmeasured=1
12:00               2150000   -         -         Unmeasured=1
13:00     1280M     2150000   -         -         Unmeasured=1
14:00               2150000   1280000   2150000   2150000
15:00               2150000   1280000   2150000   <missing>
16:00               305000    1280000   2150000   <missing>
17:00               305000    1280000   2150000   <missing>
18:00     1280M     305000    1280000   1900000   1280000

comment:8 Changed 4 years ago by tyseom

Cc: tyseom added

comment:9 Changed 4 years ago by tyseom

How about introducing hard caps as a simple stop-gap solution?

These are currently the biggest CW values of the fastest relays:
221000
197000
174000
164000
160000

So how about a cap at 250000 (=2GBit/s)?
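
A small sketch of the proposed cap and the unit arithmetic behind it, assuming consensus weights are read as roughly kilobytes per second. The names are hypothetical:

```python
CW_CAP = 250000  # proposed hard cap on consensus weight

def kbytes_to_gbit(kbytes_per_sec):
    """Convert KB/s to Gbit/s using decimal units (x8 bits, /1,000,000)."""
    return kbytes_per_sec * 8 / 1_000_000

def apply_cap(weight, cap=CW_CAP):
    """Return the weight unchanged unless it exceeds the cap."""
    return min(weight, cap)

print(kbytes_to_gbit(CW_CAP))  # the cap expressed in Gbit/s
print(apply_cap(2150000))      # the PhantomTrain6 vote would be clamped
print(apply_cap(221000))       # the current fastest relay is unaffected
```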

comment:10 Changed 2 years ago by teor

Priorities and Severities in torflow are meaningless, setting them all to Medium/Normal.

comment:11 Changed 2 years ago by teor

Owner: aagbsn deleted
Status: new → assigned

aagbsn was the default owner, unassigning

comment:12 Changed 2 years ago by teor

Status: assigned → new

Mark all tickets that are assigned to nobody as "new".

comment:13 Changed 20 months ago by starlight

In relation to the SBWS effort I think it makes sense to preview this by running Torflow with the behavior adjustment on Tom Ritter's test scanner. That is of course if Tom doesn't mind and perhaps likes the idea.

The change: Have aggregate.py substitute the average of self-measure bandwidths for the appropriate class in place of self-report bandwidth of individual relays when calculating each final vote. It could even make sense to bias off a constant for each class to improve stability of votes and consensus medians while retaining class voting biases, e.g. 10000 for exits, 9700 for guards, 1200 for middle-only. Or go a bit radical and apply a single constant such as 5000 while folding all relays into a single class with an edit to Node::node_class(). Perhaps try out both approaches.
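
A rough sketch of the substitution, not a patch to aggregate.py. The dictionary keys, function names, and sample relays are assumptions for illustration; only the three constants come from the paragraph above:

```python
from statistics import mean

# Per-class constants suggested above (the "bias off a constant" variant).
CLASS_CONSTANTS = {"exit": 10000, "guard": 9700, "middle": 1200}

def class_averages(relays):
    """Average self-reported bandwidth per relay class."""
    by_class = {}
    for relay in relays:
        by_class.setdefault(relay["cls"], []).append(relay["desc_bw"])
    return {cls: mean(values) for cls, values in by_class.items()}

def vote(relay, ratio, baselines):
    """Scale the class baseline, not the relay's own claim, by its measured ratio."""
    return baselines[relay["cls"]] * ratio

relays = [
    {"name": "a", "cls": "exit", "desc_bw": 12000},
    {"name": "b", "cls": "exit", "desc_bw": 8000},
    {"name": "c", "cls": "middle", "desc_bw": 1000},
    {"name": "d", "cls": "middle", "desc_bw": 1400},
]
averages = class_averages(relays)
print(vote(relays[2], 1.5, averages))         # class-average baseline
print(vote(relays[2], 1.5, CLASS_CONSTANTS))  # constant baseline
```

With class averages, one relay's inflated self-report still shifts its class baseline slightly; with constants it has no effect at all.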

Looking at it now, it seems to me Node::node_class() should return Exit/Guard/Middle and forget the rest, since separate offsets for the different roles a relay can participate under are not calculated. Treating Exit+Guard and Exit (only) as independent bandwidth classes no longer makes sense. A case might still exist for Guard and Middle, since Middle-only relays comprise just over half the relay population, though with an average bandwidth around 12% of each of the Exit and Guard classes. Or just two classes, 'Exit' and 'NonExit', might work better... or one, that is, no classes.

While Torflow votes are unitless, they resemble actual bandwidths owing that they are interpretations of bandwidth measurements taken at each node. Using class-average bandwidths as baselines for calculating votes retains this property. Probably a correct method is to decay-average values in typical manner to mitigate the impact of jitter and drift on effective valuation of older votes before replacement. On the other hand hammering in reasonable constants is expedient for a test and will save some trouble. While I'm on the subject of aging votes, seems to me measuring guard relays less frequently than non-guards is unhelpful and should be binned.

Was recently reading Torflow code and wrote a script approximating Torflow calculations. Am dangerous enough now to write a patch implementing the above.

comment:14 Changed 20 months ago by teor

Thank you for your offer to submit a patch to torflow.
But fixing torflow is not on our roadmap.

We are already running sbws instances on the public tor network.
If we are going to put effort into comparisons, I would like to focus on comparing sbws with the existing torflow instances.
If we are going to put effort into modifying code, I would like to focus on developing sbws.

comment:15 Changed 20 months ago by starlight

Fair enough. Appears to be a terrible idea anyway.

Cooked up the attached spreadsheet and eliminating self-measure does not seem to work. Also tried applying a 20% linear factor to Torflow's progressive-offset vote generation method; perhaps retaining scanner-biased self-advertised bandwidth while de-emphasizing its consensus impact has merit.

This spreadsheet might be useful for brainstorming and what-if analysis.

comment:16 Changed 20 months ago by starlight

replaced sheet with correction for a mistake

Changed 20 months ago by starlight

Torflow + consensus what-if spreadsheet as-of 20180429-1345 -- rev 3

comment:17 Changed 20 months ago by starlight

A consensus parameter already exists for applying a factor to progressive offset calculation: bwauthkp

employed at aggregate.py:120

searched and it does not appear anyone has tried tuning it since

K_p = 1.0
T_i = 0
T_d = 0

were established in

https://trac.torproject.org/projects/tor/ticket/4596#comment:2

https://gitweb.torproject.org/torflow.git/commit/NetworkScanners/BwAuthority/aggregate.py?id=4a4b8a73185f763f0def3e0d30c052f3abeb6fa0

from observing Torflow behavior, it seems to me a K_p of 1.0 is a bit strong
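
To illustrate what tuning K_p would do, here is a simplified sketch of a proportional-only update with T_i and T_d left at 0. This is my approximation of the calculation as described in this ticket, not code copied from aggregate.py, and the function and parameter names are assumptions:

```python
def pid_error(measured_bw, network_avg_bw):
    """Relative offset of a relay's measurement from the network average."""
    return (measured_bw - network_avg_bw) / network_avg_bw

def new_bandwidth(base_bw, measured_bw, network_avg_bw, k_p=1.0):
    """Proportional-only update: shift base_bw by k_p times the relative error."""
    return base_bw + k_p * base_bw * pid_error(measured_bw, network_avg_bw)

# A relay measured 50% above the network average, with a base of 1000.
print(new_bandwidth(1000, 300, 200, k_p=1.0))  # full correction
print(new_bandwidth(1000, 300, 200, k_p=0.2))  # damped correction
```

With K_p = 1.0 the whole measured offset lands on the vote in one step; a smaller K_p applies only a fraction of it.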

comment:18 Changed 20 months ago by robgjansen

> While Torflow votes are unitless, they resemble actual bandwidths owing that they are interpretations of bandwidth measurements taken at each node.

Careful here. I think TorFlow measures something closer to residual bandwidth capacity at the time of the measurement, not the full capacity of the link. And it doesn't even measure residual capacity exactly, because of scheduling and fairness. For example, if my relay is operating at 100% link utilization and TorFlow tries to measure it, TorFlow isn't going to get 0 bandwidth and it isn't going to get 100% bandwidth; TorFlow is probably only going to get roughly 1/N of my bandwidth where N is the number of other active flows.

Or am I misunderstanding and the authorities interpret the measurements differently?

comment:19 Changed 20 months ago by teor

I will enjoy having a bandwidth measurement specification, because then I won't have to ask questions like:

  • when you say "authorities", which part of the bandwidth measurement system are you referring to?

I think all the interpretation is within the bandwidth measurement system, or within tor clients.

Here's a summary of the process:

  1. Torflow measures the available bandwidth at the relay, which is approximately max(current residual bandwidth, available bandwidth / number of current flows)
  2. Torflow converts this figure into kilobytes per second and stores it
  3. Torflow aggregates measurements and self-reported bandwidths to produce a figure that is technically unitless, but is practically kilobytes per second
  4. The authorities read the bandwidths file and put the numbers from the file in their votes
  5. The consensus contains the low-median bandwidth for each relay as the consensus weight
  6. Clients use consensus weights and position weights to choose randomly weighted paths through the network
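
Steps 1 and 5 of the summary above can be sketched as follows. This is a toy model under stated assumptions: the function names are hypothetical, `flows` is assumed to count the measurement flow itself, and the sample votes are the 14:00 and 18:00 rows for PhantomTrain6 from comment:7:

```python
from statistics import median_low

def measured_bandwidth(capacity, in_use, flows):
    """Step 1 approximation: max(residual bandwidth, capacity / number of flows)."""
    residual = max(capacity - in_use, 0)
    fair_share = capacity / max(flows, 1)
    return max(residual, fair_share)

def consensus_weight(votes):
    """Step 5: the low-median of the authorities' votes for one relay."""
    return median_low(votes)

# A fully utilized relay with 10 flows still yields a non-zero measurement.
print(measured_bandwidth(capacity=1000, in_use=1000, flows=10))

# Votes from longclaw, maatuska, and moria1 at 14:00 and 18:00.
print(consensus_weight([2150000, 1280000, 2150000]))
print(consensus_weight([305000, 1280000, 1900000]))
```

The low-median means a single outlying vote cannot set the consensus weight on its own, but two out of three agreeing votes can.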

comment:20 in reply to:  18 Changed 20 months ago by starlight

Replying to robgjansen:

> While Torflow votes are unitless, they resemble actual bandwidths owing that they are interpretations of bandwidth measurements taken at each node.
>
> Careful here. I think TorFlow measures something closer to residual bandwidth capacity at the time of the measurement, not the full capacity of the link.

Yes, of course.

> And it doesn't even measure residual capacity exactly, because of scheduling and fairness. For example, if my relay is operating at 100% link utilization and TorFlow tries to measure it, TorFlow isn't going to get 0 bandwidth and it isn't going to get 100% bandwidth; TorFlow is probably only going to get roughly 1/N of my bandwidth where N is the number of other active flows.
>
> Or am I misunderstanding and the authorities interpret the measurements differently?

I may not have this perfectly, but it seems to me that Torflow calculates the ratio/percent offset of the measurement for each relay relative to the average of all relay measurements (or all relays handled by the particular scanner, not sure). This value then feeds into the "PID error", which presently is limited to just the "P" (proportional) component and so is in effect a pass-through of the scanner offset. The result is then applied to the self-measure of the node under consideration, thereby mirroring the residual bandwidth offset onto the actual declared bandwidth. That's why Torflow votes somewhat resemble real bandwidth capacities. Votes could just as easily have an arbitrary basis so long as the consensus fractions work out the same, but it's nice to have semi-reasonable values to look at.

The request in this ticket, and one of the stated design goals of SBWS, is to take self-measure out of the voting process, but I am skeptical that this will turn out to be practical. Certainly plugging averages of whole-class (exit/guard/middle) self-measurements in place of per-relay self-measure looks terrible in the spreadsheet. This can be seen by sorting on a hypothetical vote column and glancing over at the Maatuska vote and consensus weight columns. I may try revising it with averages specific to each scanner to see if it helps, but I doubt it.


comment:21 in reply to:  18 Changed 20 months ago by starlight

Replying to robgjansen:

> . . . And it doesn't even measure residual capacity exactly, because of scheduling and fairness. For example, if my relay is operating at 100% link utilization and TorFlow tries to measure it, TorFlow isn't going to get 0 bandwidth and it isn't going to get 100% bandwidth; TorFlow is probably only going to get roughly 1/N of my bandwidth where N is the number of other active flows.

Excellent point! I had not considered this previously.

comment:22 Changed 19 months ago by teor

Resolution: wontfix
Status: new → closed

We won't fix this issue in Torflow.

The conversation continues on this sbws ticket:
https://github.com/pastly/simple-bw-scanner/issues/150
