wiki:org/meetings/2018Rome/Notes/BandwidthAuthorityRequirements

Minimum Requirements

(An initial draft sent to the dir-auth list)

Any bandwidth authority implementation:

  • must measure relay capacity
  • must not rely on self-reported capacities
    • that is, bandwidth authorities can start with the relay's own measurements, but they must use some measurements that are *not* the relay's own measurements
  • must include new relays within a short time
  • must produce results that Tor can parse
  • must update results so they are not stale

Bandwidth authorities that produce votes for the same consensus:

  • must scale results so they are similar to existing results

(This spec deliberately allows for implementations like OnionPerf and PeerFlow. I don't want to require a clone of the current design.)

I tried to keep the list small, because every extra feature delays the implementation.

There are a bunch of other features that are nice, but not essential.

For example:

  • using multiple bandwidth servers

And there are some features that we're not sure if we want.

For example:

  • limiting capacity in some countries, ASes, ...

We can list these extra features in the spec, but I don't want to spend much time on them.

Scaling

Express each weight as a proportion of the total, and multiply by some agreed total (e.g. for the current network it would have to be the total consensus weight, but kept within some limited range to avoid unbounded growth).

See the graphs in #25459: https://trac.torproject.org/projects/tor/attachment/ticket/25459/totalcw-2018-03-11a.png
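The scaling rule above can be sketched as follows. This is an illustration only: the function name, the agreed total, and the clamping range are assumptions, not agreed values.

```python
def scale_weights(measured, agreed_total, min_total=1, max_total=10**9):
    """Express each measured weight as a proportion of the measured total,
    then multiply by an agreed network total. The agreed total is clamped
    to a limited range to avoid unbounded growth."""
    agreed_total = max(min_total, min(agreed_total, max_total))
    measured_total = sum(measured.values())
    return {relay: w / measured_total * agreed_total
            for relay, w in measured.items()}

# Three relays whose raw measurements sum to 500, rescaled to sum to 1000.
print(scale_weights({"A": 300, "B": 100, "C": 100}, agreed_total=1000))
# → {'A': 600.0, 'B': 200.0, 'C': 200.0}
```

Because only the proportions matter, two scanners that measure in different units produce the same scaled weights.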

Agenda items

  • remember the list of current and future implementations
  • teor's minimum design requirements (review wiki page)
    • essential
    • really useful
    • interesting
    • controversial
  • torflow
  • bwscanner
  • bridge bwscanner
  • Nick wants a plan for next steps, and a clear roadmap.
  • people with history:
    • juga (bwscanner)
    • Linus (bwscanner)
    • pastly (in 2 months, will start work)
    • Micah (keeping torflow running)

Past bwscanner devs:

  • Donncha
  • Aaron Gibson
  • Tom Ritter

Nick says: the network team is prioritizing helping anyone working on the bwscanner

  • stable
  • measure bandwidth
  • measure at least as well as we do now
  • people should be willing to run it

bandwidth output: one integer per Tor relay, used to weight node selection so that bandwidth is distributed across the network.

we want it to be difficult to game.
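For concreteness, the torflow-style output that the dirauths consume today is roughly a timestamp line followed by one line per relay. The fingerprints and values below are made up; see the bandwidth file spec for the authoritative format:

```
1521390000
node_id=$68A483E05A2ABDCA6DA5A3EF8DB5177638A27F80 bw=760 nick=relayA
node_id=$4F0DB7E687FC7C0AE55C8F243DA8B0EB27FBF1F2 bw=129 nick=relayB
```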

Currently it's not running because it's annoying to run.

  • dependency issues
  • security issues

you don't know when things aren't working until a few days later.

Ian G. says that Kat offered to run it, but no one would take her data.

difficulties deciding whose data to trust.

  • fail early
  • produce meaningful progress/results/diagnostic outputs

  • do we want to be able to generate the "current view" at any given time, or are consecutive reports acceptable?

Ian G. says that we should think about the threat model in terms of making it gamable. Roger says that's probably nice-to-have, but not essential in the near term.

Teor: MUST NOT rely on self-reported capacity, though it can start with those reports

Teor: MUST include new relays within a short time (what is the window?)

Nick: prefer to scan volatile relays; if a relay has been stable, deprioritize it.

Roger: avoid committing to a full batch -- if the scanner is looping, at the top of each loop it could identify specific relays that you want to test.

Paul: in principle, we want to measure between pairs of relays, not specific relays. but we can't do that now.

Nick: peerflow successors are unlikely to be ready in the short term.

MUST have privacy considerations -- it should not measure in such detail that it is effectively a global adversary.

Micah: bandwidth authorities themselves must not be detectable so that they can't be gamed. (nice-to-have)

Roger: resistance-to-gaming in general is a nice-to-have


  • new and shiny vs. cruddy but functional, as an emergency replacement
  • 2/6/8 months is our short-term definition
  • simplicity of measurement
  • should bwscanner output the same as torflow? we _can_ change the dirauths to read something else, but the _results_ don't have to be identical
  • multiple "pipes of measurement" must be able to coexist <- must-have
  • must not slow down the network
  • must not be too expensive to run
  • combination is the _median_; adding up all the numbers and dividing by N is a plausible way to normalise them
  • process of turning bwauth numbers into something consumable by clients
  • nice to have: measuring CPU load, socket availability, RAM pressure
  • are experimental results interesting? Shadow experiments, for example
  • nice to have: geographic spread of load; putting all the load on the computers in .de is not so good
  • volatility
  • how to get the short-term one:
    • patch torflow
    • make bwscanner work
    • the simple thing from scratch
    • bridge bw scanner
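The combine-and-normalise bullets above might look like the following sketch. The function names are ours, and dividing by the mean is one reading of "add up all the numbers and divide by N":

```python
from statistics import median

def normalise(measurements):
    """Divide each relay's number by the mean (total / N), so that output
    from different measurement pipelines becomes comparable."""
    mean = sum(measurements.values()) / len(measurements)
    return {relay: bw / mean for relay, bw in measurements.items()}

def combine(per_scanner):
    """Combine several scanners' normalised views of each relay by taking
    the per-relay median, as the dirauths do when voting."""
    relays = set().union(*per_scanner)
    return {r: median(s[r] for s in per_scanner if r in s) for r in relays}

# Two scanners measuring in different units agree on relative capacity
# once normalised; the median then combines their views per relay.
a = normalise({"A": 3000, "B": 1000})  # e.g. kB/s
b = normalise({"A": 3.2, "B": 0.8})    # e.g. MB/s
print(combine([a, b]))
```

With these inputs, relay A ends up near 1.55 and relay B near 0.45, even though the raw numbers differ by three orders of magnitude.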

who will bell the cat?

if we run multiple bwscanners concurrently then their output measurements need to be compatible -- normalize measurements so that they can be comparable.

MUST: be able to normalize measurements.

Nikita points out that there are feedback loops inherent in this design, since the clients tend to select routes proportionally to the output.

Nick: do we have evidence that the feedback mechanism works in practice?

The earlier approach was to use the previously-measured values as the way to select, but that was abandoned because fast relays were driven to infinity, everything else driven to zero.
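A toy illustration of why that pure-feedback approach diverges (made-up numbers): each round, a relay's "measured" value is proportional to the traffic it carried, which is itself proportional to its current weight share, so each relay's share gets squared per round.

```python
# Four relays with a fixed total weight of 8.0. Feeding measured values
# straight back in as selection weights squares each relay's share every
# round, so the fastest relay is driven to the whole total and the rest
# are driven to zero.
weights = [4.0, 2.0, 1.0, 1.0]
for _ in range(20):
    total = sum(weights)
    weights = [w * (w / total) for w in weights]  # measured ~ share * weight
    s = sum(weights)
    weights = [w / s * 8.0 for w in weights]      # renormalise to fixed total
print(weights)  # → [8.0, 0.0, 0.0, 0.0]
```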

Nice-to-have: measuring CPU, memory, socket availability, etc.

Does adding bandwidth authorities in different geolocations help or harm the consensus?

Aaron Johnson suggests that some experimental results could be helpful: several simple, rough designs.

George suggests experimenting using the test network, or using Shadow.

Nick: if we're testing new things, when are we ready to deploy it on the network?

requirement: must be able to run multiple scanners at once.

nice-to-have: spreads load over the world to get our distributed privacy properties.

Linus suggests that we should fix that problem at another level; if we fold this into bw measurements, we're getting confused.

Roger, Paul, Nikita: we're not actually just measuring bandwidth; it's a mix of latency, diversity, etc. Measurements from Australia will give different results from measurements from Austria.

Rob: client-side changes should be explicitly separated from measuring relays.

Nick: nice-to-have: resolution and volatility should not be greater than necessary. relative stability makes consensus diffs easier to compress.

Iang: the bandwidth authorities shouldn't care about reducing precision -- dirauth consensus operation should be the place for reducing volatility and precision.

Nick: this might mean changing the way consensus is created -- Roger: this is a tor fix, not a bwscanner fix.

Options for moving forward (short term):

  • patch torflow -- Nick says not acceptable
  • make bwscanner a new thing
  • do the simple thing from scratch
  • the "onionperf thing" -- Rob says that onionperf uses stem; it would be better to just use stem directly, since onionperf does a lot of other stuff besides what a bwscanner needs

we are skipping discussion of a bridge scanner because no one here understands it in detail, and George reports that his discussions with isis suggest that the bridge scanner isn't suitable for converting into a bwscanner -- different goals.

general consensus: implementer ought to decide.

Juga: the existing code looks doable, but it seems hard to maintain. I could reuse parts of bwscanner, but would probably be using features of txtorcon and stem.

George: Aaron and Donncha say that the main thing missing is turning the data into the output format.

Juga: the NLnet funding proposal is through the second round; hopefully we'll find out in 2 weeks. I could also try Prototype Fund or OTF. Help/feedback/data from Tor contacts for these proposals would be welcome.

  • network team: teor
  • teor might be available until july, taylor, isis might be available for mentoring
  • Linus: I have experience running bwscanner from GitHub. We've found bugs; they are acknowledged, but they aren't getting fixed. Running it on different systems seems to work, but you don't know when it's done. For the full network it was > 24 hours, probably ~2 days. It doesn't do the very last step. Error handling isn't well-tested. It requires that you're comfortable with Twisted.
  • nick: some parts are well-documented, some are not.
  • we don't know what Aaron and Donncha's situation is
  • George thinks that Aaron is interested, but doesn't have time/funding
  • David: meejah is also interested in helping out, since it grew features during earlier development.
  • Roger: should we try to fix the last few bugs in bwscanner?
  • nick: if it's just the last few bugs, then why isn't bwscanner fixed already? it would be nice if the current people finish it.
  • Roger: so the message to the bwscanner team is we would like if it was finished. in the meantime, we have a plan for the two people to implement it.
  • Juga: do we have a plan/timeline?
  • Nick: better to deploy frequently, even if it is worse than what we want.
  • Linus: Tom, Roger, Cena, Kat, BSD George, Stephanie -- it needs to be deployable by ~5 people, not by hundreds
  • Juga: is it ok if i implement something, rather than fix bwscanner bugs? nick: seems OK
  • Linus: i can provide a well-connected machine to run it from.
  • Roger: it should be run on the real network asap
  • Nick: if it takes a long time to run, run it on a subset of the network so we can look at the results.
  • conversation should happen on IRC and tor-dev.
  • the simple algorithm is probably what we're looking for in the first round.
  • do we need peerflow collaboration? pastly can help out with initial system, but will work on peerflow eventually
  • Rob: how can we prevent a year of work not producing results? what kind of transparency do we need to make this happen?
  • David: lack of communication is the problem.
  • Nick: three network devs could check in regularly.
  • Roger: maybe we need simple near-term milestones:
    • should be a script that produces numbers
    • it should be run on the real network
  • Nick: network team has commitments for this
  • Linus: funding is an issue here. We can't just presume that the funding is in place; we need to keep pressure on it.
  • Nick: keep isa and tommy involved in the funding discussion.
  • Micah and Nick: ensure that there is visible ongoing development
  • Roger: what can we do to help?
  • open questions: what particular algorithm to use? how do we control other variables? do we want relays under our control?
  • how are exits chosen, and where are the files downloaded from? how big are the files that we try to download?
  • current best practice is to measure by time, not by size: fetch over a certain amount of time, throw away the first and last bits, and see how much went through
  • the script that builds paths and fetches data should be separate from the decisions about the parameters of what is fetched and for how long
  • can we use conflux or something like this? it could be a DDoS against the new relays. this could be done selectively against those relays with the unmeasured flag.
  • if we measure the same relay 10 times, what should we do to combine these numbers?
  • self-reported measurements should be treated as a cap, so that we don't overallocate traffic through that.
  • bwscanner needs to provide an actual number, not put that responsibility off to another party.
  • for the short-term thing, we're making educated guesses, not making a robustly tested experiment.
  • Tom Ritter has scripts that show the differences between bw authority data
  • use of these numbers pushes traffic disproportionately to the fast relays -- when a relay hits capacity, it behaves worse
  • Linus: how important is it to keep the slow relays in the network? what does pushing them out do to network diversity?
  • what balance do we want between optimizing performance and maintaining network security?
  • MUST HAVE: must not reduce network diversity by pushing out slow relays
  • if we don't max out our slow relay operators, then we might be discouraging them from increasing capacity.
  • Iang proposes a "proof of storage" protocol, which delegates the bandwidth measurements to little relays, which run in aggregate to measure bigger relays
  • Paul: if we only use the small nodes for measurements, it changes the incentives for relay operators.
  • nice-to-have: would be nice if simple tool can be re-used for bridge measurement
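Three of the bullets above can be sketched together: measuring by time with the first and last samples discarded, combining repeated measurements with the median, and treating self-reported capacity as a cap. All names, and the 10% trim, are assumptions for illustration:

```python
from statistics import median

def throughput(intervals, trim_fraction=0.1):
    """`intervals` is a list of (seconds, bytes) pairs recorded while
    fetching for a fixed amount of time. Throw away the first and last
    slices (ramp-up and tear-down) and report bytes/second on the rest."""
    k = int(len(intervals) * trim_fraction)
    kept = intervals[k:len(intervals) - k] if k else intervals
    return sum(b for _, b in kept) / sum(s for s, _ in kept)

def relay_weight(repeated_measurements, self_reported_capacity):
    """Combine repeated measurements of one relay with the median, and use
    the relay's self-reported capacity as a cap, so we never allocate more
    traffic through it than it claims to handle."""
    return min(median(repeated_measurements), self_reported_capacity)

# A slow warm-up interval and a slow tail interval are trimmed away.
rate = throughput([(1, 100), (1, 900), (1, 1000), (1, 950), (1, 1100),
                   (1, 1050), (1, 980), (1, 1020), (1, 990), (1, 200)])
print(round(rate))  # → 999

# Ten repeated measurements with one outlier; the median ignores the
# outlier, and the self-reported 995 caps the final weight.
print(relay_weight([900, 1000, 1100, 1000, 950, 5000,
                    980, 1020, 990, 1010], 995))  # → 995
```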

Requirements based on the notes

https://bw-scanner-notes.readthedocs.io/en/latest/analysis.html, https://bw-scanner-notes.readthedocs.io/en/latest/requirements.html

Last modified on Mar 18, 2018, 8:59:21 PM