Ticket #2861: bwauth-spec.txt

File bwauth-spec.txt, 13.3 KB (added by karsten, 8 years ago)

First draft of a bandwidth scanner specification

Line 
1
2                      Bandwidth Scanner specification
3
4
5          "This is Fail City and sqlalchemy is running for mayor"
6                                  - or -
7   How to Understand What The Heck the Tor Bandwidth Scanners are Doing
8
9
10                             Karsten Loesing
11                                Mike Perry
12
130. Preliminaries
14
15   The Tor bandwidth scanners measure the bandwidth of relays in the Tor
16   network to adjust the relays' self-advertised bandwidth values.  The
17   bandwidth scanners are run by a subset of Tor directory authorities
18   which include the results in their network status votes.  Consensus
19   bandwidth weights are then used by Tor clients to make better path
20   selection decisions.  The outcome is a better load balanced Tor network
21   with a more efficient use of the available bandwidth capacity by users.
22
23   This document describes the implementation of the bandwidth scanners as
24   part of the Torflow and TorCtl packages.  This document has two main
25   sections:
26
27    - Section 1 covers the operation of the continuously running bandwidth
28      scanners to split the set of running relays into workable subsets,
29      select two-hop paths between these relays, perform downloads, and
30      write performance results to disk.
31
32    - Section 2 describes the periodically run step to aggregate results
33      in order to include them in the network status voting process.
34
35   The "interfaces" of this document are Tor's control and SOCKS protocol
36   for performing measurements and Tor's directory protocol for including
37   results int the network status voting process.
38
39   The focus of this document is the functionality of the bandwidth
40   scanners in their default configuration.  Whenever there are
41   configuration options that significantly change behavior, this is
42   noted.  But this document is not a manual and does not describe any
43   configuration options in detail.  Refer to README.BwAuthorities for the
44   operation of bandwidth scanners.
45
461. Measuring relay bandwidth
47
48   Every directory authority that wants to include bandwidth scanner
49   results in its vote operates a set of four bandwidth scanners running
50   in parallel.  These bandwidth scanners divide the Tor network into four
51   partitions from fastest to slowest relays and continuously measure the
52   relays' bandwidth capacity.  Each bandwidth scanner runs the steps as
53   described in this section.  The results of all four bandwidth scanners
54   are periodically aggregated as described in the next section.
55
561.1. Configuring and running a Tor client
57
58   All four bandwidth scanners use a single Tor client for their
59   measurements.  This Tor client has two non-standard configuration
60   options set.  The first:
61
62      FetchUselessDescriptors 1
63
64   configures Tor to fetch descriptors of non-running relays.  The second:
65
66      __LeaveStreamsUnattached 1
67
68   instructs Tor to leave streams unattached and let the controller attach
69   new streams to circuits.
70#
71# Why does bwauthority.py set and reset these configuration options when
72# the provided torrc already contains them?  Particularly the resetting
73# part seems to be broken, because bwauthority.py sets
74# __LeaveStreamsUnattached 0 even though other scanners might still be
75# running.  The whole code should be removed from bwauthority.py.  -KL
76
771.2. Connecting to Tor via its control port
78
79   At startup, the bandwidth scanners connect to the Tor client via its
80   control port using cookie authentication.  The bandwidth scanners
81   register for events of the following types:
82
83    - NEWCONSENSUS
84    - NEWDESC
85    - CIRC
86    - STREAM
87    - BW
88    - STREAM_BW
89
90   These events are used to learn about updated Tor directory information
91   and about measurement progress.
92
931.3. Selecting slices of relays
94
95   Each of the four bandwidth scanners is responsible for a subset of
96   running relays, determined by a fixed percentile range of bandwidths
97   listed in the network status consensus.  By default the four scanners
98   are responsible for the relays with consensus bandwidth:
99
100    1. from  0th to  12th percentile (fastest relays),
101    2. from 12th to  35th percentile (fast relays),
102    3. from 35th to  60th percentile (slow relays), and
103    4. from 60th to 100th percentile (slowest relays).
104
105   The bandwidth scanners further subdivide the share of relays they are
106   responsible for into slices of 50 relays to perform measurements.
107
108   A slice does not consist of 50 fixed relays, but is defined by a
109   percentile range containing 50 relays.  The lower bound of the
110   percentile range equals the former upper bound of the previous slice or
111   0 if this is the first slice.  The upper bound is determined from the
112   network status consensus at the time of starting the slice.  The upper
113   percentile may exceed the percentile range that the bandwidth scanner
114   is responsible for, whereas the lower percentile isn't.  The set of
115   relays contained in the slice can change arbitrary often while
116   performing measurements.
117#
118# What if we approach the upper bound of the interval we're responsible
119# for and there are no 50 relays left?  Is the last slice going to have
120# fewer relays, or do we decrease the lower percentile until we have 50
121# relays?  Example: There are 101 relays between 60th and 100th
122# percentile, and we just finished relays 51 to 100.  Is the next slice
123# going to have only 1 relay?  I saw output files from 100th to 102nd
124# percentile on gabelmoo.  How's that possible?  -KL
125#
126# The paragraph above contains a lot of guesswork and may be completely
127# wrong.  But we need some definition of what relays are contained in a
128# slice and whether membership can change over time.  -KL
129
130   A bandwidth scanner keeps measuring the bandwidth of the relays in a
131   slice until:
132
133    - every relay in the slice has been selected for measurement at least
134      5 times, and
135
136    - the number of successful fetches is at least 65% of the possible
137      path combinations (5 x number of relays / 2).
138
139   Note that the second requirement makes no assumptions about successful
140   fetches for a given relay or path.  It is just an abstract number to
141   avoid skipping slices in case of temporary network failure.
142#
143# If selection is random, isn't there a small chance of never picking a
144# relay and never reaching the 5 measurements for this relay?  -KL
145
1461.4. Selecting paths for measurements
147
148   Before selecting a new path for a measurement, a bandwidth scanner
149   makes sure that it has a valid consensus, and if it doesn't, it waits
150   for the Tor client to provide one.
151
152   The bandwidth scanners also check the local system time and avoid
153   starting new measurements between 01:30 and 04:30 local time.
154#
155# Why do the authorities sleep for three hours in the *default*
156# configuration?  It seems useful to have this as a configuration option,
157# but why is it enabled by default?  -KL
158#
159# It seems that after waking up from this 3 hour break, we don't wait for
160# a valid consensus.  Should we?  -KL
161
162   The bandwidth scanners then select a path and instruct Tor to build a
163   circuit that meets the following requirements:
164
165    - All relays for the new path need to be members of the current slice.
166
167    - The minimum consensus bandwidth for relays to be selected is 1
168      KiB/s.
169
170    - Path length is always 2.
171
172    - Selection is uniform, that is, there is no preference for relays,
173      e.g., based on bandwidth.
174
175    - Relays in the paths must come from different /16 subnets.
176
177    - Entry relays must have the Running and Fast flags and must not
178      permit exiting to 255.255.255.255:443.
179
180    - Exit relays must have the Running and Fast flags, must not have the
181      BadExit flag, and must permit exiting to 255.255.255.255:443.
182#
183# If the Fast flag is really required for both positions, does this mean
184# that non-Fast relays are not measured?  How does this work with the
185# criteria to consider a slice finished?  And what if the criteria for
186# assigning the Fast flag are tightened in the future?  -KL
187#
188# The sets of entry and exit relays don't overlap, right?  What if a slice
189# of 50 relays has entry or exit relays, but none of the other set?
190# Right, it's highly unlikely, but does this mean we wouldn't measure
191# anything?  -KL
192#
193# There's even more guesswork involved here.  This needs review!  -KL
194
1951.5. Performing measurements
196
197   Once the circuit is built, the bandwidth scanners download a test file
198   via Tor's SOCKS port using SOCKS protocol version 5.
199
200   All downloads go to same bandwidth authority server.
201
202   All requests are sent to port 443 using https to avoid caching on the
203   exit relay.
204#
205# Do the bandwidth scanners check the result size and/or the bandwidth
206# authority certificate somewhere?  If not, should they?  Otherwise,
207# malicious exits could manipulate their bandwidth weights too easily.
208# -KL
209
210   The requested resource for performing the measurement varies with the
211   lower percentile of the slice under investigation.  The default file
212   sizes by lower percentiles are:
213
214    -  0th to  10th percentile:   2 MiB
215    - 10th to  20th percentile:   1 MiB
216    - 20th to  30th percentile: 512 KiB
217    - 30th to  50th percentile: 256 KiB
218    - 50th to 100th percentile: 128 KiB
219#
220# In choose_url(), we raise an exception saying that no nodes are left for
221# the URL choice, but really we can only run into this exception when we
222# pass a value > 100 for percentile.  -KL
223
224   The bandwidth scanners use the following fixed user-agent string for
225   their requests:
226
227      Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; \
228      .NET CLR 1.0.3705; .NET CLR 1.1.4322)
229
230   Unfinished downloads are aborted after 30 minutes.
231#
232# That's a pretty high timeout, right?  This can slow us down
233# significantly, given that downloads are not run in parallel for a given
234# bandwidth scanner.  A better timeout might be 10 or 15 minutes.  -KL
235#
236# There's a code line "if ret == 1 and build_exit:" with the else case
237# including build_exit in the log message.  What if ret == 0 and
238# build_exit is null?  -KL
239
240   For each download, the bandwidth scanners collect the following data:
241#
242# TODO Most of this happens in TorCtl-land that I'm even less familiar
243# with than with Torflow-land.  Mike, can you give me some pointers what
244# code parts to look at in order to understand which Tor controller events
245# are processed where and what we learn from them?  -KL
246
2471.6. Writing measurement results
248
249   Once a bandwidth scanner has completed a slice of relays, it writes the
250   measurement results to disk.
251
252   The output file contains information about the slice number, the
253   timestamp of completing the slice, and the measurement results for the
254   measured relays.
255
256   Only relays with at least 1 successful measurement, non-negative
257   filtered stream bandwidth, and non-negative stream bandwidth are
258   included in the output file.
259#
260# What's the difference between stream and filtered stream?  -KL
261
262   The filename of an output file is derived from the lower and upper
263   slice percentiles and the measurement completion time.  The format is
264
265      "bws-" lower percentile ":" upper percentile "-done-" timestamp
266
267   Both lower and upper percentiles are decimal numbers rounded to 1
268   decimal place.  The timestamp is formatted "YYYY-MM-DD-HH:MM:SS".
269
270   The first line of an output file contains the slice number:
271
272      "slicenum=" slice number NL
273
274   The second line contains the UNIX timestamp when the output file was
275   written:
276
277      timestamp NL
278
279   Subsequent lines contain the measurement results of all relays in the
280   slice in arbitrary order.  There can be at most one such line per relay
281   identity:
282
283      "node_id=" fingerprint SP
284      "nick=" nickname SP
285      "strm_bw=" stream bandwidth SP
286      "filt_bw=" filtered stream bandwidth SP
287      "desc_bw=" descriptor bandwidth SP
288      "ns_bw=" network status bandwidth NL
289
290   The meaning of these fields is as follows: fingerprint is the
291   hex-encoded, upper-case relay identity fingerprint; nickname is the
292   relay's nickname; stream bandwidth and filtered stream bandwidth
293   contain the average measurements; descriptor bandwidth is the average
294   self-advertised bandwidth contained in relay descriptors; and network
295   status bandwidth is the average relay bandwidth contained in network
296   status consensuses.
297#
298# Which nickname is chosen here if a relay changes its nickname between
299# two measurements?  Does it matter?  -KL
300#
301# Starting to count slices at 0 whenever we start at the lower end of our
302# percentile range seems error-prone.  What if the number of slices
303# changes while we're only half through with all slices?  Isn't there a
304# potential from overlooking results?  Or do we not care about the slice
305# number when aggregating results?  -KL
306
3072. Aggregating scanner results
308
309   Every few hours, the bandwidth scanner results are aggregated in order
310   to include them in the network status consensus process.  This
311   aggregation step looks at the finished measurements, ....
312
3132.1. Connecting to Tor client
314
315# The aggregate script connects to the same Tor client that bandwidth
316# scanners use and requests the currently valid network status consensus
317# from it.  Does that mean we won't have an opinion on relays that are
318# offline right now?  -KL
319
3202.2. [...]
321
322# BETA, GUARD_BETA, ALPHA, and GUARD_ALPHA are all set to 0 in the default
323# configuration.  Is the plan to change their values and use the more
324# complex aggregation mechanism anytime soon?  Or were they only in the
325# code to run experiments and should go away?  -KL
326