Ticket #29315: guidelines.txt

File guidelines.txt, 8.4 KB (added by karsten, 4 months ago)

Initial guidelines for adding stats

Guidelines for adding data to Tor Metrics

You're developing a tool that is in some way related to the publicly deployed
Tor network?

You want to give us your data, so that we can archive, publish, and possibly
aggregate and visualize it?

Sounds great! We should work together on this! Here are some guidelines for
making this process as smooth as possible.

What data belongs on Tor Metrics?

 - If it happens in the publicly deployed Tor network, it likely belongs on Tor
   Metrics.

 - If it happens for a short term only, like for a research project, it's
   unlikely to be worth the effort to have Tor Metrics archive, publish,
   aggregate, and visualize it. In this case you should collect the data
   yourself (keeping in mind research ethics!), and we can later talk about
   linking to it or even using it as external data.

 - If your data is a combination of existing data on Tor Metrics plus maybe
   external data, we shouldn't add it, either. In such a case we should rather
   talk about extending our services towards what your service does, if that
   makes sense.

What data do you want to see on Tor Metrics?

 - What is your data about? Is it about servers or users or both? Is it
   passively gathered or actively measured or both?

 - Is there a way for you to aggregate the data before you hand it over to us?
   Of course this requires more thinking upfront, but it's a great way to avoid
   giving overly sensitive data to us or anyone else. It's not always possible
   or even useful to aggregate data and discard the original data, though. Two
   examples:

   - Relays count how many clients download the consensus from them and from
     which country they connect. When 24 hours have passed, they include the
     count by country in their next extra-info descriptor. This is aggregated
     data. The obviously more sensitive, non-aggregated variant would be for
     relays to provide a log of clients downloading consensuses.

   - The torproject.org webservers keep highly sanitized logs of web clients
     making requests to them that we sanitize even more before we archive them.
     This is non-aggregated data. The possibly less sensitive aggregated variant
     would be for webservers to count requests by requested URL or similar.

 - Is the data you're planning to give us too sensitive? If so, can you sanitize
   it yourself before giving it to us (we can help you with that), or does the
   sanitizing need to happen on our side (we should still involve you in this
   case)?

 - When is your data available and for how long? Ideally, we'd survive reboots
   or downtimes on our side for up to 72 hours without losing any of your data.
   Typically, you'd implement this using a cache. If that is hard or impossible
   to do on your side, we'll have to think about adding redundancy on our side.
   That's all possible and we have done it before; it'll just make the process
   take longer.

 - Do you expect any difficulties on our side to write code that processes your
   data? If we only need to fetch and store your data, probably not. But if we
   have to inflate, parse, verify, combine, sanitize, split, and deflate your
   data, maybe. And if we need to include fancy crypto libraries in order to
   process your data, then for sure. Any intuitions you have about possible
   difficulties would be good to know, even if things turn out to be easier in
   the end.

 - How much data do you think you'll give us over the next five years? A
   ballpark figure is fine, like the number of bytes as a power of ten.
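
To make the aggregation idea from the relay example above a bit more concrete,
here is a minimal Python sketch. It is not how relays actually implement their
statistics. The function name, the input format, and the sample values are all
made up for illustration. The point is only that per-request details are
reduced to per-country counts for one 24-hour interval before anything leaves
your side.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def aggregate_by_country(observations, interval_start,
                         interval=timedelta(hours=24)):
    """Count observations per country within one interval.

    `observations` is a hypothetical iterable of (timestamp, country code)
    pairs. Everything except the per-country count is discarded.
    """
    interval_end = interval_start + interval
    counts = Counter()
    for seen_at, country in observations:
        if interval_start <= seen_at < interval_end:
            counts[country] += 1
    return dict(counts)


# Made-up sample data, only to show the shape of the result.
start = datetime(2019, 2, 1, tzinfo=timezone.utc)
observations = [
    (start + timedelta(hours=1), "de"),
    (start + timedelta(hours=2), "us"),
    (start + timedelta(hours=3), "de"),
    (start + timedelta(hours=30), "fr"),  # outside the 24-hour interval
]
print(aggregate_by_country(observations, start))  # {'de': 2, 'us': 1}
```

Only the resulting dictionary would be handed over. The list of individual
observations, which is the sensitive part, never needs to be stored or shared.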

What belongs in the data format for the data to be archived?

 - Timestamp: We're using the timestamp to place the data item into the right
   archive file, among other things. Exception: microdescriptors do not contain
   a timestamp, which makes them a pain to archive.

 - Source identifier: Ideally, we'd expect a cryptographic identifier of the
   source, but if that is not available, any identifier will do. Exception: exit
   lists do not contain a source identifier, because there happened to be just
   one exit list scanner in the network; you can see how this doesn't scale so
   well.

 - Signature: The signature is the proof that the source produced the data item,
   not us. And even if we don't verify all signatures, others might want to do
   that. Exception: hello, exit list, you again!
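
As a rough illustration of the three fields above, here is a small Python
sketch of a data item and of how a timestamp could decide which archive file an
item goes into. Everything in it is an assumption made for the example: the
class and field names, the "example-data" prefix, and the monthly file naming
are not taken from any existing Tor Metrics data format. It also only checks
that a signature is present. It does not verify it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DataItem:
    """Hypothetical data item with the three recommended fields."""
    published: Optional[datetime]  # timestamp of the data item, in UTC
    source_id: str                 # identifier of the source
    signature: str                 # signature made by the source
    payload: str                   # the actual data


def missing_fields(item):
    """Return the names of recommended fields that are absent or empty."""
    missing = []
    if item.published is None:
        missing.append("published")
    if not item.source_id:
        missing.append("source_id")
    if not item.signature:
        missing.append("signature")
    return missing


def archive_name(item, prefix="example-data"):
    """Derive a made-up monthly archive file name from the timestamp."""
    if item.published is None:
        raise ValueError("cannot pick an archive file without a timestamp")
    return f"{prefix}-{item.published:%Y-%m}.tar.xz"


# Made-up sample item that lacks a signature.
item = DataItem(
    published=datetime(2019, 2, 7, 12, 0, tzinfo=timezone.utc),
    source_id="A1B2C3",
    signature="",
    payload="...",
)
print(missing_fields(item))  # ['signature']
print(archive_name(item))    # example-data-2019-02.tar.xz
```

An item without a timestamp makes archive_name() fail, which mirrors the
microdescriptor problem mentioned above: there is no obvious file to put it in.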

You're still reading, so it seems that we caught your interest! How should we
start?

 - Is the data already publicly available somewhere and all you want is to
   discuss a way to include it in Tor Metrics? That's easy then. Just share with
   us what you have and we can talk.

 - If the data is not public yet, do you maybe have a data format that we can
   discuss? Bonus points if it comes with samples, but only if you're absolutely
   certain that the data is safe to be published.

 - If you have none of the above, can you share logs with us, so that we can
   help you derive a possible data format? They don't need to be recent logs
   (even though time might not magically make your data safe to be published).
   You could edit the logs and take out any parts you think are too sensitive.
   And you should encrypt the data before sending it to us.

 - If you have nothing at all yet, let's talk anyway. Describe to us what you
   think would be good to include in Tor Metrics, and we'll figure something
   out.

What are we going to do?

It's a process to get your data on Tor Metrics, and not a short one. Let's go
through the necessary steps for doing it. After each step we should together
decide whether we're ready to move forward, need to take a step back, or maybe
even stop the project, because we found out that it's not what we wanted.

 - If you can, give us a few months as a heads-up. Ideally, it won't take us
   that long to do this project, but we'd prefer to make room for it in our next
   six-month roadmap. Otherwise we might not be able to do it right away.

 - We discuss your data format with you and other Tor developers on the public
   tor-dev@ mailing list. Maybe you or we need to write a Tor proposal for this.

 - We write a documentation page for the data format plus any necessary
   sanitizing steps. See the Tor Metrics website and the tor-spec Git repository
   for a couple of examples.

 - We write code for metrics-lib and/or Stem to parse your data and verify the
   data format. At this point we'll find out if there are any misunderstandings
   regarding data types or data structure that we haven't seen before.

 - We write code for CollecTor to fetch and archive your data, but without
   publishing just yet. As part of this we also agree on file names and URLs
   where your data will later be available.

 - We make a one-time visualization using your data, mostly as a sanity check.
   You'd be surprised how many issues are hiding well enough that we would
   otherwise not find them.

 - At this point we can think about adding your data to our services like
   Onionoo, Relay Search, and ExoneraTor and our visualizations on Tor Metrics.
   Typically, we'd do that as a separate project, though.

 - Finally we make your data available for download on CollecTor and put the
   documentation on the Tor Metrics website. We announce that your data is now
   on Tor Metrics.

What next?

Congratulations, your data is now on Tor Metrics. But that's not the end of the
story! Here's what we need you to do as long as we have your data:

 - Make sure that we always get the data by whatever means we came up with
   together. Avoid longer downtimes and fix any related issues in a timely
   fashion. We do care about this, because people will come to us and complain
   that "our" data is not up-to-date, when it may in fact be your fault.

 - If you're planning to make any changes that affect the data format or the way
   the data comes to us, talk to us beforehand with enough time to make such
   changes. Several weeks in advance would be good, because we may have to
   inform our users about upcoming changes and give them some time to update
   their tools.

 - Let's be honest: we have had to remove data from Tor Metrics in the past,
   because the services providing it had become unreliable or unmaintained. In
   such a case we'd talk to you and try to improve the situation. But if that
   doesn't work, we'd remove your data from Tor Metrics with enough heads-up
   time for you and others to prepare. We'd very likely archive your data and
   keep it around in such a case. Sorry, and thanks for understanding!