Opened 5 months ago

Last modified 8 weeks ago

#29315 accepted enhancement

Write down guidelines for adding new stats

Reported by: karsten
Owned by: irl
Priority: Very High
Milestone:
Component: Metrics/Website
Version:
Severity: Normal
Keywords: metrics-roadmap-2019-q2
Cc: metrics-team, dgoulet, ahf, juga, teor, phw
Actual Points:
Parent ID:
Points: 3
Reviewer:
Sponsor:

Description

We're going to add a few new stats to Tor Metrics in the coming months: BridgeDB, PrivCount, sbws, and maybe more, in no specific order.

Let's write down some initial guidelines on what we expect and what others can expect. And let's refine these initial guidelines as we add some actual stats. Once we're happy with them, we should put them on Tor Metrics.

Assigning this ticket to myself for now, as I'm going to post a first draft soon. Cc'ing a few folks who we'll be working together with on adding new data.

Attachments (1)

guidelines.txt (8.4 KB) - added by karsten 5 months ago.
Initial guidelines for adding stats

Change History (17)

Changed 5 months ago by karsten

Attachment: guidelines.txt added

Initial guidelines for adding stats

comment:1 Changed 5 months ago by karsten

Status: assigned → needs_review

There, attached a first draft. Please let me know what you think!

comment:2 Changed 4 months ago by juga

Read it, sounds good to me.
I'd just add the links to all the projects you mention and Tor proposals.
Nitpick: "your data", I never thought I was asking metrics to publish "my data" but rather Tor network data.

comment:3 in reply to:  2 ; Changed 4 months ago by karsten

Replying to juga:

Read it, sounds good to me.
I'd just add the links to all the projects you mention and Tor proposals.

Yep, agreed.

Nitpick: "your data", I never thought I was asking metrics to publish "my data" but rather Tor network data.

Ah, interesting point.

The reason why the document says "you" and "me" is that it's basically the transcription of an imaginary conversation I had with dgoulet or ahf or teor or you. And of course I didn't mean "your data" as in "juga's data", but I think you didn't mean that either.

The decision of how to refer to the tool providing data is an important one. I wanted to make clear that Tor Metrics does not suddenly become the owner of the tool, neither with respect to its code nor to its operation, not even of the part that exports data to Tor Metrics. There must be a clear interface, and the owner of the tool providing the data will always stay in charge of the parts beyond that interface from a Tor Metrics point of view. It's not that Tor network data magically lands on Tor Metrics and nobody feels responsible anymore. I think I implicitly wanted to make that clear by using "you" and "we" in the document.

I'm open to different suggestions, though. This is just the very first draft, and we'll improve it as we add more of "your" Tor network data to Tor Metrics. :)

Thanks for the feedback!

comment:4 in reply to:  3 Changed 4 months ago by juga

Replying to karsten:
[...]

Nitpick: "your data", I never thought I was asking metrics to publish "my data" but rather Tor network data.

Ah, interesting point.

The reason why the document says "you" and "me" is that it's basically the transcription of an imaginary conversation I had with dgoulet or ahf or teor or you. And of course I didn't mean "your data" as in "juga's data", but I think you didn't mean that either.

The decision of how to refer to the tool providing data is an important one. I wanted to make clear that Tor Metrics does not suddenly become the owner of the tool, neither with respect to its code nor to its operation, not even of the part that exports data to Tor Metrics. There must be a clear interface, and the owner of the tool providing the data will always stay in charge of the parts beyond that interface from a Tor Metrics point of view. It's not that Tor network data magically lands on Tor Metrics and nobody feels responsible anymore. I think I implicitly wanted to make that clear by using "you" and "we" in the document.

That makes sense. I also think that the conversation style probably makes it easier to read.

I'm open to different suggestions, though. This is just the very first draft, and we'll improve it as we add more of "your" Tor network data to Tor Metrics. :)

:)
Certainly seems hard to come up with an alternative without making the style more impersonal or using longer sentences.
The only thing I can come up with now is adding a note after the first "your data", something like:
"[*] by your data we don't mean the data belongs to you, but you're one of the persons working on the source that produces the data."
Still fine if you leave it without any note.

comment:5 Changed 4 months ago by irl

Keywords: metrics-roadmap-2019-q2 added

comment:6 Changed 4 months ago by irl

Priority: Medium → Very High

comment:7 Changed 4 months ago by irl

Reviewer: irl

Planned for review party tomorrow.

comment:8 Changed 4 months ago by irl

Status: needs_review → needs_revision

This is a good start and gets all the right questions asked but I think we can be a bit clearer about what we think the answers to those questions should look like. For example, we don't want to add YAML/TOML/HCL parser libraries if we can avoid it.


What belongs into the data format for the data to be archived?

  • The name of software that produced the result.
  • The version of the software that produced the result.
  • If performing active measurement, the network location of the vantage point (e.g. IP address, ASN, and/or country).
  • If performing passive measurement, an identifier for the object of the observations (may be implicit if the same as the source identifier).
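
As a rough sketch only, the four items above could be modeled as a small header that travels with every archived result. The class name, the field names, and the sample values (`sbws`, `1.0.0`, the documentation ASN `AS64496`) are assumptions made up for illustration; they are not taken from the draft guidelines or from any existing format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MeasurementHeader:
    """Hypothetical header carried by every archived result."""
    software_name: str
    software_version: str
    # Active measurement: network location of the vantage point,
    # e.g. an IP address, ASN, and/or country.
    vantage_point: Optional[str] = None
    # Passive measurement: identifier of the observed object; may be
    # left out when it is the same as the source identifier.
    observed_object: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Return the names of always-required fields that are empty."""
        return [name for name in ("software_name", "software_version")
                if not getattr(self, name)]


# Illustrative values only.
header = MeasurementHeader("sbws", "1.0.0", vantage_point="AS64496")
print(header.missing_fields())
```

A check like `missing_fields()` could run before a document is accepted for archiving, so that results without a software name or version are rejected early.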

I think we should be clearer about what exactly it is that we can do. We can archive data without doing anything else with it and our bar for that should be a lot lower than committing to analyzing and visualizing the data on an ongoing basis.

What data belongs on Tor Metrics?

This I think should be changed to two sections "What data belongs in CollecTor?" and "What visualizations belong on the Tor Metrics Portal?".

While we've decided not to do the OONI visualization for now, it would belong there even though that data is external.


Would you like to have another pass or should I make these changes and then attach a new revision?

comment:9 in reply to:  8 Changed 4 months ago by juga

Replying to irl:

Would you like to have another pass or should I make these changes and then attach a new revision?

Hmm, wouldn't it be easier for you both (or others) to use a git repository to edit this file?
Ignore the comment if it's just easier with an attachment.

comment:10 Changed 4 months ago by irl

Git for a single text file probably wouldn't make things easier. We could make a pad though.

It would be a good idea to work out what we will do with this document once we're done with it. Does it just go on the website?

comment:11 Changed 4 months ago by karsten

We can make a pad. I'll start one tomorrow and make edits there. My original idea was to put it on the Tor Metrics website once we're done. Or should it rather go elsewhere?

Will work on the earlier comments tomorrow!

comment:12 Changed 3 months ago by phw

Cc: phw added

comment:13 Changed 3 months ago by phw

I just read the guidelines because I want to get BridgeDB statistics into Tor Metrics. Overall, the guidelines were helpful! However, I would have liked to read more about how to move my data over to you. As I understand it, I make my data available (e.g., over a web server?) and you then fetch it, right?

comment:14 Changed 2 months ago by irl

I would like for these systems to be as open/transparent as is possible. The demarcation between a system that collects metrics and Tor Metrics should not just be for Tor Metrics. Anyone should be able to do what Tor Metrics does. This means that services publish data, and we pull from the service.

It does not need to be a web server. If there is not already a web server, then a Gopher server or a TCP port that dumps out the document is also fine as far as I'm concerned; maybe karsten has other opinions.
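
To make the "services publish, we pull" idea concrete, here is a minimal sketch of the pulling side for the plain web server case. The function names, the size limit, and the URL are hypothetical; nothing here describes how CollecTor actually fetches data. The fetcher is passed in as a parameter so the same code would work with any other transport and can be tried without network access.

```python
import urllib.request
from typing import Callable


def http_get(url: str, timeout: float = 30.0) -> bytes:
    """Fetch a URL with the standard library and return the raw body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def pull_document(url: str,
                  fetch: Callable[[str], bytes] = http_get,
                  max_bytes: int = 10 * 1024 * 1024) -> bytes:
    """Pull one published document, rejecting empty or oversized bodies."""
    body = fetch(url)
    if not body:
        raise ValueError("empty document from " + url)
    if len(body) > max_bytes:
        raise ValueError("document from " + url + " exceeds size limit")
    return body


# Offline usage with a stand-in fetcher; the URL is a placeholder.
def fake_fetch(url: str) -> bytes:
    return b"example-stats 1\n"


document = pull_document("https://stats.example.org/latest", fetch=fake_fetch)
print(len(document))
```

Because anyone can call the same URL, this keeps the interface identical for Tor Metrics and for third parties.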

Increasingly I'm thinking that the Tor directory protocol meta format is a good format to have metrics in. We already have parsers for these that are fast and efficient, and it's easier to detect errors due to the strict format (even if #30105 and similar things sometimes slip through). The document format also provides for signing of documents, which I'd like to see more of our data sources doing. #29624 is looking at defining a new format for exit lists, and is using the meta format with Ed25519 signature.
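
For readers who have not seen the meta format, the sketch below shows the kind of strict, line-based parsing it allows. It is a simplified reading of the format (keyword lines, each optionally followed by one `-----BEGIN ...-----` object) and not a replacement for the existing parsers. The `example-stats` document and its keywords are invented for illustration, and signature verification is left out entirely.

```python
import re

# A keyword, optionally followed by whitespace and arguments.
KEYWORD_LINE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9-]*)(?:[ \t]+(.*))?$")
BEGIN_LINE = re.compile(r"^-----BEGIN ([A-Za-z0-9 -]+)-----$")


def parse_items(text: str) -> list:
    """Parse a document into (keyword, arguments, object_lines) tuples.

    Raises ValueError on the first line that does not fit the format.
    """
    items = []
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty string after the final newline
    i = 0
    while i < len(lines):
        match = KEYWORD_LINE.match(lines[i])
        if match is None:
            raise ValueError("line %d: not a keyword line" % (i + 1))
        keyword, arguments = match.group(1), match.group(2) or ""
        i += 1
        object_lines = []
        begin = BEGIN_LINE.match(lines[i]) if i < len(lines) else None
        if begin is not None:
            end_line = "-----END %s-----" % begin.group(1)
            i += 1
            while i < len(lines) and lines[i] != end_line:
                object_lines.append(lines[i])
                i += 1
            if i == len(lines):
                raise ValueError("unterminated object after " + keyword)
            i += 1  # skip the END line
        items.append((keyword, arguments, object_lines))
    return items


# Invented sample document; the object body is a placeholder.
sample = (
    "example-stats 1\n"
    "published 2019-06-01 12:00:00\n"
    "signature\n"
    "-----BEGIN SIGNATURE-----\n"
    "AAAA\n"
    "-----END SIGNATURE-----\n"
)
for item in parse_items(sample):
    print(item)
```

Any blank or malformed line raises an error immediately, which is the error-detection property mentioned above.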

comment:15 in reply to:  14 ; Changed 8 weeks ago by karsten

Replying to irl:

I would like for these systems to be as open/transparent as is possible. The demarcation between a system that collects metrics and Tor Metrics should not just be for Tor Metrics. Anyone should be able to do what Tor Metrics does. This means that services publish data, and we pull from the service.

This sounds like a fine recommendation where this is possible. If a system can sanitize its data by itself before making it available to us and others, great! Let's just be clear that we're shifting complexity and maintenance work from Tor Metrics to services run by others. If they have the resources to do this, okay.

But let's consider whether we want to make this a hard requirement. There may be services where we're glad that somebody runs them and where we cannot expect them to also run sanitizing code. The options in such a case are that we either don't get the data, or we sanitize it somewhere. And if we can choose where to sanitize it, we can either do it as part of a CollecTor module or in a separate tool run on the host that also runs the service. In either case we're providing the sanitized data to others who can then do everything that Tor Metrics does.

However, we discussed this topic before, and it seems we still do not quite agree. Would it help if we made this a hard requirement with the caveat that, if somebody cannot run sanitizing code, we run it on a machine that is not officially part of Tor Metrics?

It does not need to be a web server. If there is not already a web server, then a Gopher server or a TCP port that dumps out the document is also fine as far as I'm concerned; maybe karsten has other opinions.

Gopher? My initial reaction is that we shouldn't fall into the same esotericism trap in which we lost the Haskell-written TorDNSEL.

I'd say let's strongly recommend a web server, and if that's not possible, talk to folks.

Increasingly I'm thinking that the Tor directory protocol meta format is a good format to have metrics in. We already have parsers for these that are fast and efficient, and it's easier to detect errors due to the strict format (even if #30105 and similar things sometimes slip through). The document format also provides for signing of documents, which I'd like to see more of our data sources doing. #29624 is looking at defining a new format for exit lists, and is using the meta format with Ed25519 signature.

Sounds good to me, as a recommendation that likely works for most new formats. For example, having sanitized web server logs in the Apache format made sense, because then it was possible to use existing tools to process them. But yes, for most formats this is a fine recommendation.

Would you mind taking the draft and the comments above and writing an updated draft? I feel like if I continue owning this task, we'll need more review rounds. Let me know!

comment:16 in reply to:  15 Changed 8 weeks ago by irl

Owner: changed from karsten to irl
Reviewer: irl
Status: needs_revision → accepted

Replying to karsten:

Replying to irl:

I would like for these systems to be as open/transparent as is possible. The demarcation between a system that collects metrics and Tor Metrics should not just be for Tor Metrics. Anyone should be able to do what Tor Metrics does. This means that services publish data, and we pull from the service.

This sounds like a fine recommendation where this is possible. If a system can sanitize its data by itself before making it available to us and others, great! Let's just be clear that we're shifting complexity and maintenance work from Tor Metrics to services run by others. If they have the resources to do this, okay.

But let's consider whether we want to make this a hard requirement. There may be services where we're glad that somebody runs them and where we cannot expect them to also run sanitizing code. The options in such a case are that we either don't get the data, or we sanitize it somewhere. And if we can choose where to sanitize it, we can either do it as part of a CollecTor module or in a separate tool run on the host that also runs the service. In either case we're providing the sanitized data to others who can then do everything that Tor Metrics does.

If we are not going to make it work any other way, it is probably better to do the sanitizing in CollecTor than to run it on another machine, as this might split our focus and end up with us making mistakes. We could make it a "very strong" recommendation, but then fall back to doing the sanitizing in CollecTor as a last resort.

It does not need to be a web server. If there is not already a web server, then a Gopher server or a TCP port that dumps out the document is also fine as far as I'm concerned; maybe karsten has other opinions.

Gopher? My initial reaction is that we shouldn't fall into the same esotericism trap in which we lost the Haskell-written TorDNSEL.

Good point. However, what do we mean when we say "web server"? Would we accept a server that only allows SPDY/3 for example? We should pick some client libraries that must have support for at least one of the available protocols.

Increasingly I'm thinking that the Tor directory protocol meta format is a good format to have metrics in. We already have parsers for these that are fast and efficient, and it's easier to detect errors due to the strict format (even if #30105 and similar things sometimes slip through). The document format also provides for signing of documents, which I'd like to see more of our data sources doing. #29624 is looking at defining a new format for exit lists, and is using the meta format with Ed25519 signature.

Sounds good to me, as a recommendation that likely works for most new formats. For example, having sanitized web server logs in the Apache format made sense, because then it was possible to use existing tools to process them. But yes, for most formats this is a fine recommendation.

Right, if there are already well-defined formats for certain structured data we should reuse those.

Would you mind taking the draft and the comments above and writing an updated draft? I feel like if I continue owning this task, we'll need more review rounds. Let me know!

Ok, I'll pick this up and then develop the next version of the draft.
