Opened 10 months ago

Last modified 10 months ago

#29448 new project

Provide a dir-spec implementation that serves sanitised descriptors

Reported by: irl
Owned by: sysrqb
Priority: Low
Milestone:
Component: Circumvention/BridgeDB
Version:
Severity: Normal
Keywords:
Cc: metrics-team, dgoulet, cohosh
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

The Metrics Team currently performs sanitizing of bridge descriptors before publishing them on CollecTor (and subsequently feeding them into other software).

The published descriptors are detailed in:

https://metrics.torproject.org/collector.html#bridge-descriptors

The sanitizing steps are detailed here:

https://metrics.torproject.org/bridge-descriptors.html

The descriptors are transferred to the CollecTor host unsanitized by means of rsyncing a tarball. This violates one of the Tor Metrics principles in that this is a private interface and we are handling sensitive data. While the data is then sanitized and published, it is not possible for others to operate their own CollecTor instance that fetches data directly from the BridgeDB instance. Additionally, this increases code complexity in CollecTor as now we must treat the fetching of relay and bridge descriptors differently.

Ideally the sanitizing steps would be performed by BridgeDB and then we would be able to reuse (at least large chunks of) CollecTor code that currently fetches relay descriptors.
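
To make the proposal a little more concrete, here is a minimal sketch (not CollecTor's or BridgeDB's actual code) of two sanitizing-style transformations that BridgeDB could apply before serving descriptors: replacing a bridge fingerprint with a hash of it, and mapping a bridge IPv4 address into 10.0.0.0/8 with a keyed hash. The function names, the choice of SHA-1 and SHA-256, the hash input layout, and the secret are all assumptions made for illustration; the authoritative steps are the ones on the bridge-descriptors page linked above.

```python
import hashlib
import ipaddress


def sanitize_fingerprint(fingerprint_hex: str) -> str:
    """Return the upper-case hex SHA-1 digest of the binary fingerprint.

    Hypothetical helper: hashing hides the real fingerprint while keeping
    a stable identifier for the same bridge across descriptors.
    """
    raw = bytes.fromhex(fingerprint_hex)
    return hashlib.sha1(raw).hexdigest().upper()


def sanitize_ipv4(address: str, fingerprint_hex: str, secret: bytes) -> str:
    """Map an IPv4 address to a stable address in 10.0.0.0/8.

    Hypothetical helper: the input layout (address, fingerprint, secret)
    and the use of SHA-256 are assumptions for this sketch.
    """
    packed = ipaddress.IPv4Address(address).packed
    digest = hashlib.sha256(
        packed + bytes.fromhex(fingerprint_hex) + secret
    ).digest()
    # The first three digest bytes become the last three octets.
    return "10.{}.{}.{}".format(digest[0], digest[1], digest[2])


if __name__ == "__main__":
    fingerprint = "A" * 40  # placeholder fingerprint, not a real bridge
    print(sanitize_fingerprint(fingerprint))
    print(sanitize_ipv4("198.51.100.7", fingerprint, b"example-secret"))
```

Because the secret never leaves the host that runs these functions, a CollecTor instance fetching the output would only ever see the hashed values, which is the property this ticket asks for.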

This is a project that would need co-ordination with the Metrics Team on the best way forward.

Change History (6)

comment:1 Changed 10 months ago by karsten

Priority: Medium → Low
Status: new → needs_information

Uhm, wait, we're not violating one of the Tor Metrics principles here. The principle we put up is that we're not going to use unpublished data. And we're not doing that. We're sanitizing the descriptors before publishing them on CollecTor and before processing them in any other of our tools. We're not touching the unsanitized bridge descriptors for anything else.

So, the goal here is basically to extract the sanitizing code from CollecTor and put it on the BridgeDB host, probably rewritten in a different language. Right?

I can see the benefits you mentioned. I'm all for removing code from our codebase and reducing future maintenance effort!

However, I can also see the downsides: code complexity of BridgeDB will suddenly increase, and whoever runs BridgeDB has one more complex thing to take care of. I'd say, given that we're not violating our principles and that we didn't plan for this work as part of our roadmap, we should set priority to low for now.

Let's also make sure to coordinate with BridgeDB folks before somebody starts writing new code. Setting to needs_information for that last part.

By the way, regardless of this specific situation, this is an interesting discussion for newly added data in general: where do we sanitize data that is too sensitive to be published as is, and who gets to keep that code? Let's discuss that more on #29315.

comment:2 Changed 10 months ago by irl

Is it currently possible for someone to operate their own CollecTor instance and archive bridge descriptors? The answer is no unless they are syncing from our CollecTor instance.

We have access to bridge IPs, which is sensitive information, regardless of whether or not we publish that information. This is a violation of not handling sensitive information.

> So, the goal here is basically to extract the sanitizing code from CollecTor and put it on the BridgeDB host, probably rewritten in a different language. Right?

Yes.

> However, I can also see the downsides: code complexity of BridgeDB will suddenly increase, and whoever runs BridgeDB has one more complex thing to take care of.

We do get the benefit that we no longer have to handle bridge IPs and things are more reproducible. It is also easier for people to run testing BridgeDBs with a testing CollecTor instance. It is also easier for people to run their own production BridgeDBs that we can see statistics of (which is a goal that has been previously discussed, to reduce reliance on the single BridgeDB instance and allow orgs to set up their own).

comment:3 in reply to:  2 Changed 10 months ago by karsten

Replying to irl:

> Is it currently possible for someone to operate their own CollecTor instance and archive bridge descriptors? The answer is no unless they are syncing from our CollecTor instance.

That's true, but it's also an okay answer. If we wanted to fix just this, we could run a separate CollecTor instance with just the bridge descriptor sanitizer on a separate host and have everyone (including us) sync from that. (I'm not suggesting we do that.)

> We have access to bridge IPs, which is sensitive information, regardless of whether or not we publish that information. This is a violation of not handling sensitive information.

While I'd prefer not to handle sensitive information, I don't consider this a violation of a principle. Of course, if the BridgeDB folks would run this sanitizing code, that would mean that fewer people have access to sensitive information. While this is preferable, I wouldn't say that the current setup is bad per se.

> > So, the goal here is basically to extract the sanitizing code from CollecTor and put it on the BridgeDB host, probably rewritten in a different language. Right?
>
> Yes.
>
> > However, I can also see the downsides: code complexity of BridgeDB will suddenly increase, and whoever runs BridgeDB has one more complex thing to take care of.
>
> We do get the benefit that we no longer have to handle bridge IPs and things are more reproducible. It is also easier for people to run testing BridgeDBs with a testing CollecTor instance. It is also easier for people to run their own production BridgeDBs that we can see statistics of (which is a goal that has been previously discussed, to reduce reliance on the single BridgeDB instance and allow orgs to set up their own).

Agreed on these.

However, in the course of adding more pros and cons, I'd like to add another aspect: years ago we had to give up on collecting bridge pool assignments because the BridgeDB folks back then didn't care enough. It would be sad to lose sanitized bridge descriptors because BridgeDB suddenly gets less attention than it should. Syncing unsanitized descriptors seems like a minimal thing that BridgeDB folks can keep up over a long time. If they need to do more, maybe they'll at some point stop doing it at all.

comment:4 Changed 10 months ago by irl

> Syncing unsanitized descriptors seems like a minimal thing that BridgeDB folks can keep up over a long time.

So the reason for this ticket being filed is that the BridgeDB folk were confused by how any of this even worked and contacted me because they thought our current sync mechanism was broken.

> If they need to do more, maybe they'll at some point stop doing it at all.

I'm not convinced by this argument given the capacity problems that also exist in the Metrics Team.

I spoke with dgoulet about this yesterday; the resolution is that moving the sanitizing code is something that we should do, but there are other things that need to be done in BridgeDB first to support that. For now this task is lower priority, but we both agree that it should be done eventually.

Perhaps we should also add another ticket to expose the bridge pool assignments again via a similar mechanism.

comment:5 in reply to:  4 Changed 10 months ago by karsten

Replying to irl:

> > Syncing unsanitized descriptors seems like a minimal thing that BridgeDB folks can keep up over a long time.
>
> So the reason for this ticket being filed is that the BridgeDB folk were confused by how any of this even worked and contacted me because they thought our current sync mechanism was broken.

Okay. Sounds like this should be documented on the BridgeDB side then. At least as long as we do it this way.

> > If they need to do more, maybe they'll at some point stop doing it at all.
>
> I'm not convinced by this argument given the capacity problems that also exist in the Metrics Team.

Keep in mind that bridge auth/BridgeDB ownership has changed several times over the years. And for the short-term future I don't expect to spend much time on this code on the metrics side, which might of course change in 6 or 12 months with the CollecTor rewrite.

> I spoke with dgoulet about this yesterday; the resolution is that moving the sanitizing code is something that we should do, but there are other things that need to be done in BridgeDB first to support that. For now this task is lower priority, but we both agree that it should be done eventually.

Agreed.

> Perhaps we should also add another ticket to expose the bridge pool assignments again via a similar mechanism.

Yes, I'm fine with that. Do you want to add that ticket?

comment:6 Changed 10 months ago by irl

Status: needs_information → new

Filed #29480.
