== Spam Prevention
... a working solution for fetching descriptors from other CollecTor hosts without risking being spammed forever: we simply add multiple @source annotations to a descriptor, one for each source (directory authority IP, other CollecTor host IP,
etc.). If we later find out that one source was spamming us, we can easily delete all descriptors that only have the @source annotation with the spamming host's IP address.
Here's an example of how the tor daemon annotates descriptors:
{{{
@uploaded-at 2016-04-18 18:49:25
@source "81.17.16.43"
router pairoj 81.17.16.43 443 0 80
platform Tor 0.2.6.10 on Linux
[...]
}}}
It's important that we'd only add those @source annotations to
archived descriptors, not to recent descriptors, or we'd serve those
descriptors as new every time we're adding a @source.
It would also be useful to have stats on the number of newly added
@source annotations per hour, so that we learn if we're getting
spammed, and to have a script for deleting descriptors that only have
a given @source annotation.
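For illustration, here is a rough sketch of what such a deletion script could look like, assuming one annotated descriptor per file and the plain-text @source format shown above; the IP address and directory name are placeholders, not actual values:
{{{
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DeleteSpammedDescriptors {

  public static void main(String[] args) throws IOException {
    /* Illustrative values: the spamming host's IP address and the directory
     * of archived, annotated descriptors (one descriptor per file assumed). */
    String spamAnnotation = "@source \"192.0.2.1\"";
    Path archiveDirectory = Paths.get("archive");
    List<Path> files;
    try (Stream<Path> walk = Files.walk(archiveDirectory)) {
      files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
    }
    for (Path file : files) {
      List<String> sources = new ArrayList<>();
      for (String line
          : Files.readAllLines(file, StandardCharsets.ISO_8859_1)) {
        if (line.startsWith("@source ")) {
          sources.add(line.trim());
        }
      }
      /* Delete only if the spamming host is the descriptor's sole source. */
      if (sources.size() == 1 && sources.get(0).equals(spamAnnotation)) {
        Files.delete(file);
      }
    }
  }
}
}}}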
== Statistics
... one nice thing we could do here is get statistics on
descriptor completeness out of the box: we just count how many
descriptors have @source annotations from known CollecTor mirrors vs.
directory authorities or from wherever we're fetching from. That will
tell us immediately how many descriptors we'd have missed without mirrors.
Fine question. I also thought about that a while ago and concluded (but apparently forgot to mention) that it's fine to simply use metrics-lib's DescriptorCollector to mirror the other instance's recent/ directory and import that. That directory is currently 2.8G and covers roughly 72 hours of data. That would be about 40M of new data per hour (per mirror), which we might be able to shrink by using compression. The upside is that the engineering effort for this solution would be almost trivial, because the code already exists and is used by Onionoo, Metrics, and ExoneraTor.
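For reference, mirroring another instance's recent/ directory with metrics-lib might look roughly like the following sketch; the base URL and local directory are placeholders, and the exact DescriptorCollector parameters should be double-checked against the metrics-lib version in use:
{{{
import java.io.File;
import org.torproject.descriptor.DescriptorCollector;
import org.torproject.descriptor.DescriptorSourceFactory;

public class MirrorRecent {
  public static void main(String[] args) {
    /* Placeholder base URL of the remote CollecTor instance to sync from. */
    String remoteInstance = "https://collector.torproject.org";
    /* Remote directories to mirror; only fetch what this module understands. */
    String[] remoteDirectories = new String[] { "/recent/relay-descriptors/" };
    /* Local directory that mirrors the remote recent/ structure. */
    File localDirectory = new File("sync");
    DescriptorCollector collector =
        DescriptorSourceFactory.createDescriptorCollector();
    /* Fetch files newer than the given timestamp (0L = everything) and
     * delete local files that no longer exist remotely. */
    collector.collectDescriptors(remoteInstance, remoteDirectories, 0L,
        localDirectory, true);
  }
}
}}}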
I admit that your option 2 is very tempting, mostly because designing protocols is fun, but also because it would be the more efficient approach with potentially other benefits we don't get with option 1. But it's also a possible time sink of unknown depth. I'd say (but can be convinced otherwise) that we should go for option 1.
I agree, the protocol option is too much implementation effort. The protocol design could be made simple by copying the existing protocol, but implementing this protocol and having a service up and running all the time answering requests is a lot of work and not really necessary.
Regarding your suggestion of the download option from 'recent', I'm wondering if this could be designed a little more fine-grained, in order to save a bit of bandwidth, processing time, and memory?
Usually only a few descriptors are missing, and it is easy to determine which document to download. For votes and consensuses the download URL can be constructed directly, and for the referenced descriptors it is possible to infer (using a directory listing from the remote CollecTor instance, e.g. /recent/relay-descriptors/extra-infos/) which document, and thus which URL, should provide the missing information.
Would that be a feasible approach?
I agree, the protocol option is too much implementation effort. The protocol design could be made simple by copying the existing protocol, but implementing this protocol and having a service up and running all the time answering requests is a lot of work and not really necessary.
Agreed.
Regarding your suggestion of the download option from 'recent', I'm wondering if this could be designed a little more fine-grained, in order to save a bit of bandwidth, processing time, and memory?
Usually only a few descriptors are missing, and it is easy to determine which document to download. For votes and consensuses the download URL can be constructed directly, and for the referenced descriptors it is possible to infer (using a directory listing from the remote CollecTor instance, e.g. /recent/relay-descriptors/extra-infos/) which document, and thus which URL, should provide the missing information.
Would that be a feasible approach?
My sense is that we shouldn't worry about bandwidth, processing time, and memory yet but instead go for the solution that takes the least engineering effort and is hence potentially more robust.
But I also don't fully understand your suggestion above. Sure, votes and consensuses and in general all files containing just a single descriptor could be skipped just from looking at file name, file size, or file last modified time. But how would we handle files containing dozens or even hundreds of descriptors? It seems that those files would be different in almost all cases, except when two instances download the exact same descriptors in a given hour, which won't happen if one instance reads cached-* descriptors or another instance fetches a missing descriptor from a third instance.
Overall, I think I'd rather want us to keep things simple here for now and think about optimizing later. What do you think?
...
Overall, I think I'd rather want us to keep things simple here for now and think about optimizing later. What do you think?
Yes, agreed. Let's just go with the simple download solution and first focus on the tagging of the descriptors from different sources.
There are several approaches here:
1. Only one @source tag from the direct provider. Implications:
 - a descriptor downloaded from an authority A will be tagged with A's IP,
 - a descriptor downloaded from another CollecTor B, which received it from authority A, will be tagged with B's IP,
 - a descriptor downloaded from another CollecTor B, which received it from CollecTor C, will be tagged with B's IP,
 - a known descriptor downloaded again from a different source will be ignored.
2. Several @source tags (no duplicates) from various direct providers. The implications are the same as above except for the last one: every time a descriptor is seen, a @source tag is added to the descriptor (see the sketch below).
3. Create a structure of source tags that keeps the initial source. This quickly turns into a very complex situation.
Wanting to keep it simple, and considering that at first the trustworthiness of all synchronizing CollecTors is established externally, option 1 or 2 might be the way to go. The design should allow for a later extension to a more complex approach of source designation.
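For illustration, a minimal sketch of option 2: it treats the annotated descriptor as plain text and only adds a @source line when that exact source is not recorded yet. The class and method names are made up for this example:
{{{
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SourceAnnotations {

  /* Add a @source annotation for the given IP address unless the descriptor
   * already carries one for exactly that source (option 2: several @source
   * tags, no duplicates). */
  public static String addSource(String annotatedDescriptor, String sourceIp) {
    String annotation = "@source \"" + sourceIp + "\"";
    List<String> lines =
        new ArrayList<>(Arrays.asList(annotatedDescriptor.split("\n")));
    if (lines.contains(annotation)) {
      return annotatedDescriptor;  /* Source already recorded. */
    }
    /* Insert after any existing annotations (lines starting with '@'). */
    int insertAt = 0;
    while (insertAt < lines.size() && lines.get(insertAt).startsWith("@")) {
      insertAt++;
    }
    lines.add(insertAt, annotation);
    return String.join("\n", lines);
  }
}
}}}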
One question about the initial description:
It's important that we'd only add those @source annotations to
archived descriptors, not to recent descriptors, or we'd serve those
descriptors as new every time we're adding a @source.
How can the sources be determined when archiving?
I haven't fully made up my mind about the following, but maybe it's food for thought:
Recently, gabelmoo's cached-descriptors file contained hundreds of server descriptors that had no corresponding extra-info descriptors. We cannot blame gabelmoo for accepting these valid descriptors, and even if we were to add a @source tag to these descriptors saying they came only from gabelmoo, we wouldn't later go and delete all descriptors by gabelmoo. The real problem is that anyone can produce as many descriptors as they want. Neither of the solutions above (which are based on our previous discussions) would help here.
I believe the only fix is to discard relay and bridge descriptors that are not referenced from votes or consensuses. And I know that I stated earlier that I'd also want to archive other descriptors. But I don't see yet how to achieve both.
From an implementation point of view we could build this in two phases: 1. we fetch from other CollecTor instances and believe everything we get without attaching @source tags, and 2. we create a staging area of some sort where we store descriptors that are not referenced yet and delete them after a week or so unless we see a descriptor we trust that references them. It's probably smart to do 1. first in order to make CollecTor more robust. We'll have to repackage old tarballs anyway after implementing 2., so there's no big rush there.
Again, not sure yet what to do here. Sorry for the confusion, but it seems it's not easy to do this right.
=== The CollecTor side
Maybe CollecTor (or the Metrics Team) needs a data collection and handling policy?
(Or, is there anything like that I didn't find yet other than the license and of course the Tor-wide privacy goals?)
In general, CollecTor shouldn't attempt to make received data better than it is
by dropping unwanted things. At least not without some defined process.
And collected data should only be changed when there is a reason for obfuscation or
when it is enhanced (e.g. by adding the @source tag).
=== Handling of //unwanted// data
Incomplete unreferenced server descs could be stored differently:
referenced server descs can be stored in the way it is done now, and
unreferenced ones can be kept, but stored separately.
The sync process could first concentrate on the referenced descriptors.
=== Regarding the repeated uploads:
What is the reason for all these server descriptors gabelmoo received?
Is there some benign explanation for the uploads?
There are two routers uploading more than 5000 server-descriptors in less than an hour:
These two routers shouldn't upload descriptors again and again.
The descriptors do not differ in relevant fields according to dir-spec.
Is this not a problem that should be tackled on the Tor side?
Maybe we should actually search the old data for more upload frenzies like the one triggering this discussion?
=== The CollecTor side
Maybe CollecTor (or the Metrics Team) needs a data collection and handling policy?
(Or, is there anything like that I didn't find yet other than the license and of course the Tor-wide privacy goals?)
There is no explicit policy like that, but it would be useful to document that in the medium term.
I guess a CollecTor policy would make more sense than one that applies to all metrics-related products, because then we'd have to either enforce that policy for all metrics-related tools or manually confirm that a tool conforms to the policy. Other tools could have their own policies.
In general, CollecTor shouldn't attempt to make received data better than it is
by dropping unwanted things.
Agreed, and a nice way to phrase this. :)
At least not without some defined process.
And collected data should only be changed when there is a reason for obfuscation or
when it is enhanced (e.g. by adding the @source tag).
Look, that's the beginning of a policy! I like that.
=== Handling of //unwanted// data
Incomplete unreferenced server descs could be stored differently:
referenced server descs can be stored in the way it is done now, and
unreferenced ones can be kept, but stored separately.
The sync process could first concentrate on the referenced descriptors.
I'm not sold on this part with respect to the process. I can see how we're switching from a model where we're trusting everyone (all relays and bridges, all directory authorities, all other CollecTor instances) to just a small set of nodes (for example, the set of directory authorities listed in tor.git at a certain point in time). But doing so is a major engineering effort, whereas continuing to trust everyone and risking getting spammed is easy. Also, once we limit trust we can always go through the tarballs and rip out everything we shouldn't have accepted. Hence, I'd say let's handle all data, wanted or unwanted, the same for now.
But in the future, yes, let's consider doing this. Once we do we should talk to ln5 about his plans to apply certificate transparency concepts to create a Tor network data archive, where spam descriptors turned out to be a major issue, too.
=== Regarding the repeated uploads:
What is the reason for all these server descriptors gabelmoo received?
Is there some benign explanation for the uploads?
Probably not. But even if we find the reason and fix this, we cannot undo that it happened in the past, we cannot guarantee that there will be no future bugs like this one, and we cannot prevent malicious relays from flooding the directory authorities with random descriptors without there being a bug. Or did you mean that directory authorities shouldn't accept as many descriptors from a single source? I'm not sure how that would work, and for the directory authorities it's not that much of a problem to get spammed temporarily. So, I think we might not be able to fix our issue with spam descriptors in the tor daemon.
Maybe we should actually search the old data for more upload frenzies like the one triggering this discussion?
We could, but what would we do once we find similar events? When does a malicious descriptor flood begin and what's still expected behavior? I think if we want to solve the descriptor spam problem we'll have to limit ourselves to descriptors published by trusted entities and descriptors referenced from such descriptors directly or indirectly.
Sorry for the long response. It's a difficult problem, it seems.
Moving forward here, after thinking about this problem a bit more. I'd say let's give up on the @source annotation idea and, for now, simply trust whatever we get from other (trusted) CollecTor instances. The goal should be to start syncing data soon to finally turn the single point of failure into many. And if the spam problem turns out to be a real problem, let's solve it. However, let's keep potentially malicious CollecTor instances in mind by taking the following precautions:
Allow the operator not only to configure which CollecTor instances to sync from, but also let them configure which descriptor types to sync from a given instance. This includes looking at synced descriptor contents and skipping unwanted descriptor types (example: bridge descriptor "accidentally" contained in synced relay descriptor files). For example, it makes little sense for the primary CollecTor instance to sync bridge descriptors from anywhere, because it's the only source for them. (Oh, while writing this, please disregard this suggestion if the plan was to limit this feature to relay descriptors anyway.)
Check whether the local instance already contains synced data and only store remote data if it's better than local data. For example, it might be that a remotely obtained consensus contains fewer signatures than the local copy of that consensus, in which case the local copy should be kept. But in some cases it's worth adding parts of remote data or even replacing local data, after being sure that no information gets lost. Requires a per-case consideration. (Note that this enhancement is not specific to syncing from CollecTor mirrors; it also makes sense when fetching from different directory authorities. It just gets even more important now.)
What precautions did I miss? And what else is missing to build this?
The following is a summary of the discussion above and elsewhere, and should give an overview of the first sync-version functionality.
== Functionality and design of descriptor distribution in CollecTor 1.1.0
=== Configuration
General settings
Add a SyncManager configuration in the Scheduler section of the properties file.
Property SyncFolder contains the path for storing the downloaded descriptors.
Choice of sync-sources
Add a configuration property SyncSources containing an array of strings specifying a source name and source URL for each CollecTor instance to retrieve descriptors from. This setup is similar to the current torperf configuration.
Choice of descriptors
Add a configuration property SyncDescriptorLists, which will contain space-separated lists, each of them a comma-separated list consisting of a source name defined in SyncSources followed by the descriptor designations to sync from that source.
Backup of replaced local files
If KeepReplaceBackup is set to true, keep a copy of the old local descriptors in BackupFolder.
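To make the above more concrete, here is a purely hypothetical sketch of how these settings might look in collector.properties; the property names follow the proposal above, but the value syntax (separator characters, source names, URLs) is only illustrative and would need to match the final implementation:
{{{
# Hypothetical example values only; the real syntax may differ.
# Path for storing descriptors fetched from other instances.
SyncFolder = sync
# Source name and base URL for each CollecTor instance to sync from.
SyncSources = main,https://collector.torproject.org second,https://collector.example.org
# Space-separated lists; each list is comma-separated and starts with a
# source name from SyncSources, followed by descriptor designations.
SyncDescriptorLists = main,relay-descriptors,exit-lists second,torperf
# Keep a copy of replaced local descriptors in BackupFolder.
KeepReplaceBackup = true
BackupFolder = backup
}}}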
=== SyncManager
The SyncManager module will be started by the Scheduler according to the configuration defined above.
Each SyncManager run will perform the following steps:
a. Retrieve descriptors from the CollecTor instances defined in SyncSources. These descriptors are stored in SyncFolder under the host part of the instance's URL, e.g. my-sync-folder/collector.torproject.org/recent/exit-lists for exit lists from the main instance.
b. Following retrieval, the fetched descriptors are examined:
i. discard descriptor files that do not contain what they should (see comment:11) and log a warning with sync-source info and reason (see criteria).
i. move valid descriptors (see criteria) without a pre-existing local copy to the local store.
i. if there is a local copy already, decide which copy to keep (see criteria).
I. the local copy is kept: log a debug message with source and reason, and delete the fetched descriptor.
I. local and fetched copies are identical: log a debug message with source and reason, and delete the fetched descriptor.
I. the fetched copy should replace the local descriptor: if KeepReplaceBackup is true, move the local copy to BackupFolder and move the fetched copy to main storage; if false, replace the local copy with the fetched one. In all cases log a debug message with source and reason.
=== Replacement criteria
As the replacement criteria are not fully defined yet, and it is very likely that there will be more criteria in the future, a modular/pluggable approach seems useful, i.e.:
define KeepCriterium and ReplaceCriterium interfaces
register implementing classes with the SyncManager, which will apply these for the selection steps described above.
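A minimal sketch of what these interfaces and one implementation might look like; the interface names follow the proposal above, while the method signatures, the signature-counting ReplaceCriterium, and the registration class are assumptions for illustration:
{{{
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/* Hypothetical criterion interfaces, named as proposed above; method
 * signatures are assumptions for illustration. */
interface KeepCriterium {
  /* Return true if the fetched descriptor should be kept at all. */
  boolean keep(byte[] fetchedDescriptor);
}

interface ReplaceCriterium {
  /* Return true if the fetched descriptor should replace the local copy. */
  boolean replace(byte[] localDescriptor, byte[] fetchedDescriptor);
}

/* Example: keep the consensus copy that carries more directory signatures. */
class MoreSignaturesCriterium implements ReplaceCriterium {
  @Override
  public boolean replace(byte[] localDescriptor, byte[] fetchedDescriptor) {
    return countSignatures(fetchedDescriptor) > countSignatures(localDescriptor);
  }

  private static int countSignatures(byte[] descriptor) {
    int count = 0;
    for (String line
        : new String(descriptor, StandardCharsets.ISO_8859_1).split("\n")) {
      if (line.startsWith("directory-signature ")) {
        count++;
      }
    }
    return count;
  }
}

/* Registration with a (hypothetical) SyncManager that applies the criteria
 * during the selection steps described above. */
class SyncManagerRegistration {
  private final List<KeepCriterium> keepCriteria = new ArrayList<>();
  private final List<ReplaceCriterium> replaceCriteria = new ArrayList<>();

  void register(KeepCriterium criterium) {
    keepCriteria.add(criterium);
  }

  void register(ReplaceCriterium criterium) {
    replaceCriteria.add(criterium);
  }
}
}}}
Keeping the criteria in separate classes keeps the selection logic out of the SyncManager itself and leaves room for making the applied criteria configurable later, as raised in the open questions below.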
== Open Questions
A. Which KeepCriterium and ReplaceCriterium classes should be implemented initially?
Currently there are:
- a ReplaceCriterium: keep the consensus with more signatures, and
- a KeepCriterium: only keep descriptors that contain what they claim to be.
Are there more criteria that should be implemented with release 1.1.0?
A. Should the applied criteria be configurable? E.g. this could be done by listing the classes in collector.properties, but we already have more than fifty config settings, which is a lot.
A. The data combination mentioned in comment:11 part two is not yet considered, but the design will be open to adding this later.
Anyway, some questions: what kind of data enhancement could there be? What about descriptor signatures?
Set to high in order to solve the open questions quickly.
Hmm, the suggested config options would imply that there's only one new sync manager module that syncs all descriptors from the various sources and that runs, say, once per hour? I wonder how to schedule that in a way that it does not interfere with the other modules. So far, modules were pretty much independent, but this new module would create a dependency between modules.
Alternative suggestion: we add four (sets of) configurations, one for each module, that internally re-use the same code for syncing descriptors and for importing them. For example, SyncRelayDescriptors, SyncBridgeDescriptors, SyncExitLists, and SyncTorperfFiles. We could then provide a remote path where to find descriptor files (like /recent/relay-descriptors/) and could implicitly only consider descriptor types that the respective module understands (like RelayServerDescriptor, RelayExtraInfoDescriptor, etc., but not BridgeServerDescriptor).
(If we're worried that there are too many config options already, I'm more than happy to make a list of options that can go away! But this shouldn't mean we should hold back useful new options.)
Here's a potential policy we could apply to decide whether to keep a local or remote descriptor: while syncing, if we find out that a remotely obtained descriptor would be stored under a file name that already exists locally, we always discard that; and while processing descriptors locally, if we find that we already have a file locally with different content, which we likely received while syncing, we always overwrite that. This means that we're only adding data but never replacing data.
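A small sketch of that policy under assumed helper and directory names; the two methods correspond to the syncing case and the local-processing case described above:
{{{
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/* Sketch of the "only add, never replace" policy described above. */
class DescriptorStorePolicy {

  private final Path outDirectory = Paths.get("out");

  /* Case 1: descriptor obtained by syncing from another CollecTor instance;
   * discard it if we already store a file under that name. */
  void storeSynced(String relativeFileName, byte[] descriptor)
      throws IOException {
    Path target = outDirectory.resolve(relativeFileName);
    if (Files.exists(target)) {
      return;  /* Keep the local copy, discard the synced one. */
    }
    Files.createDirectories(target.getParent());
    Files.write(target, descriptor);
  }

  /* Case 2: descriptor downloaded or parsed locally; overwrite any existing
   * file, which was most likely obtained by syncing earlier. */
  void storeDownloaded(String relativeFileName, byte[] descriptor)
      throws IOException {
    Path target = outDirectory.resolve(relativeFileName);
    Files.createDirectories(target.getParent());
    Files.write(target, descriptor);  /* Creates or overwrites. */
  }
}
}}}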
Regarding deleting synced descriptors, we should never do that, but we should rather let DescriptorCollector clean up the local directory when it finds that a local file does not exist anymore remotely.
Here's something else to watch out for while writing this code: whenever we learn descriptors from syncing, we'll have to include them in our /recent/ directory, too. This wasn't entirely clear to me from the description above, so if this was already the plan, never mind.
Thanks for the remarks and suggestions!
I'm replying inline below and have also added a wiki page, CollecTor Sync, that contains the current status of the discussion. Please take a look there to see the entire picture.
Hmm, the suggested config options would imply that there's only one new sync manager module that syncs all descriptors from the various sources and that runs, say, once per hour? I wonder how to schedule that in a way that it does not interfere with the other modules. So far, modules were pretty much independent, but this new module would create a dependency between modules.
You're right, they should stay independent. I intended that, too, but I had a different (more complicated) architecture in mind.
Alternative suggestion: we add four (sets of) configurations, one for each module, that internally re-use the same code for syncing descriptors and for importing them. For example, SyncRelayDescriptors, SyncBridgeDescriptors, SyncExitLists, and SyncTorperfFiles.
Good idea! So we run the sync-function after or instead of the module run (see wiki page for more).
We could then provide a remote path where to find descriptor files (like /recent/relay-descriptors/) and could implicitly only consider descriptor types that the respective module understands (like RelayServerDescriptor, RelayExtraInfoDescriptor, etc., but not BridgeServerDescriptor).
Actually, the directory structure of a CollecTor's 'recent' is given, i.e. the different mirrors won't or shouldn't use a different directory structure than the main instance. So, it suffices to activate the module and set the sync or sync-only option. The path structure for the actual download is determined: straightforward paths for torperf and exit lists, and a more complex structure for bridge and relay descriptors.
Here's a potential policy we could apply to decide whether to keep a local or remote descriptor: while syncing, if we find out that a remotely obtained descriptor would be stored under a file name that already exists locally, we always discard that;...
So, //while syncing// means while retrieving descriptors from a different instance and writing them to the local SyncFolder structure. And during this process, descriptors already available in the sync folder are not replaced.
... and while processing descriptors locally, if we find that we already have a file locally with different content, which we likely received while syncing, we always overwrite that. This means that we're only adding data but never replacing data.
This refers to the process of comparing the descriptors fetched from remote instances with descriptors already in the 'recent' folder of the syncing instance? Such local descriptors could have been obtained by direct download or a different syncing operation. Did I miss something here?
Regarding deleting synced descriptors, we should never do that, but we should rather let DescriptorCollector clean up the local directory when it finds that a local file does not exist anymore remotely.
True, if this refers to descriptors in the SyncFolder.
Here's something else to watch out for while writing this code: whenever we learn descriptors from syncing, we'll have to include them in our /recent/ directory, too. This wasn't entirely clear to me from the description above, so if this was already the plan, never mind.
That was intended, but should be clearly stated; will be added to the wiki page.
Alright, I read the wiki page and the comment above. Just two clarifications of what I meant above:
The only code with permission to write to (and delete from) SyncFolder should be DescriptorCollector, which would also delete files as soon as they disappear remotely. We shouldn't move away or delete files while going through that directory and looking at descriptors, because that would mean that DescriptorCollector would have to download them again next time. Every time it runs.
There will now be two cases where we want to write a descriptor and need to check if we already have it: 1) when downloading it locally and 2) when syncing from another CollecTor instance. In the first case, if a file already exists and has different contents, we now overwrite it. It could be a descriptor we synced from another instance in a previous run or obtained earlier by downloading it from somewhere. Note that we're looking at our out/ directory to decide whether we already have a descriptor, not at our recent/ directory. In the second case, we simply don't store the descriptor. I think that's fine as initial strategy.
If this all makes sense, feel free to work on the code, and I'll take a look once there's something to review. If it doesn't make sense yet, feel free to ask more questions. Thanks!
Please review the seven commits on top of this branch.
There are two new packages:
sync for the sync-merge functionality, and persist as a new modular way to persist descriptors (currently to the file system, but this could be extended or changed in the future). The latter should step by step be used for all persisting of descriptors, i.e. replace the store* methods throughout the various modules. (That is useful for removing the tight circular coupling of ArchiveWriter, DescriptorDownloader, and DescriptorParser, for example.)
Persisting is based on DescriptorPersistence, which defines methods for storing. The classes extending DescriptorPersistence just need to define the explicit storage path. For convenience, PersistenceUtils provides date-time to string methods, so it covers code that is currently repeated throughout the code base.
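To illustrate the intended pattern (not the actual classes in the branch), a hypothetical subclass might only have to supply the storage paths, while the base class owns the generic storing logic; all names and paths below are made up:
{{{
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

/* Hypothetical base class following the described idea: it owns the generic
 * storing logic and only asks subclasses for the storage paths. */
abstract class DescriptorPersistenceSketch {

  /* Relative path below recent/ for this descriptor type. */
  abstract String recentPath(Date published, String digest);

  /* Relative path below out/ for this descriptor type. */
  abstract String archivePath(Date published, String digest);

  void store(byte[] descriptor, Date published, String digest) {
    /* Shared write-to-recent-and-archive logic would live here, instead of
     * being repeated in every module's store* methods. */
  }
}

/* Example subclass: only the path layout is descriptor-type specific. */
class ExitListPersistenceSketch extends DescriptorPersistenceSketch {

  private static final SimpleDateFormat FILE_FORMAT =
      new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss", Locale.US);

  @Override
  String recentPath(Date published, String digest) {
    return "exit-lists/" + FILE_FORMAT.format(published);
  }

  @Override
  String archivePath(Date published, String digest) {
    return "exit-lists/" + FILE_FORMAT.format(published);
  }
}
}}}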
CollecTorMain extends SyncManager in a way that all synchronization options can be configured during runtime, i.e. syncing of a module can be turned on or off and sources can be changed without restart.
(I'll add package-info later for the two packages.)