Opened 3 years ago

Closed 2 years ago

#18910 closed enhancement (implemented)

distributing descriptors accross CollecTor instances

Reported by: iwakeh Owned by: iwakeh
Priority: High Milestone: CollecTor 1.1.0
Component: Metrics/CollecTor Version:
Severity: Normal Keywords: ctip
Cc: karsten Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Karsten's suggestions:

Spam Prevention

... a working solution for fetching descriptors from other CollecTor hosts without risking being spammed forever: we simply add *multiple* @source annotations to a descriptor, one for each source (directory authority IP, other CollecTor host IP,
etc.). If we later find out that one source was spamming us, we can easily delete all descriptors that *only* have the @source annotation with the spamming host's IP address.

Here's an example how the tor daemon annotates descriptors:

@uploaded-at 2016-04-18 18:49:25
@source "81.17.16.43"
router pairoj 81.17.16.43 443 0 80
platform Tor 0.2.6.10 on Linux
[...]

It's important that we'd only add those @source annotations to
archived descriptors, not to recent descriptors, or we'd serve those
descriptors as new every time we're adding a @source.

It would also be useful to have stats on the number of newly added
@source annotations per hour, so that we learn if we're getting
spammed, and to have a script for deleting descriptors that only have
a given @source annotation.

Statistics

... one nice thing we could do here is get statistics on
descriptor completeness out of the box: we just count how many
descriptors have @source annotations from known CollecTor mirrors vs.
directory authorities or from wherever we're fetching from. That will
tell us immediately how many descriptors we'd have missed without mirrors.

Child Tickets

Change History (102)

comment:1 Changed 3 years ago by iwakeh

Owner: set to iwakeh
Status: newassigned

The synchronization method still needs to be defined.

  1. direct download
  2. a request protocol similar to the requests directed at the authorities

Further options?

comment:2 Changed 3 years ago by karsten

Fine question. I also thought about that a while ago and concluded (but apparently forgot to mention) that it's fine to simply use metrics-lib's DescriptorCollector to mirror the other instance's recent/ directory and import that. That directory is currently 2.8G and covers roughly 72 hours of data. That would be about 40M new data per hour (per mirror), which we might be able to shrink when using compression. The upside is that the engineering effort for this solution would be almost trivial, because the code already exists and is used by Onionoo, Metrics, and ExoneraTor.

I admit that your option 2 is very tempting, mostly because designing protocols is fun, but also because it would be the more efficient approach with potentially other benefits we don't get with option 1. But it's also a possible time sink of unknown depth. I'd say (but can be convinced otherwise) that we should go for option 1.

comment:3 Changed 3 years ago by iwakeh

Status: assignedneeds_information

I agree, the protocol option is too much implementation effort. The protocol design could be made simple by copying the existing protocol, but implementation of this protocol and having a service up and running all the time answering requests is a lot work and not really necessary.

Regarding you're suggestion for the download-option from 'recent' I'm wondering if this could be designed a little more fine grained, in order to save a bit bandwidth, processing time, and memory?
Usually there are only a few descriptors missing and it is easy to determine which document to download. For votes and consensus the download url can be constructed directly and for the referenced descriptors it is possible to infer (using a directory listing from the remote collector instance, e.g. <other instance>/recent/relay-descriptors/extra-infos/) which doc respective url should provide the missing information.
Would that be a feasible approach?

comment:4 in reply to:  3 ; Changed 3 years ago by karsten

Replying to iwakeh:

I agree, the protocol option is too much implementation effort. The protocol design could be made simple by copying the existing protocol, but implementation of this protocol and having a service up and running all the time answering requests is a lot work and not really necessary.

Agreed.

Regarding you're suggestion for the download-option from 'recent' I'm wondering if this could be designed a little more fine grained, in order to save a bit bandwidth, processing time, and memory?
Usually there are only a few descriptors missing and it is easy to determine which document to download. For votes and consensus the download url can be constructed directly and for the referenced descriptors it is possible to infer (using a directory listing from the remote collector instance, e.g. <other instance>/recent/relay-descriptors/extra-infos/) which doc respective url should provide the missing information.
Would that be a feasible approach?

My sense is that we shouldn't worry about bandwidth, processing time, and memory yet but instead go for the solution that takes the least engineering effort and is hence potentially more robust.

But I also don't fully understand your suggestion above. Sure, votes and consensuses and in general all files containing just a single descriptor could be skipped just from looking at file name, file size, or file last modified time. But how would we handle files containing dozens or even hundreds of descriptors? It seems that those files would be different in almost all cases, except when two instances download the exact same descriptors in a given hour, which won't happen if one instance reads cached-* descriptors or another instance fetches a missing descriptor from a third instance.

Overall, I think I'd rather want us to keep things simple here for now and think about optimizing later. What do you think?

comment:5 in reply to:  4 Changed 3 years ago by iwakeh

Replying to karsten:

...
Overall, I think I'd rather want us to keep things simple here for now and think about optimizing later. What do you think?

Yes , agreed. Let's just go with the simple download solution and first focus on the tagging of the descriptors from different sources.

There are several approaches here:

  1. Only one @source tag from the direct provider. Implications:
    • a descriptor downloaded from an authority A will be tagged with A's IP.
    • a descriptor downloaded from another CollecTor B, which received it from authority A, will be tagged with B's IP.
    • a descriptor downloaded from another CollecTor B, which received it from CollecTor C, will be tagged with B's IP.
    • a known descriptor downloaded again from a different source will be ignored.
  2. Several @source tags (no duplicates) from various direct providers. Implications hereof are the same as above except for the last one: every time a descriptor is seen a @source tag is added to the descriptor.
  3. Create a structure of source tags that keeps the initial source. This quickly turns into a very complex situation.

Wanting to keep it simple and considering that at first the trustworthiness of all synchronizing CollecTors is established externally 1. or 2. might be the way to go. The design should allow for a later extension and more complex approach of source designation.

One question about the initial description:

It's important that we'd only add those @source annotations to
archived descriptors, not to recent descriptors, or we'd serve those
descriptors as new every time we're adding a @source.

How can the sources be determined when archiving?

comment:6 Changed 3 years ago by karsten

I haven't fully made up my mind about the following, but maybe it's food for thought:

Recently, gabelmoo's cached-descriptors file contained hundreds of server descriptors that had no corresponding extra-info descriptors. We cannot blame gabelmoo for accepting these valid descriptors, and even if we were to add a @source tag to these descriptors saying they came only from gabelmoo, we wouldn't later go and delete all descriptors by gabelmoo. The real problem is that anyone can produce as many descriptors as they want. Neither of the solutions above (which are based on our previous discussions) would help here.

I believe the only fix is to discard relay and bridge descriptors that are not referenced from votes or consensuses. And I know that I stated earlier that I'd also want to archive other descriptors. But I don't see yet how to achieve both.

From an implementation point of view we could build this in two phases: 1. we fetch from other CollecTor instances and believe everything we get without attaching @source tags, and 2. we create a staging area of some sort where we store descriptors that are not referenced yet and delete them after a week or so unless we see a descriptor we trust that references them. It's probably smart to do 1. first in order to make CollecTor more robust. We'll have to repackage old tarballs anyway after implementing 2., so there's no big rush there.

Again, not sure yet what to do here. Sorry for the confusion, but it seems it's not easy to do this right.

comment:7 Changed 3 years ago by iwakeh

Some thoughts:

The CollecTor side

Maybe CollecTor (or the Metrics Team) needs a data collection and handling policy?
(Or, is there anything like that I didn't find yet other than the license and of course the Tor-wide privacy goals?)

In general, CollecTor shouldn't attempt to make received data better than it is
by dropping unwanted things. At least not without some defined process.
And collected data should only be changed when there is a reason for obfuscation or
when it is enhanced (e.g. by adding the @source tag).

Handling of unwanted data

Incomplete unreferenced server descs could be stored differently:

  • referenced server descs can be stored in the way it is done now and
  • unreferenced can be kept, but stored seperately.

The synch-process could first concentrate on the referenced descriptors.

Regarding the repeated uploads:

What is the reason for all these server descriptors gabelmoo received?
Is there some benign explanation for the uploads?

There are two routers uploading more than 5000 server-descriptors in less than an hour:

router ThePuppetMasterIN 94.23.181.19 9001 0 9030
router ThePuppetMasterMID 94.23.181.18 9001 0 9030

grep -c "PuppetMasterIN" /tmp/2016-06-23-11-05-13-server-descriptors 
1800
grep -c "PuppetMasterMID" /tmp/2016-06-23-11-05-13-server-descriptors 
3596

These two routers shouldn't upload descriptors again and again.
The descriptors do not differ in relevant fields according to dir-spec.
Is this not a problem that should be tackled on the Tor side?

Maybe, we should actually search the old data for more upload frencies like the one triggering this discussion?

comment:8 in reply to:  7 Changed 3 years ago by karsten

Replying to iwakeh:

Some thoughts:

The CollecTor side

Maybe CollecTor (or the Metrics Team) needs a data collection and handling policy?
(Or, is there anything like that I didn't find yet other than the license and of course the Tor-wide privacy goals?)

There is no explicit policy like that, but it would be useful to document that in the medium term.

I guess a CollecTor policy would make more sense than one that applies to all metrics-related products, because then we'd have to either enforce that policy for all metrics-related tools or manually confirm that a tool conforms to the policy. Other tools could have their own policies.

In general, CollecTor shouldn't attempt to make received data better than it is
by dropping unwanted things.

Agreed, and a nice way to phrase this. :)

At least not without some defined process.
And collected data should only be changed when there is a reason for obfuscation or
when it is enhanced (e.g. by adding the @source tag).

Look, that's the beginning of a policy! I like that.

Handling of unwanted data

Incomplete unreferenced server descs could be stored differently:

  • referenced server descs can be stored in the way it is done now and
  • unreferenced can be kept, but stored seperately.

The synch-process could first concentrate on the referenced descriptors.

I'm not sold on this part with respect to the process. I can see how we're switching from a model where we're trusting everyone (all relays and bridges, all directory authorities, all other CollecTor instances) to just a small set of nodes (for example, the set of directory authorities listed in tor.git at a certain point in time). But doing so is a major engineering effort, whereas continuing to trust everyone and risking to get spammed is easy. Also, once we limit trust we can always go through the tarballs and rip out everything we shouldn't have accepted. Hence, I'd say let's handle all data, wanted or unwanted, the same for now.

But in the future, yes, let's consider doing this. Once we do we should talk to ln5 about his plans to apply certificate transparency concepts to create a Tor network data archive, where spam descriptors turned out to be a major issue, too.

Regarding the repeated uploads:

What is the reason for all these server descriptors gabelmoo received?
Is there some benign explanation for the uploads?

Probably not. But even if we find the reason and fix this, we cannot undo that it happened in the past, we cannot guarantee that there will be no future bugs like this one, and we cannot prevent malicious relays from flooding the directory authorities with random descriptors without there being a bug. Or did you mean that directory authorities shouldn't accept as many descriptors from a single source? I'm not sure how that would work, and for the directory authorities it's not that much of a problem to get spammed temporarily. So, I think we might not be able to fix our issue with spam descriptors in the tor daemon.

Maybe, we should actually search the old data for more upload frencies like the one triggering this discussion?

We could, but what would we do once we find similar events? When does a malicious descriptor flood begin and what's still expected behavior? I think if we want to solve the descriptor spam problem we'll have to limit ourselves to descriptors published by trusted entities and descriptors referenced from such descriptors directly or indirectly.

Sorry for the long response. It's a difficult problem, it seems.

comment:9 Changed 3 years ago by iwakeh

Milestone: CollecTor 1.1.0

Added to second release milestone.

comment:10 Changed 3 years ago by iwakeh

depends on #19791

comment:11 Changed 3 years ago by karsten

Moving forward here, after thinking about this problem a bit more. I'd say let's give up on the @source annotation idea and, for now, simply trust whatever we get from other (trusted) CollecTor instances. The goal should be to start syncing data soon to finally turn the single point of failure into many. And if the spam problem turns out to be a real problem, let's solve it. However, let's keep potentially malicious CollecTor instances in mind by taking the following precautions:

  • Allow the operator not only to configure which CollecTor instances to sync from, but also let them configure which descriptor types to sync from a given instance. This includes looking at synced descriptor contents and skipping unwanted descriptor types (example: bridge descriptor "accidentally" contained in synced relay descriptor files). For example, it makes little sense for the primary CollecTor instance to sync bridge descriptors from anywhere, because it's the only source for them. (Oh, while writing this, please disregard this suggestion if the plan was to limit this feature to relay descriptors anyway.)
  • Check whether the local instance already contains synced data and only store remote data if it's better than local data. For example, it might be that a remotely obtained consensus contains fewer signatures than the local copy of that consensus, in which case the local copy should be kept. But in some cases it's worth adding parts of remote data or even replace local data, after being sure that no information gets lost. Requires a per-case consideration. (Note that this enhancement is not specific to syncing from CollecTor mirrors but that it also makes sense to make it for fetching from different directory authorities. It just gets even more important now.)

What precautions did I miss? And what else is missing to build this?

comment:12 Changed 3 years ago by iwakeh

depends on #19934 (and #19791 as mentioned above)

comment:13 Changed 3 years ago by iwakeh

Priority: MediumHigh

The following is a summary of the discussion above and elsewhere, and should give an overview of the first sync-version functionality.

Functionality and design of descriptor distribution in CollecTor 1.1.0

Configuration

  1. General settings Add a SyncManager configuration in the Scheduler section of the properties file. Property SyncFolder contains the path for storing the downloded descriptors.
  2. Choice of sync-sources Add a configuration property SyncSources containing an array of strings specifying a source name and source URL for each CollecTor instance to retireve descriptors from. This setup is similar to the current torperf configuration.
  3. Choice of descriptors Add a configuration property SyncDescriptorLists, which will contain comma separated lists (separated by space) with a source name defined in SyncSources and a list of descriptor designations.
  4. Backup of replaced local files if KeepReplaceBackup is set to true, keep a copy of the old local descriptors in BackupFolder.

SyncManager

The SyncManager module will be started by the Scheduler accordinng to the configuration defined above.
Each SyncManager run will perform the following steps:

  1. Retrieve descriptors from the CollecTor instances defined in SyncSources. These descriptors are stored in SyncFolder under the host part of the instance's url, e.g. my-sync-folder/collector.torproject.org/recent/exit-lists for exitlists from the main instance.
  2. Following retrieval the fetched descriptors are examined:
    1. discard descriptor files that do not contain what they should (see comment:11) and log a warning with sync-source info and reason (see criteria).
    2. move valid descriptors (see criteria) without a pre-existing local copy to the localstore.
    3. if there is a local copy already, decide which copy to keep (see criteria).
      1. local copy is kept, log debug message with source and reason and delete fetched descriptor.
      2. local and fetched are identical, log debug message with source and reason and delete fetched descriptor.
      3. fetched copy should replace local descriptor. Depending on KeepReplaceBackup move local copy to BackupFolder and move fetched copy to main storage. If KeepReplaceBackup is false, replace local copy by fetched. In all cases log debug message with source and reason.

Replacement criteria

As the replacement criteria are not fully defined yet and it is very likely that there will be more criteria in future a modular/pluggable approach seems useful, i.e.:

  1. define KeepCriterium and ReplaceCriterium interfaces
  2. register implementing classes with the SyncManager, which will apply these for the selection steps described above.

Open Questions

  1. Which KeepCriterium and ReplaceCriterium classes shuld be implemented initially? currently there are
    1. a ReplaceCriterium keep the consensus with more signatures and
    2. a KeepCriterium only keep descriptors that contain what they claim to be.
    3. More criteria that should be implemented with release 1.1.0?
  2. Should the applied criteria be configurable? E.g. this could be done by listing the classes in collector.properties, but we have already more than fifty config settings, which is a lot.
  3. The data combination mentioned in comment:11 part two is not yet considered, but the design will be open to add this later. Anyway some questions: What kind of data enhancement could be there? What about descriptor signatures?

Set to high in order to solve the open questions quickly.

comment:14 in reply to:  13 ; Changed 3 years ago by karsten

Hmm, the suggested config options would imply that there's only one new sync manager module that syncs all descriptors from the various sources and that runs, say, once per hour? I wonder how to schedule that in a way that it does not interfere with the other modules. So far, modules were pretty much independent, but this new module would create a dependency between modules.

Alternative suggestion: we add four (sets of) configurations, one for each module, that internally re-use the same code for syncing descriptors and for importing them. For example, SyncRelayDescriptors, SyncBridgeDescriptors, SyncExitLists, and SyncTorperfFiles. We could then provide a remote path where to find descriptor files (like /recent/relay-descriptors/) and could implictly only consider descriptor types that the respective module understands (like RelayServerDescriptor, RelayExtraInfoDescriptor, etc., but not BridgeServerDescriptor).

(If we're worried that there are too many config options already, I'm more than happy to make a list of options that can go away! But this shouldn't mean we should hold back useful new options.)

Here's a potential policy we could apply to decided whether to keep a local or remote descriptor: while syncing, if we find out that a remotely obtained descriptor would be stored under a file name that already exists locally, we always discard that; and while processing descriptors locally, if we find that we already have a file locally with different content, which we likely received while syncing, we always overwrite that. This means that we're only adding data but never replacing data.

Regarding deleting synced descriptors, we should never do that, but we should rather let DescriptorCollector clean up the local directory when it finds that a local file does not exist anymore remotely.

Here's something else to watch out for while writing this code: whenever we learn descriptors from syncing, we'll have to include them in our /recent/ directory, too. This wasn't entirely clear to me from the description above, so if this was already the plan, never mind.

comment:15 in reply to:  14 Changed 3 years ago by iwakeh

Thanks for the remarks and suggestions!
I'm replying inline below and also add a wiki page CollecTor Sync that contains the current status of the discussion. Please, take a look there to see the entire picture.

Replying to karsten:

Hmm, the suggested config options would imply that there's only one new sync manager module that syncs all descriptors from the various sources and that runs, say, once per hour? I wonder how to schedule that in a way that it does not interfere with the other modules. So far, modules were pretty much independent, but this new module would create a dependency between modules.

You're right, they should stay independent. I intended that, too, but I had a different (more complicated) architecture in mind.

Alternative suggestion: we add four (sets of) configurations, one for each module, that internally re-use the same code for syncing descriptors and for importing them. For example, SyncRelayDescriptors, SyncBridgeDescriptors, SyncExitLists, and SyncTorperfFiles.

Good idea! So we run the sync-function after or instead of the module run (see wiki page for more).

We could then provide a remote path where to find descriptor files (like /recent/relay-descriptors/) and could implictly only consider descriptor types that the respective module understands (like RelayServerDescriptor, RelayExtraInfoDescriptor, etc., but not BridgeServerDescriptor).

Actually, the directory structure of a CollecTor's 'recent' is given, i.e. the different mirrors won't or shouldn't use a different directory sructure than the main instance. So, it suffices to activate the module and set the sync or sync-only option. The path structure for the actual download is determined. The straightforward paths for torperf and exitlists and the more complex structure for bridge- and relay-descriptors.

Here's a potential policy we could apply to decided whether to keep a local or remote descriptor: while syncing, if we find out that a remotely obtained descriptor would be stored under a file name that already exists locally, we always discard that;...

So, while syncing means while retrieving descriptors from a different instance and writing them to the local SyncFolder structure. And, during this process descriptors already available in the sync-folder are not replaced.

... and while processing descriptors locally, if we find that we already have a file locally with different content, which we likely received while syncing, we always overwrite that. This means that we're only adding data but never replacing data.

This refers to the process of comparing the descriptors fetched from remote instances with descriptors already in the 'recent' folder of the syncing instance? Such local descriptors could have been obtained by direct download or a different syncing operation. Did I miss something here?

Regarding deleting synced descriptors, we should never do that, but we should rather let DescriptorCollector clean up the local directory when it finds that a local file does not exist anymore remotely.

True, if this refers to descriptors in the SyncFolder.

Here's something else to watch out for while writing this code: whenever we learn descriptors from syncing, we'll have to include them in our /recent/ directory, too. This wasn't entirely clear to me from the description above, so if this was already the plan, never mind.

That was intended, but should be clearly stated; will be added to the wiki page.

Hope I don't see things too complicated.

comment:16 Changed 3 years ago by karsten

Alright, I read the wiki page and the comment above. Just two clarifications of what I meant above:

  • The only code with permission to write to (and delete from) SyncFolder should be DescriptorCollector, which would also delete files as soon as they disappear remotely. We shouldn't move away or delete files while going through that directory and looking at descriptors, because that would mean that DescriptorCollector would have to download them again next time. Every time it runs.
  • There will now be two cases where we want to write a descriptor and need to check if we already have it: 1) when downloading it locally and 2) when syncing from another CollecTor instance. In the first case, if a file already exists and has different contents, we now overwrite it. It could be a descriptor we synced from another instance in a previous run or obtained earlier by downloading it from somewhere. Note that we're looking at our out/ directory to decide whether we already have a descriptor, not at our recent/ directory. In the second case, we simply don't store the descriptor. I think that's fine as initial strategy.

If this all makes sense, feel free to work on the code, and I'll take a look once there's something to review. If it doesn't make sense yet, feel free to ask more questions. Thanks!

comment:17 Changed 3 years ago by iwakeh

Status: needs_informationassigned

Thanks for the detailed replies!
Setting to assigned.

comment:18 Changed 3 years ago by iwakeh

Please review the seven commits on top of this branch.

There are two new packages:
sync for sync-merge functionality and persist as new modular way to persist descriptors (currently file-system, but could be extended or changed in future). The latter should step by step be used for all persisting of descriptors, i.e. be used instead of the store* methods throughout the various modules. (that is useful for removing the tight circular coupling of ArchiveWriter, DescriptorDownloader and DescriptorParser for example).

Persisting is based on DescriptorPersistence defining methods for storing. The classes extending DescriptorPersistence just need to define the explicit storage path. For convenience PersistenceUtilsprovides date-time to string methods. Thus, providing methods for code that is repeatedly defined in the current code base.

CollecTorMain extends SyncManager in a way that all synchronization options can be configured during runtime, i.e. syncing of a module can be turned on or off and sources can be changed without restart.

(I'll add package-info later for the two packages.)

comment:19 Changed 3 years ago by iwakeh

Status: assignedneeds_review

comment:20 Changed 3 years ago by karsten

Okay, I started reviewing those seven commits, but I'm running out of brains here. However, before getting back to reviewing tomorrow, I figured I could share my thoughts on the first of the seven commits, 16e1c84:

  • It seems we're deprecating a few config options by adding a comment to the default collector.properties and not paying attention to those options in the code anymore. Shouldn't we instead remove those config options entirely, so that operators notice for sure that they need to change these config options? We could mention this in the change log as medium change.
  • Speaking of, can you include a change log entry for this commit?
  • I wonder if we could simplify the configuration by avoiding that tri-state SyncType option and turning it into a boolean. Consider ImportCachedRelayDescriptors, ImportDirectoryArchives, and DownloadRelayDescriptors in the relaydescs module. We could just add a fourth option SyncRelayDescriptors that can be true or false. Basically, syncing descriptors from other CollecTor instances would be the fourth source for collecting relay descriptors. This shouldn't change much in the code you wrote, but it might make things a bit simpler for future operators. For bridge descriptors and exit lists there would have to be two new options to activate the current sources, that is, sanitize bridge descriptors found in a local directory or download exit lists from the exit list server.
  • I found at least one long line that checkstyle should complain about, though I didn't run it myself.

In case you want to start working on any of these comments, can you please write --fixup commits that resolve those issues in this particular commit? In any case, please don't modify commits 2 to 7 in that branch at this point, because I already started reviewing those.

More tomorrow. Thanks!

comment:21 in reply to:  20 ; Changed 3 years ago by iwakeh

Replying to karsten:

...

  • It seems we're deprecating a few config options by adding a comment to the default collector.properties and not paying attention to those options in the code anymore. Shouldn't we instead remove those config options entirely, so that operators notice for sure that they need to change these config options? We could mention this in the change log as medium change.

This is only a warning that these will be deprecated. They're still used, but already adressed in #20162. But they should be removed here. That wouldn't be a small commit, though, and I'd rather have it separate. I'll add a new branch just up to that commit later today.

  • Speaking of, can you include a change log entry for this commit?

Sure!

  • I wonder if we could simplify the configuration by avoiding that tri-state SyncType option and turning it into a boolean. Consider ImportCachedRelayDescriptors, ImportDirectoryArchives, and DownloadRelayDescriptors in the relaydescs module. We could just add a fourth option SyncRelayDescriptors that can be true or false. Basically, syncing descriptors from other CollecTor instances would be the fourth source for collecting relay descriptors. This shouldn't change much in the code you wrote, but it might make things a bit simpler for future operators. For bridge descriptors and exit lists there would have to be two new options to activate the current sources, that is, sanitize bridge descriptors found in a local directory or download exit lists from the exit list server.

Sync-Options (this description should be added to package-info, when we agree):
There are currently three modules that can be synced: relaydescs, bridgedescs, and exitlists.

A module can be turned on or off via its *Activated option, which is only configurable at start-up and determines if anything is run from this module.

The runtime configurable tri-state Sync* option values are

  • NoSync: for simply running the module w/o fetching additional data. Useful when just the directly (from the Tor network) accessible data is of interest, or when having access to bridgedescs etc.
  • Sync performs the module run and then fetches data from other instances.
  • SyncOnly just fetches data from other instances. Useful for mirrors that don't have access to bridgedescs, or that fetch from a main instance they trust and don't have net access otherwise, etc.

Sync* options can be adapted during runtime. It is possible to switch to syncing and then turn it off during runtime.

So, reducing the tri-state would complicate the combination of 'activated' and sync-settings, which is separated from sync, i.e. Scheduler doesn't and shouldn't now about syncing.

  • I found at least one long line that checkstyle should complain about, though I didn't run it myself.

Oh, I usually run this at the last commit. Will check again.

In case you want to start working on any of these comments, can you please write --fixup commits that resolve those issues in this particular commit? In any case, please don't modify commits 2 to 7 in that branch at this point, because I already started reviewing those.

As said above, I'll add another branch with commits we agree on and in a way that the commits make sense in the final master version.

More tomorrow. Thanks!

Thanks, for reading your way through all that un-commented code!

comment:22 in reply to:  21 ; Changed 3 years ago by karsten

Replying to iwakeh:

Replying to karsten:

[...] Shouldn't we instead remove those config options entirely, [...]

This is only a warning that these will be deprecated. They're still used, but already adressed in #20162. But they should be removed here. That wouldn't be a small commit, though, and I'd rather have it separate. I'll add a new branch just up to that commit later today.

Okay. I was confused by the comment saying "only OutputPath is used". Looking forward to the new branch you're mentioning.

  • Speaking of, can you include a change log entry for this commit?

Sure!

Thanks!

[...] I wonder if we could simplify the configuration by avoiding that tri-state SyncType option and turning it into a boolean. [...]

Sync-Options (this description should be added to package-info, when we agree):
There are currently three modules that can be synced: relaydescs, bridgedescs, and exitlists.

A module can be turned on or off via its *Activated option, which is only configurable at start-up and determines if anything is run from this module.

The runtime configurable tri-state Sync* option values are

  • NoSync: for simply running the module w/o fetching additional data. Useful when just the directly (from the Tor network) accessible data is of interest, or when having access to bridgedescs etc.
  • Sync performs the module run and then fetches data from other instances.
  • SyncOnly just fetches data from other instances. Useful for mirrors that don't have access to bridgedescs, or that fetch from a main instance they trust and don't have net access otherwise, etc.

Sync* options can be adapted during runtime. It is possible to switch to syncing and then turn it off during runtime.

So, reducing the tri-state would complicate the combination of 'activated' and sync-settings, which is separated from sync, i.e. Scheduler doesn't and shouldn't now about syncing.

Hmm, I don't understand how the change I suggested would affect the ability to reconfigure at runtime. I admit I didn't read that code yet (which comes in later commits, I think), but in theory it should still be possible to enable or disable syncing if it's yet another descriptor source.

Here's how I'd explain configuring the relaydescs module to a new operator: "Want to collect relay descriptors? Sure, just activate that module and also enable one or more descriptor sources: download from the directory authorities, read from a local directory, read from local cached descriptors, synchronize from other CollecTor instances. If you don't enable any descriptor sources, you won't get any descriptors."

This feels easier than: "Want to collect relay descriptors? Sure, just activate that module and also enable between zero and three descriptors sources, and also consider synchronizing from other CollecTor instances."

The three modules could still use the same code for synchronizing, so I'm not asking to change much code here. But I'm worried that the tri-state config option will be harder to grasp for operators than it has to be.

  • I found at least one long line that checkstyle should complain about, though I didn't run it myself.

Oh, I usually run this at the last commit. Will check again.

Okay, no rush. :)

In case you want to start working on any of these comments, can you please write --fixup commits that resolve those issues in this particular commit? In any case, please don't modify commits 2 to 7 in that branch at this point, because I already started reviewing those.

As said above, I'll add another branch with commits we agree on and in a way that the commits make sense in the final master version.

Thanks!

More tomorrow. Thanks!

Thanks, for reading your way through all that un-commented code!

Still reviewing the other other commits...

comment:23 in reply to:  22 ; Changed 3 years ago by iwakeh

Replying to karsten:

...
Here's how I'd explain configuring the relaydescs module to a new operator: "Want to collect relay descriptors? Sure, just activate that module and also enable one or more descriptor sources: download from the directory authorities, read from a local directory, read from local cached descriptors, synchronize from other CollecTor instances. If you don't enable any descriptor sources, you won't get any descriptors."

This feels easier than: "Want to collect relay descriptors? Sure, just activate that module and also enable between zero and three descriptors sources, and also consider synchronizing from other CollecTor instances."

This way of configuring only works for the current relaydescs module (whose properties are supposed to be removed partially in #20162).
The other two modules, bridgedescs and exitlists, don't have any additional properties that could be used for avoiding the 'SyncOnly' option.
In general, it is nicer to have a configuration value that expresses what it means. The different download options for the relaydescs module where initially quite confusing.
But maybe, I complicate things? So what other 'streamlined' way could achieve the SyncOnly setting?

comment:24 Changed 3 years ago by iwakeh

Here is the second branch, so far only the properties changes.

Checkstyle doesn't complain about these changes.

comment:25 Changed 3 years ago by karsten

Alright, here's a review of f4026fc:

  • All classes in the new persist package contains unused imports or unused attributes. Can you remove those in a fixup commit?
  • Can we somehow use printf patterns for putting together paths to make that bunch of DescriptorPersistence classes easier to read? It should be possible to express all paths using placeholders for dates/times, strings, and chars. And if we care about path separators being platform independent, we could split the resulting path at / characters and put it back together using Paths.get(). Untested example for a bridge extra-info descriptor:
String.format("bridge-descriptors/%tY/%<tm/extra-infos/%c/%c/%s",
    desc.getPublishedMillis(),
    desc.getExtraInfoDigest().charAt(0),
    desc.getExtraInfoDigest().charAt(1),
    desc.getExtraInfoDigest());
  • Building upon the previous idea, maybe we can avoid having DescriptorPersistence classes at all if all we need is a type annotation and two printf patterns. We could define an order of printf arguments for all patterns, like: published, received, source, digest, first char of digest, second char of digest. That would really save a few lines of code, wouldn't it? Maybe something for later?
  • Half of the paths generated by DescriptorPersistence subclasses do not match paths as they are currently implemented. I only read the code and did not run it, but I believe the following actual paths would not have been generated by these classes:
./out/bridge-descriptors/2016/10/extra-infos/0/0/0000000001764ef8b8b5bc9ed70b9e99225112ffd04
./out/bridge-descriptors/2016/10/server-descriptors/1/1/112fd90a05866092a50a4ab5b1e07ee1749259fe
./out/bridge-descriptors/2016/10/statuses/06/20161006-123817-1D8F3A91C37C5D1C4C19B1AD1D0CFBE8BF72D8E1
./out/relay-descriptors/microdesc/2016/10/consensus-microdesc/06/2016-10-06-12-00-00-consensus-microdesc
./out/relay-descriptors/microdesc/2016/10/micro/6/6/66d7428cefc72f63b26fbee812797cc6c4ff1c34729b79631bf6c1717d46e82e
./out/relay-descriptors/server-descriptor/2016/10/5/3/53c11f7232b3d7ef6501112eb1cdc2fb998b0197
./out/torperf/2016/10/06/moria-51200-2016-10-06.tpf
./recent/bridge-descriptors/statuses/20161012-073817-1D8F3A91C37C5D1C4C19B1AD1D0CFBE8BF72D8E1
./recent/relay-descriptors/microdescs/consensus-microdesc/2016-10-12-03-00-00-consensus-microdesc
./recent/relay-descriptors/server-descriptors/2016-10-12-07-05-00-server-descriptors
./recent/relay-descriptors/votes/2016-10-09-12-00-00-vote-23D15D965BC35114467363C165C4F724B64B4F66-732E4ED709E1E9F84AD19B686C18202DB1524410
  • For the sake of completeness, I believe that the following paths would have been generated by these classes:
./out/exit-lists/2016/10/12/2016-10-12-03-02-00
./out/relay-descriptors/consensus/2016/10/06/2016-10-06-12-00-00-consensus
./out/relay-descriptors/extra-info/2016/10/1/1/1135684f37075fa58f525216444512c4f64e2e9c
./out/relay-descriptors/vote/2016/10/02/2016-10-02-13-00-00-vote-E8A9C45EDE6D711294FADF8E7951F4DE6CA56B58-2A409B6A22D0266CBA2175E837B39C435BACA312
./recent/bridge-descriptors/extra-infos/2016-10-12-02-09-00-extra-infos
./recent/bridge-descriptors/server-descriptors/2016-10-12-02-09-00-server-descriptors
./recent/exit-lists/2016-10-12-03-02-00
./recent/relay-descriptors/consensuses/2016-10-11-12-00-00-consensus
./recent/relay-descriptors/extra-infos/2016-10-12-05-05-00-extra-infos
./recent/relay-descriptors/microdescs/micro/2016-10-12-09-05-00-micro
./recent/torperf/moria-51200-2016-10-12.tpf
  • In ConsensusPersistence, the check for null == desc.getConsensusFlavor() seems too fragile with respect to flavors we'll add in the future. We wouldn't notice except for microdesc consensuses being overwritten. We should check the actual consensus flavor if it's not null.
  • In DescriptorPersistence, I wonder if we should not accept StandardOpenOption or even StandardOpenOption ... to distinguish replace vs. append but rather just a boolean. It seems like an implementation level detail we're handing out to the caller. Are we sure we're implementing all options and in particular combination of options correctly? Those checks where we're only looking at how many such option values we're being given look really fragile. If we only want to support replace vs. append, we should only provide those two options in the interface.
  • Same class, method storeAll() should be more explicit in the method name or at least documentation that recent will only be written if writing archive is successful.
  • Same class, methods storeRecent() and storeOut() might have wrong argument orders in their Paths.get() calls, but I'm not sure.
  • ExtraInfoPersistence does not distinguish enough between relay and bridge extra-info descriptors. The latter use a different @type annotation which needs to be reflected in the constructor. It might be easier to use two distinct DescriptorPersistence classes for RelayExtraInfoDescriptor and BridgeExtraInfoDescriptor.
  • Same class, @type bridge-extra-info is not 1.0 but 1.3 right now. And we'll almost certainly forget to update the version number there in case we move to 1.4. Can we avoid having to remember that? This probably applies to all @type annotations, though the bridge descriptors numbers are most likely to change over time.
  • In PersistenceUtils, method storeToFileSystem(), that check for append.length == 0 could be changed to append.length <= i to make the method a tiny bit more robust against wrong usage.
  • Same class, are the dateDependentPath*() methods used anywhere? If not and if there are no plans to use them, can you remove them?
  • Same class, dateDependentPathYm() (assuming that it'll be used) does not indicate that it also includes the day in the path, and it appends the append part without dash, unlike dateDependentPathYmd().
  • ServerDescriptorPersistence: see comments for ExtraInfoPersistence.
  • StatusPersistence: why not call it BridgeNetworkStatusPersistence? But regardless of that, the version must be 1.1 and adapted whenever it updates. See above.
  • The following didn't become entirely clear from reading the code, so let me ask: are files written to temporary files like .somefile.tmp and later renamed to somefile? We should make sure we're doing that for files in recent/ that we append descriptors, because otherwise those half-written files will be indexed and made available, which we should avoid. Basically, if a file may change after being written, let's write it to a temp file and only rename it to its final file name when we're sure it won't change anymore.
  • Similar to the previous comments, I found a few formattings that should have been caught by checkstyle.

comment:26 in reply to:  23 ; Changed 3 years ago by karsten

Replying to iwakeh:

This way of configuring only works for the current relaydescs module (whose properties are supposed to be removed partially in #20162).

We're planning to remove properties there, but none of those properties would enable or disable a descriptor source.

The other two modules, bridgedescs and exitlists, don't have any additional properties that could be used for avoiding the 'SyncOnly' option.

Yes, we'd have to add two boolean options for those rather than one tri-state.

In general, it is nicer to have a configuration value that expresses what it means. The different download options for the relaydescs module where initially quite confusing.

I admit, I forget to enable descriptor sources from time to time myself when setting up a new CollecTor with relaydescs module for testing.

But maybe, I complicate things? So what other 'streamlined' way could achieve the SyncOnly setting?

Let's think about it. We have 1 module with soon 4 descriptor sources and 2 modules with soon 2 descriptor sources. How would you define config options to let operators activate each of the modules with any possible combination of descriptor sources? And keep in mind that we might want to add a 5th or 3rd descriptor source in the future.

comment:27 in reply to:  26 Changed 3 years ago by karsten

Replying to karsten:

Let's think about it. We have 1 module with soon 4 descriptor sources and 2 modules with soon 2 descriptor sources. How would you define config options to let operators activate each of the modules with any possible combination of descriptor sources? And keep in mind that we might want to add a 5th or 3rd descriptor source in the future.

Quick idea: how about we accept a list of descriptor sources for each module rather than have 4 or 2 separate boolean options? Example:

RelaydescsSources ImportCached, ImportDirectory, Download, Sync
BridgedescsSources SanitizeTarballs, Sync
ExitlistSources Download, Sync

comment:28 in reply to:  24 Changed 3 years ago by karsten

Replying to iwakeh:

Here is the second branch, so far only the properties changes.

Checkstyle doesn't complain about these changes.

Looks good! The only request would be that we make the change log entry slightly more verbose, like: "Replace four properties for configuring where to write descriptors by a single 'OutPath' property." Happy to make that change while merging. Should I merge this commit to master now?

comment:29 Changed 3 years ago by iwakeh

reply to comment 28: adapted changelog, this can go to master.

reply comment 27:
Hmm, let me think:

Well, *Sources is in use for the urls data is retrieved from, but these together with the path values for local imports could be named *Origins or similar.

The naming of all descriptor sources should be more generic. Actually, ImportDirectory, and SanitizedTarballs, all refer to a local import, i.e. the import SourceType enum provides: Cache, Local, Remote, Sync.
The corresponding sources properties could be <module><type>Origin, e.g. RelayCacheOrigins, BridgeLocalOrigins, BridgeSyncOrigins, etc.

That would shorten the properties list.

Ok, I can provide a commit with these changes, if we agree?

comment:30 Changed 3 years ago by iwakeh

Oh, I just noticed the comments 25 and 26. But, I guess the my reply to 26, 27 won't change after reading them.

Version 0, edited 3 years ago by iwakeh (next)

comment:31 Changed 3 years ago by karsten

Regarding your comment 29, yes, picking shorter names sounds fine to me. I would avoid renaming existing config options, but picking better names for new config options or values sounds great. Please prepare a commit with the described changes. Thanks!

comment:32 Changed 3 years ago by karsten

Adapted the changelog and pushed that other commit for OutPath to master. Thanks!

comment:33 Changed 3 years ago by karsten

Next in line is 4efd190:

  • In SyncManager, I think we'll have to obtain the collectionDate from the corresponding CollecTorMain subclass somehow, so that we're appending descriptors to the same files as previously downloaded or locally read descriptors. Otherwise we'd have two files in the recent/ directory per descriptor type, making the index larger than necessary. Let's avoid that if we can.
  • As stated in an earlier comment, there are unused variables and imports in this commit that should be removed.
  • Still in SyncManager, method mergeWithLocalStorage(), it seems that we're using the exact same path for parse history files for different descriptors. Example: we're reading descriptor type A with history file X, and when we're done we're reading descriptor type B with the same history file X. The effect is that we're not skipping any B descriptors and overwriting everything in X that has to do with A. Basically, we're reading the entire directory of synced descriptors in every execution. What we need is a distinct history file per descriptor type. Or we could use a single descriptor reader regardless of descriptor type with a single history file. By the way, this could have caused your OOM troubles.
  • Your log statement "Done reading {} of type {}." is misleading. The readDescriptors() method returns immediately, but that does not at all mean that there are any descriptors in the iterator at that point. A better message would be "Started reading ...".
  • In SyncPersistence, method storeDescs(), the documentation says "OutputDirectory" where it should say "OutputPath".
  • Regarding the three // comments in storeDescs(): it's indeed unfortunate that we need to infer the authority fingerprint of a bridge network status from the file name. We should consider adding a new line for that to sanitized bridge network statuses. That's entirely unrelated to the changes here but probably worth a ticket.
  • The second comment there is probably moot with microdescriptor syncing being postponed.
  • Regarding the third comment about exit lists, we're currently using the initial download time for storing in out/ and recent/. But I think that's something to ensure in ExitlistPersistence, not here, so the code here can stay unchanged.

comment:34 Changed 3 years ago by karsten

1e07139 looks good.

comment:35 in reply to:  31 ; Changed 3 years ago by iwakeh

Replying to karsten:

Regarding your comment 29, yes, picking shorter names sounds fine to me. I would avoid renaming existing config options, but picking better names for new config options or values sounds great. Please prepare a commit with the described changes. Thanks!

Please review this commit with the additional changes.

comment:36 Changed 3 years ago by iwakeh

The naming discussion should continue in #20162. Please, see comment 6 in #20162.

comment:37 Changed 3 years ago by karsten

And here are the last three commits from your initial branch, starting with 5eb89c1:

  • Looks good, modulo whitespace issues that should become visible with checkstyle.

And 4722ba1:

  • I'd want to avoid merging this and instead investigate the problem further. How would I reproduce the OOM? Run the previous commit with syncing from the main CollecTor instance and wait for how many hours?

Last and probably least, d2fcb1b:

  • We should simply squash this one into the earlier commit that added these lines. And we should add a note to collector.properties that microdescs are exempt from syncing. And we should create a ticket as a reminder to add microdescs to the sync part later on.

comment:38 in reply to:  37 Changed 3 years ago by iwakeh

Replying to karsten:

And here are the last three commits from your initial branch, starting with 5eb89c1:

Thanks for working your way through and all the useful comments/suggestions!
(I still need to reply to some.)

  • Looks good, modulo whitespace issues that should become visible with checkstyle.

The commits in the second branch all are checkstyle-compliant.

And 4722ba1:

  • I'd want to avoid merging this and instead investigate the problem further. How would I reproduce the OOM? Run the previous commit with syncing from the main CollecTor instance and wait for how many hours?

Yes, use the collector jar from the previous commit. Run it in a new location, i.e. with empty recent and out. Configure relaydescs and sync. RunOnce will be sufficient the OOM heap dump contains very many Reference objects. (Your assumption in one of the previous comments hinted in the right direction, I think, but I couldn't verify that yet.)

The sync into an empty place approximately takes an hour depending on the internet connection.

Last and probably least, d2fcb1b:

  • We should simply squash this one into the earlier commit that added these lines. And we should add a note to collector.properties that microdescs are exempt from syncing. And we should create a ticket as a reminder to add microdescs to the sync part later on.

I just don't add microdescs-sync to the new branch in first place and will open the new ticket.

comment:39 Changed 3 years ago by iwakeh

microdesc-sync ticket #20345 added.

comment:40 in reply to:  35 ; Changed 3 years ago by karsten

Replying to iwakeh:

Replying to karsten:

Regarding your comment 29, yes, picking shorter names sounds fine to me. I would avoid renaming existing config options, but picking better names for new config options or values sounds great. Please prepare a commit with the described changes. Thanks!

Please review this commit with the additional changes.

Thanks! Here are some minor comments on 36d0b44:

  • Should we hold back the four *Sync* options until there's actual support for syncing? I don't feel strongly about this, because we're going to add syncing support really soon. But just in case you say "oh wow, I totally forgot to take them out, let me fix that!", okay. Either way works for me.
  • The new options RelayLocalOrigins and BridgeLocalOrigins each only accept a single path. Should we rename them to singular then? Again, don't feel strongly.
  • The new option BridgeSources does not support Remote as stated in the comment of collector.properties.
  • Totally unrelated to this commit: should we kill the updateindex module and instead have an IndexManager with a synchronized update() method that gets called at the end of each module? Happy to move this to another ticket, but it just came to mind while looking at all the options...

comment:41 in reply to:  40 ; Changed 3 years ago by iwakeh

Replying to karsten:

...

  • Should we hold back the four *Sync* options until there's actual support for syncing? I don't feel strongly about this, because we're going to add syncing support really soon. But just in case you say "oh wow, I totally forgot to take them out, let me fix that!", okay. Either way works for me.

I extended the commit comment (see link below).

  • The new options RelayLocalOrigins and BridgeLocalOrigins each only accept a single path. Should we rename them to singular then? Again, don't feel strongly.

Thought about that , too, but finally decided to keep it a *Options.

  • The new option BridgeSources does not support Remote as stated in the comment of collector.properties.

Corrected in here (amended).

  • Totally unrelated to this commit: should we kill the updateindex module and instead have an IndexManager with a synchronized update() method that gets called at the end of each module? Happy to move this to another ticket, but it just came to mind while looking at all the options...

Currently, there is also the create tars script changing the file structure. This should become java code. => new ticket?
Index-update could then be changed from being a module into a feature of module runs. => second ticket?

comment:42 in reply to:  41 ; Changed 3 years ago by karsten

Replying to iwakeh:

Replying to karsten:

...

  • Should we hold back the four *Sync* options until there's actual support for syncing? I don't feel strongly about this, because we're going to add syncing support really soon. But just in case you say "oh wow, I totally forgot to take them out, let me fix that!", okay. Either way works for me.

I extended the commit comment (see link below).

  • The new options RelayLocalOrigins and BridgeLocalOrigins each only accept a single path. Should we rename them to singular then? Again, don't feel strongly.

Thought about that , too, but finally decided to keep it a *Options.

  • The new option BridgeSources does not support Remote as stated in the comment of collector.properties.

Corrected in here (amended).

Cherry-picked and pushed to master. Thanks!

  • Totally unrelated to this commit: should we kill the updateindex module and instead have an IndexManager with a synchronized update() method that gets called at the end of each module? Happy to move this to another ticket, but it just came to mind while looking at all the options...

Currently, there is also the create tars script changing the file structure. This should become java code. => new ticket?

Yes, please.

Index-update could then be changed from being a module into a feature of module runs. => second ticket?

Yes, please.

comment:43 in reply to:  42 Changed 3 years ago by iwakeh

Replying to karsten:

...

  • Totally unrelated to this commit: should we kill the updateindex module and instead have an IndexManager with a synchronized update() method that gets called at the end of each module? Happy to move this to another ticket, but it just came to mind while looking at all the options...

Currently, there is also the create tars script changing the file structure. This should become java code. => new ticket?

Yes, please.

Index-update could then be changed from being a module into a feature of module runs. => second ticket?

Yes, please.

Created #20350 and #20351.

comment:44 Changed 3 years ago by iwakeh

Please clarify the path differences from comment:25 above. I've been looking at the paths and the protocol for too long now that I caught some path-blindness (temporary).
It would be very helpful, if you could use the test-cases listed here and indicate what should be changed, i.e. basically which paths are expected. Then the tests can verify the structure.

The protocol might need to be changed, if this path ./out/bridge-descriptors/2016/10/extra-infos/0/0/0000000001764ef8b8b5bc9ed70b9e99225112ffd04 is correct. It says:

5.2 'bridge-descriptors' below 'out'

   'bridge-descriptors' contains the following subdirectories:

   * extra-infos
   * server-descriptors
   * statuses

5.2.1
   'extra-infos' and 'server-descriptors' have the following
   subdirectory structure

   year SEP month SEP first SEP second

   Where year is derived from the published date.
   'first' and 'second' are the first and second symbol from the
   router-digest, which also serves as the filename for the files
   in the 'second' level directories.

   Tars are named according to section 2.3 and have the following
   substructure using the definitions from 2.3:

   BRIDGE DASH marker DASH year DASH month SEP first SEP second

5.2.2
   'statuses' have a different substructure

   year SEP month SEP day

comment:45 in reply to:  25 Changed 3 years ago by iwakeh

I think below I reply to all of 25 except the paths structures, which will have to wait for the commit of the tests.
Changes can be found in the new commit of the second branch.

... - All classes in the new persist package contains unused imports or unused attributes. Can you remove those in a fixup commit?

Removed in second branch.

  • Can we somehow use printf patterns for putting together paths to make that bunch of DescriptorPersistence classes easier to read? It should be possible to express all paths using placeholders for dates/times, strings, and chars. And if we care about path separators being platform independent, we could split the resulting path at / characters and put it back together using Paths.get(). Untested example for a bridge extra-info descriptor:
String.format("bridge-descriptors/%tY/%<tm/extra-infos/%c/%c/%s",
    desc.getPublishedMillis(),
    desc.getExtraInfoDigest().charAt(0),
    desc.getExtraInfoDigest().charAt(1),
    desc.getExtraInfoDigest());

What makes this more readable to me, is the code formatting, i.e. one line per path element. Whereas, I don't read string format patterns that often, which makes it less readable to me. (String.format is also locale dependent and would have to have an additional argument. )
Added the one line per path element to the new branch as well as some comments for the path elements. Hope this works for you too?

  • Building upon the previous idea, maybe we can avoid having DescriptorPersistence classes at all if all we need is a type annotation and two printf patterns. We could define an order of printf arguments for all patterns, like: published, received, source, digest, first char of digest, second char of digest. That would really save a few lines of code, wouldn't it? Maybe something for later?

The path deriving code surely can be done nicer. The main goal for the current code was to extract the somewhat hidden (in huge methods/classes) path structures and have them in a defined location, which opens the door for future simplifications.
Having persistence classes might be useful for other types of persistence or the statistics (cf. #20335, #20345).

...

  • In ConsensusPersistence, the check for null == desc.getConsensusFlavor() seems too fragile with respect to flavors we'll add in the future. We wouldn't notice except for microdesc consensuses being overwritten. We should check the actual consensus flavor if it's not null.

If flavor is something other than "microdesc", the consensus type will be plain consensus.

  • In DescriptorPersistence, I wonder if we should not accept StandardOpenOption or even StandardOpenOption ... to distinguish replace vs. append but rather just a boolean. It seems like an implementation level detail we're handing out to the caller. Are we sure we're implementing all options and in particular combination of options correctly? Those checks where we're only looking at how many such option values we're being given look really fragile. If we only want to support replace vs. append, we should only provide those two options in the interface.
  • Same class, method storeAll() should be more explicit in the method name or at least documentation that recent will only be written if writing archive is successful.

The StandardOpenOptions are defined in the java api, so there is no implementation to be done. There are more than two options of interest (afaict): create_new, append, and truncate_existing, i.e. do not do anything in case the file exists, append, and replace.
I added more javadoc and overloaded the methods, so we have methods w/o options, and other methods with exactly one option per file.

  • Same class, methods storeRecent() and storeOut() might have wrong argument orders in their Paths.get() calls, but I'm not sure.

I renamed the parameters to recentRoot and outRoot to better express their meaning.

  • In PersistenceUtils, method storeToFileSystem(), that check for append.length == 0 could be changed to append.length <= i to make the method a tiny bit more robust against wrong usage.

I thought the same, so the method wasn't in use and is now removed.

  • Same class, are the dateDependentPath*() methods used anywhere? If not and if there are no plans to use them, can you remove them?

Those are gone in the second branch.

...

  • The following didn't become entirely clear from reading the code, so let me ask: are files written to temporary files like .somefile.tmp and later renamed to somefile? We should make sure we're doing that for files in recent/ that we append descriptors, because otherwise those half-written files will be indexed and made available, which we should avoid. Basically, if a file may change after being written, let's write it to a temp file and only rename it to its final file name when we're sure it won't change anymore.

Files for 'recent' are first written to a temporary file and then moved to the final location.

  • Similar to the previous comments, I found a few formattings that should have been caught by checkstyle.

There were no complaints, checked again for the changes for the commit on the second branch.

Regarding @type annotations: I introduced an Annotation enum, which contains all @type annotations. All occurances of @type strings are replaced. So, now these annotations are in one place. But, don't they actually belong to metrics-lib? There the parser could check for unknown annotation versions or types.

comment:46 Changed 3 years ago by karsten

Here's a diff of paths in the test class to reflect paths in the currently deployed code:

diff --git a/src/test/java/org/torproject/collector/sync/SyncPersistenceTest.java b/src/test/java/org/torproject/collector/sync/SyncPersistenceTest.java
index ad5ee6a..cdb90b8 100644
--- a/src/test/java/org/torproject/collector/sync/SyncPersistenceTest.java
+++ b/src/test/java/org/torproject/collector/sync/SyncPersistenceTest.java
@@ -58,15 +58,15 @@ public class SyncPersistenceTest {
   @Parameters
   public static Collection<Object[]> pathFilename() {
     return Arrays.asList(new Object[][] {
-        {"exit-lists/2016-10-05-19-06-17", // expected recent path
+        {"exit-lists/2016-09-20-13-02-00", // expected recent path
          "exit-lists/2016/09/20/2016-09-20-13-02-00", // expected out path
          "2016-09-20-13-02-00", // test-filename
          Integer.valueOf(1), // expected recent count of descs files
          Integer.valueOf(1)}, // expected output count of descs files
 
         {"relay-descriptors/microdescs/consensus-microdesc/"
-             + "2016-10-05-19-06-17-consensus-microdesc",
-         "relay-descriptors/microdescs/2016/10/consensus-microdesc/"
+             + "2016-10-02-17-00-00-consensus-microdesc",
+         "relay-descriptors/microdesc/2016/10/consensus-microdesc/"
              + "02/2016-10-02-17-00-00-consensus-microdesc",
          "2016-10-02-17-00-00-consensus-microdesc",
          Integer.valueOf(1),
@@ -74,7 +74,7 @@ public class SyncPersistenceTest {
 
         {"bridge-descriptors/server-descriptors/"
              + "2016-10-05-19-06-17-server-descriptors",
-         "bridge-descriptors/server-descriptor/2016/10/"
+         "bridge-descriptors/2016/10/server-descriptor/"
              + "A/8/A8A5509AD1393C8F36ABD2D8F0DE1BB751926872",
          "bridge-2016-10-02-16-09-00-server-descriptors",
          Integer.valueOf(1),
@@ -88,28 +88,30 @@ public class SyncPersistenceTest {
          Integer.valueOf(1),
          Integer.valueOf(10)},
 
-        {"relay-descriptors/consensuses/2016-10-05-19-06-17-consensus",
+        {"relay-descriptors/consensuses/2016-09-20-13-00-00-consensus",
          "relay-descriptors/consensus/2016/09/20/2016-09-20-13-00-00-consensus",
          "2016-09-20-13-00-00-consensus",
          Integer.valueOf(1),
          Integer.valueOf(1)},
 
         {"bridge-descriptors/statuses/"
-             + "20161005-190617-1D8F3A91C37C5D1C4C19B1AD1D0CFBE8BF72D8E1",
-         "bridge-descriptors/statuses/2016/09/20/"
+             + "20160920-063816-1D8F3A91C37C5D1C4C19B1AD1D0CFBE8BF72D8E1",
+         "bridge-descriptors/2016/09/statuses/20/"
              + "20160920-063816-1D8F3A91C37C5D1C4C19B1AD1D0CFBE8BF72D8E1",
          "20160920-063816-1D8F3A91C37C5D1C4C19B1AD1D0CFBE8BF72D8E1",
          Integer.valueOf(1),
          Integer.valueOf(1)},
 
         {"relay-descriptors/microdescs/micro/2016-10-05-19-06-17-micro",
-         "relay-descriptors/microdescs/2016/10/micro/f/b/"
+         "relay-descriptors/microdesc/2016/10/micro/f/b/"
              + "fbf0c4cb8216d950d267866ea879880e69216bd0b46878ff0f51941ee4eca707",
          "2016-09-20-15-31-19-micro",
          Integer.valueOf(1),
          Integer.valueOf(5)},
 
-        {"relay-descriptors/votes/2016-10-05-19-06-17-votes",
+        {"relay-descriptors/votes/2016/10/01/2016-10-01-16-00-00-vote"
+             + "-0232AF901C31A04EE9848595AF9BB7620D4C5B2E"
+             + "-FEE63B4AB7CE5A6BDD09E9A5C4F01BD61EB7E4F1",
          "relay-descriptors/vote/2016/10/01/2016-10-01-16-00-00-vote"
              + "-0232AF901C31A04EE9848595AF9BB7620D4C5B2E"
              + "-FEE63B4AB7CE5A6BDD09E9A5C4F01BD61EB7E4F1",
@@ -118,7 +120,9 @@ public class SyncPersistenceTest {
          Integer.valueOf(1),
          Integer.valueOf(1)},
 
-        {"relay-descriptors/votes/2016-10-05-19-06-17-votes",
+        {"relay-descriptors/votes/2016/09/20/2016-09-20-13-00-00-vote-"
+             + "49015F787433103580E3B66A1707A00E60F2D15B"
+             + "-60ADC6BEC262AE921A1037D54C8A3976367DBE87",
          "relay-descriptors/vote/2016/09/20/2016-09-20-13-00-00-vote-"
              + "49015F787433103580E3B66A1707A00E60F2D15B"
              + "-60ADC6BEC262AE921A1037D54C8A3976367DBE87",
@@ -128,7 +132,7 @@ public class SyncPersistenceTest {
          Integer.valueOf(1)},
 
         {"bridge-descriptors/extra-infos/2016-10-05-19-06-17-extra-infos",
-         "bridge-descriptors/extra-info/2016/10/9/F/"
+         "bridge-descriptors/2016/10/extra-info/9/F/"
              + "9F88A7C2ABE6665D204137BA8C2661D42E7C2829",
          "bridge-2016-10-02-08-09-00-extra-infos",
          Integer.valueOf(1),

I'll now re-read this ticket and respond to any open parts.

comment:47 Changed 3 years ago by karsten

Regarding the protocol, you're probably right that it's still not entirely correct. To be honest, I'm having a hard time understanding the format there. Would you mind looking at the test path changes above and adapting the protocol accordingly? I can take another look at the protocol changes.

Okay, after reading/skimming this ticket, I think it would make sense for me to review that second branch with all the fixes, but I only see 1 commit there that is not yet in master. Can you put together a branch with all the changes somewhere? Or if that's not possible, can you point to specific commits that I should be looking at?

comment:48 in reply to:  47 Changed 3 years ago by iwakeh

Replying to karsten:

Regarding the protocol, you're probably right that it's still not entirely correct. To be honest, I'm having a hard time understanding the format there. Would you mind looking at the test path changes above and adapting the protocol accordingly? I can take another look at the protocol changes.

Thanks for adding the test cases!
I'll adapt the protocol after finishing the implementation changes and might add examples to clarify the definition.

Okay, after reading/skimming this ticket, I think it would make sense for me to review that second branch with all the fixes, but I only see 1 commit there that is not yet in master. Can you put together a branch with all the changes somewhere? Or if that's not possible, can you point to specific commits that I should be looking at?

Ok, just wait with the review for the completed changes.

comment:49 in reply to:  46 ; Changed 3 years ago by iwakeh

Replying to karsten:

I found the a good reference for the paths topic: it's in the create tars script:

  $OUTDIR/exit-lists/$YEARONE/$MONTHONE/
  $OUTDIR/exit-lists/$YEARTWO/$MONTHTWO/
  $OUTDIR/torperf/$YEARONE/$MONTHONE/
  $OUTDIR/torperf/$YEARTWO/$MONTHTWO/
  $OUTDIR/relay-descriptors/certs/
  $OUTDIR/relay-descriptors/microdesc/$YEARONE/$MONTHONE
  $OUTDIR/relay-descriptors/microdesc/$YEARTWO/$MONTHTWO
  $OUTDIR/relay-descriptors/consensus/$YEARONE/$MONTHONE
  $OUTDIR/relay-descriptors/consensus/$YEARTWO/$MONTHTWO
  $OUTDIR/relay-descriptors/vote/$YEARONE/$MONTHONE/
  $OUTDIR/relay-descriptors/vote/$YEARTWO/$MONTHTWO/
  $OUTDIR/relay-descriptors/server-descriptor/$YEARONE/$MONTHONE/
  $OUTDIR/relay-descriptors/server-descriptor/$YEARTWO/$MONTHTWO/
  $OUTDIR/relay-descriptors/extra-info/$YEARONE/$MONTHONE/
  $OUTDIR/relay-descriptors/extra-info/$YEARTWO/$MONTHTWO/
  $OUTDIR/bridge-descriptors/$YEARONE/$MONTHONE/statuses/
  $OUTDIR/bridge-descriptors/$YEARTWO/$MONTHTWO/statuses/
  $OUTDIR/bridge-descriptors/$YEARONE/$MONTHONE/server-descriptors/
  $OUTDIR/bridge-descriptors/$YEARTWO/$MONTHTWO/server-descriptors/
  $OUTDIR/bridge-descriptors/$YEARONE/$MONTHONE/extra-infos/
  $OUTDIR/bridge-descriptors/$YEARTWO/$MONTHTWO/extra-infos/

The above makes an inconsistency between relay descriptor and bridge descriptor paths visible.

As the new structure from #20228, i.e. votes also grouped and all dates in recent paths are derived from the acquisition time, will be in place soon, I'd like to use the 'new way' already for the merge.

With these as background the test diff has some soon to be outdated corrections (comments added after ''):

diff --git a/src/test/java/org/torproject/collector/sync/SyncPersistenceTest.java b/src/test/java/org/torproject/collector/sync/SyncPersistenceTest.java
index ad5ee6a..cdb90b8 100644
--- a/src/test/java/org/torproject/collector/sync/SyncPersistenceTest.java
+++ b/src/test/java/org/torproject/collector/sync/SyncPersistenceTest.java
@@ -58,15 +58,15 @@ public class SyncPersistenceTest {
   @Parameters
   public static Collection<Object[]> pathFilename() {
     return Arrays.asList(new Object[][] {
-        {"exit-lists/2016-10-05-19-06-17", // expected recent path  // acqu-date, which will be ok
+        {"exit-lists/2016-09-20-13-02-00", // expected recent path
          "exit-lists/2016/09/20/2016-09-20-13-02-00", // expected out path
          "2016-09-20-13-02-00", // test-filename
          Integer.valueOf(1), // expected recent count of descs files
          Integer.valueOf(1)}, // expected output count of descs files
 
         {"relay-descriptors/microdescs/consensus-microdesc/"
-             + "2016-10-05-19-06-17-consensus-microdesc", // acqu-date, which will be ok
-         "relay-descriptors/microdescs/2016/10/consensus-microdesc/"
+             + "2016-10-02-17-00-00-consensus-microdesc",
+         "relay-descriptors/microdesc/2016/10/consensus-microdesc/"
              + "02/2016-10-02-17-00-00-consensus-microdesc",
          "2016-10-02-17-00-00-consensus-microdesc",
          Integer.valueOf(1),
@@ -74,7 +74,7 @@ public class SyncPersistenceTest {
 
         {"bridge-descriptors/server-descriptors/"
              + "2016-10-05-19-06-17-server-descriptors",
-         "bridge-descriptors/server-descriptor/2016/10/"
+         "bridge-descriptors/2016/10/server-descriptor/"
              + "A/8/A8A5509AD1393C8F36ABD2D8F0DE1BB751926872",
          "bridge-2016-10-02-16-09-00-server-descriptors",
          Integer.valueOf(1),
@@ -88,28 +88,30 @@ public class SyncPersistenceTest {
          Integer.valueOf(1),
          Integer.valueOf(10)},
 
-        {"relay-descriptors/consensuses/2016-10-05-19-06-17-consensus", // acqu-date, which will be ok
+        {"relay-descriptors/consensuses/2016-09-20-13-00-00-consensus",
          "relay-descriptors/consensus/2016/09/20/2016-09-20-13-00-00-consensus",
          "2016-09-20-13-00-00-consensus",
          Integer.valueOf(1),
          Integer.valueOf(1)},
 
         {"bridge-descriptors/statuses/"
-             + "20161005-190617-1D8F3A91C37C5D1C4C19B1AD1D0CFBE8BF72D8E1", // acqu-date, which will be ok
-         "bridge-descriptors/statuses/2016/09/20/"
+             + "20160920-063816-1D8F3A91C37C5D1C4C19B1AD1D0CFBE8BF72D8E1",
+         "bridge-descriptors/2016/09/statuses/20/"
              + "20160920-063816-1D8F3A91C37C5D1C4C19B1AD1D0CFBE8BF72D8E1",
          "20160920-063816-1D8F3A91C37C5D1C4C19B1AD1D0CFBE8BF72D8E1",
          Integer.valueOf(1),
          Integer.valueOf(1)},
 
         {"relay-descriptors/microdescs/micro/2016-10-05-19-06-17-micro",
-         "relay-descriptors/microdescs/2016/10/micro/f/b/"
+         "relay-descriptors/microdesc/2016/10/micro/f/b/"
              + "fbf0c4cb8216d950d267866ea879880e69216bd0b46878ff0f51941ee4eca707",
          "2016-09-20-15-31-19-micro",
          Integer.valueOf(1),
          Integer.valueOf(5)},
 
-        {"relay-descriptors/votes/2016-10-05-19-06-17-votes", // acqu-date, which will be ok
+        {"relay-descriptors/votes/2016/10/01/2016-10-01-16-00-00-vote"
+             + "-0232AF901C31A04EE9848595AF9BB7620D4C5B2E"
+             + "-FEE63B4AB7CE5A6BDD09E9A5C4F01BD61EB7E4F1",
          "relay-descriptors/vote/2016/10/01/2016-10-01-16-00-00-vote"
              + "-0232AF901C31A04EE9848595AF9BB7620D4C5B2E"
              + "-FEE63B4AB7CE5A6BDD09E9A5C4F01BD61EB7E4F1",
@@ -118,7 +120,9 @@ public class SyncPersistenceTest {
          Integer.valueOf(1),
          Integer.valueOf(1)},
 
-        {"relay-descriptors/votes/2016-10-05-19-06-17-votes", // acqu-date, which will be ok
+        {"relay-descriptors/votes/2016/09/20/2016-09-20-13-00-00-vote-"
+             + "49015F787433103580E3B66A1707A00E60F2D15B"
+             + "-60ADC6BEC262AE921A1037D54C8A3976367DBE87",
          "relay-descriptors/vote/2016/09/20/2016-09-20-13-00-00-vote-"
              + "49015F787433103580E3B66A1707A00E60F2D15B"
              + "-60ADC6BEC262AE921A1037D54C8A3976367DBE87",
@@ -128,7 +132,7 @@ public class SyncPersistenceTest {
          Integer.valueOf(1)},
 
         {"bridge-descriptors/extra-infos/2016-10-05-19-06-17-extra-infos",
-         "bridge-descriptors/extra-info/2016/10/9/F/"
+         "bridge-descriptors/2016/10/extra-info/9/F/"
              + "9F88A7C2ABE6665D204137BA8C2661D42E7C2829",
          "bridge-2016-10-02-08-09-00-extra-infos",
          Integer.valueOf(1),

When changing the structure now shouldn't relays and bridges be stored in similar ways? I think the current difference (in 'out') caused some of the confusion here.

comment:50 in reply to:  49 Changed 3 years ago by karsten

Replying to iwakeh:

Replying to karsten:

I found the a good reference for the paths topic: it's in the create tars script: [...]

Indeed, that script reflects what's on the server.

The above makes an inconsistency between relay descriptor and bridge descriptor paths visible.

Right. The reason is that we recently switched from bridge descriptor tarballs containing all bridge descriptors to separate tarballs for each type. I didn't change the directory structure to avoid breaking stuff. In theory, it would have been an easy change, but going back to a previous version would have become more difficult. And it didn't seem necessary to change the directory structure, because nobody would look at it anyway.

As the new structure from #20228, i.e. votes also grouped and all dates in recent paths are derived from the acquisition time, will be in place soon, I'd like to use the 'new way' already for the merge.

With these as background the test diff has some soon to be outdated corrections (comments added after ''):

[...]

When changing the structure now shouldn't relays and bridges be stored in similar ways? I think the current difference (in 'out') caused some of the confusion here.

Agreed on the acquisition time thing. We could change that now.

Regarding the directory structure, as I said above, I didn't think this was necessary a few weeks ago, but I see your point that it's confusing for contributors.

So, if we change paths in out/ now, should we use the exact same directory structure as for the recent/ directory? Whoever runs a CollecTor instance now will have to stop it, move and rename some directories, upgrade, and restart. What we should avoid is change some paths now and some more paths in a few weeks.

comment:51 Changed 3 years ago by iwakeh

Please review this third branch.

It contains two minor commits and a large big one (on top). The final sync functionality with disabled ReferenceChecker.

I added tests for also checking the content of descriptor files. Thus, the tests can now catch annotation issues. To have one place for all annotation I introduced o.t.c.conf.Annotation, which at some point should become metrics-lib code.

The persist package is corrected and the early line breaks in the paths descriptions are there for more readability.

Files to 'recent' are first written to a temporary file and then renamed.

The current paths obey the new path structure. The way the paths are defined now it is an easy task to change the structure in case if necessary.

I think it is better to have it in one commit as all things belong together and should be read and reviewed as a whole.

Well, can't think of anything else right now.

comment:52 Changed 3 years ago by iwakeh

I also adapted the sync-urls for index.json.

comment:53 Changed 3 years ago by karsten

So, I reviewed about half of your -3 branch now but still more time tomorrow to finish the review. Here's a quick suggestion though:

Should we put out a new metrics-lib release tomorrow that this branch will be "based on"? I could imagine having the following changes in that metrics-lib release:

  • DescriptorIndexCollector is the new default.
  • That same class uses /index/index.json[.gz] as path if it only receives a base URL. I believe I suggested such a thing on #20039.
  • metrics-lib master already contains a fix that would be useful to have for new CollecTor operators: "Avoid running into an IOException and logging a warning for it." (#20320)

Note that I'm not suggesting to fix the 2G file size thing in that release. Let's stay realistic. ;)

If we do this, are there any other metrics-lib changes that we should include in the release that would help us with this CollecTor change?

More feedback on the -3 branch tomorrow. Please don't amend the commits in that branch until then.

comment:54 Changed 2 years ago by iwakeh

Yes, a release would be great!
All three changes you name are important for CollecTor 1.1.0.

Should it be metrics-lib 1.4.1?
We need a new milestone for tracking the release.

I'll look again, but I think these are the most relevant for now. And, it's better to keep the release small in order to have it out quickly.

comment:55 Changed 2 years ago by karsten

metrics-lib 1.4.1 sounds good, I just created a milestone for that. Thanks!

comment:56 Changed 2 years ago by karsten

Alright! Please find branch task-18910-3 in my repository. I created fixup/squash commits for most things I found. Please take a close look and challenge anything you don't like to be squashed into your commit. In particular, please take a look why the unit tests don't pass anymore. ;)

More comments below:

  • I believe that index.json would include the .tmp files written during sync, because we're only skipping files starting with .. And that's not even a new issue, because the .tmp files written during normal updating would also be included. The issue will only become more prevalent. I'd say we adapt the indexer, which can happen in a separate commit. Or we could change temporary files to start with ., but that seems a bit more error-prone. Suggestion: fix
  • As mentioned before somewhere, we're appending server/extra-info descriptors to different files in the synchronization run than in the earlier update run. The reason is that SyncManager creates its own collectionDate in the constructor. However, we cannot just pass a Date instance to both SyncManager and ArchiveWriter, because then SyncManager would append to a file that we're already serving, in which case the file size would change once we're done synchronizing, and we should try to avoid that. Let's fix this as soon as the current updaters are modules just like the synchronization module, and let's keep in mind not to close and rename temporary files to their final names between modules. Suggestion: postpone
  • As said earlier, let's release metrics-lib and use that new release as basis for this patch. Then we won't have to import and instantiate org.torproject.descriptor.index.DescriptorIndexCollector anymore, and we can change defaults for the *SyncOrigins properties in collector.properties back to https://collector.torproject.org. Suggestion: change
  • Regarding comment 25 above, I'm still a fan of using printf as opposed to creating SimpleDateFormat instances and concatenating strings. I also believe it would be more efficient. But I'm happy to postpone that change. Suggestion: postpone

comment:57 in reply to:  56 Changed 2 years ago by iwakeh

Thanks for working through all that code so quickly!

Replying to karsten:

Alright! Please find branch task-18910-3 in my repository. I created fixup/squash commits for most things I found. Please take a close look and challenge anything you don't like to be squashed into your commit. In particular, please take a look why the unit tests don't pass anymore. ;)

Fine. I'll start testing after finishing this reply.

More comments below:

  • I believe that index.json would include the .tmp files written during sync, because we're only skipping files starting with .. And that's not even a new issue, because the .tmp files written during normal updating would also be included. The issue will only become more prevalent. I'd say we adapt the indexer, which can happen in a separate commit. Or we could change temporary files to start with ., but that seems a bit more error-prone. Suggestion: fix

That's ok. Eventually the regular modules should use the classes in o.t.c.persist for storing descriptors (that will reduce the line count in them tremendously and finally unify writing descriptors). So, using the "." prefix here now is fine.

  • As mentioned before somewhere, we're appending server/extra-info descriptors to different files in the synchronization run than in the earlier update run. The reason is that SyncManager creates its own collectionDate in the constructor. However, we cannot just pass a Date instance to both SyncManager and ArchiveWriter, because then SyncManager would append to a file that we're already serving, in which case the file size would change once we're done synchronizing, and we should try to avoid that. Let's fix this as soon as the current updaters are modules just like the synchronization module, and let's keep in mind not to close and rename temporary files to their final names between modules. Suggestion: postpone

Agreed.

  • As said earlier, let's release metrics-lib and use that new release as basis for this patch. Then we won't have to import and instantiate org.torproject.descriptor.index.DescriptorIndexCollector anymore, and we can change defaults for the *SyncOrigins properties in collector.properties back to https://collector.torproject.org. Suggestion: change

Fine!
All these changes are reasons for metrics-lib release 1.4.1.

  • Regarding comment 25 above, I'm still a fan of using printf as opposed to creating SimpleDateFormat instances and concatenating strings. I also believe it would be more efficient. But I'm happy to postpone that change. Suggestion: postpone

Yes, it might be good to think a little about that. The timestamp to string functionality might also be a candidate for 'metrics-tools' (cf. #20405), b/c it's used throughout Metrics' products.

comment:58 Changed 2 years ago by iwakeh

The changes look good.

Please take a look at the top three commits here.

Just to not forget:
The metrics-lib version needs to be updated after the 1.4.1 release.

comment:59 Changed 2 years ago by karsten

Those three commits look good! Mind if I merge the first two into b998073?

(Do you know how to use git commit --fixup b998073 or git commit --squash b998073? I find those really helpful to avoid changing existing commits and yet making it clear how commits may be squashed during the merge for future generations to make sense of them.)

And yes, we'll need to update to the next metrics-lib release once it's out.

Oh, here's another thing we still need to do: we should make the change log more comprehensible for other operators. I have some ideas for improving what's there, but if you'd like to give that a try first, please feel free.

comment:60 in reply to:  59 Changed 2 years ago by iwakeh

Replying to karsten:

Those three commits look good! Mind if I merge the first two into b998073?

That's fine.

(Do you know how to use git commit --fixup b998073 or git commit --squash b998073? I find those really helpful to avoid changing existing commits and yet making it clear how commits may be squashed during the merge for future generations to make sense of them.)

Yes, but just didn't think of using them.

And yes, we'll need to update to the next metrics-lib release once it's out.

Oh, here's another thing we still need to do: we should make the change log more comprehensible for other operators. I have some ideas for improving what's there, but if you'd like to give that a try first, please feel free.

Just go ahead, I'm still too much in the technical mindset right now.

comment:61 Changed 2 years ago by karsten

Alright, please find branch task-18910-4 in my repository.

Things left to do:

  • Let me know if the change log looks okay.
  • Find out why unit tests fail.
  • Upgrade to next metrics-lib as soon as it's released.
  • Test on a server.
  • Squash commits and merge.

comment:62 in reply to:  61 ; Changed 2 years ago by iwakeh

Replying to karsten:

Alright, please find branch task-18910-4 in my repository.

This branch doesn't exist yet.

comment:63 in reply to:  62 Changed 2 years ago by karsten

Replying to iwakeh:

Replying to karsten:

Alright, please find branch task-18910-4 in my repository.

This branch doesn't exist yet.

Oops, it should exist now.

comment:64 Changed 2 years ago by iwakeh

All unit tests pass on the branch as it is. I assume there was a 'collector.properties' left in your path; the test output should state that. Which reminds me, that the path for jars should be 'generated/dist' as in metrics-lib, for example.

comment:65 Changed 2 years ago by iwakeh

Here's the diff for build.xml

diff --git a/build.xml b/build.xml
index 7266e2d..a13a97e 100644
--- a/build.xml
+++ b/build.xml
@@ -7,6 +7,7 @@
   <property name="resources" value="src/main/resources/"/>
   <property name="webappsources" value="src/main/webapp"/>
   <property name="generated" value="generated/"/>
+  <property name="dist" value="${generated}/dist"/>
   <property name="testresult" value="${generated}/test-results"/>
   <property name="testsources" value="src/test/java"/>
   <property name="testresources" value="src/test/resources/"/>
@@ -17,11 +18,11 @@
   <property name="docs" value="${generated}/javadoc/"/>
   <property name="libs" value="lib"/>
   <property name="cobertura.ser.file" value="${basedir}/cobertura.ser" />
-  <property name="jarfile" value="collector-${release.version}.jar" />
-  <property name="jarsourcesfile" value="collector-${release.version}-sources.jar" />
-  <property name="jarjavadocfile" value="collector-${release.version}-javadoc.jar" />
+  <property name="jarfile" value="${dist}/collector-${release.version}.jar" />
+  <property name="jarsourcesfile" value="${dist}/collector-${release.version}-sources.jar" />
+  <property name="jarjavadocfile" value="${dist}/collector-${release.version}-javadoc.jar" />
   <property name="release.tarball"
-            value="collector-${release.version}.tar.gz" />
+            value="${dist}/collector-${release.version}.tar.gz" />
   <property file="build.properties" />
   <patternset id="runtime" >
       <include name="commons-codec-1.9.jar"/>
@@ -90,10 +91,6 @@
       <fileset dir="${generated}" defaultexcludes="false" includes="**" />
     </delete>
     <delete file="${cobertura.ser.file}" quiet="true"/>
-    <delete file="${jarfile}" quiet="true"/>
-    <delete file="${jarsourcesfile}" quiet="true"/>
-    <delete file="${jarjavadocfile}" quiet="true"/>
-    <delete file="${release.tarball}"/>
   </target>
   <target name="compile" depends="init">
     <javac srcdir="${sources}"

comment:66 Changed 2 years ago by iwakeh

All tests also pass using the pre-release for descriptor-1.5.0, incl. the usually ignored test.

comment:67 Changed 2 years ago by karsten

Please take a look at the new commits in my updated task-18910-4 branch.

Remaining tasks:

  • Find out why unit tests fail. (See below.)
  • Test on a server.
  • Squash commits and merge.

Here's the test output with 4 failing tests, both on OS X and on Linux:

    [junit] Testsuite: org.torproject.collector.sync.SyncPersistenceTest
    [junit] Tests run: 40, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 30.708 sec
    [junit] 
    [junit] Testcase: testDescWriteOutput[0] took 2.044 sec
    [junit] Testcase: testDescWriteRecent[0] took 0.523 sec
    [junit] Testcase: testOutFileContent[0] took 0.396 sec
    [junit] Testcase: testRecentFileContent[0] took 0.389 sec
    [junit] Testcase: testDescWriteOutput[1] took 0.454 sec
    [junit] 	FAILED
    [junit] data used: relay-descriptors/server-descriptor/2016/10/e/3/e381ce74a1a592f6d375706665aba6d4d22923f1, relay-2016-10-02-16-05-00-server-descriptors, resulting list: [/tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/c/8/c8c3588019f7c896eb4185cfc1074cfe5eb405ea, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/4/1/4179c50d3c764bc85c9d719e14e55a6cc232a10d, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/2/0/2091f76a8256e479cbe4f57be85f87909af07236, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/b/b/bbca7ed70ba6ea88f995b067a004f5a4d0903d6e, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/d/a/dae8966ca600b46bc75ed5efb97286481e9a6876, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/e/3/e381ce74a1a592f6d375706665aba6d4d22923f1, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/e/1/e1142337dee5b890393a0891acbde51577c2b743, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/a/0/a0ed9227a9413f140445002ce412f8828591e7ec, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/5/b/5b202650802a916f1ec3a1ef36b98706e3747701, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/5/a/5a536243bf056cd7177ddfd8eb363fec978f3343] expected:<...-descriptor/2016/10/[e/3/e381ce74a1a592f6d375706665aba6d4d22923f1]> but was:<...-descriptor/2016/10/[c/8/c8c3588019f7c896eb4185cfc1074cfe5eb405ea]>
    [junit] junit.framework.AssertionFailedError: data used: relay-descriptors/server-descriptor/2016/10/e/3/e381ce74a1a592f6d375706665aba6d4d22923f1, relay-2016-10-02-16-05-00-server-descriptors, resulting list: [/tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/c/8/c8c3588019f7c896eb4185cfc1074cfe5eb405ea, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/4/1/4179c50d3c764bc85c9d719e14e55a6cc232a10d, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/2/0/2091f76a8256e479cbe4f57be85f87909af07236, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/b/b/bbca7ed70ba6ea88f995b067a004f5a4d0903d6e, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/d/a/dae8966ca600b46bc75ed5efb97286481e9a6876, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/e/3/e381ce74a1a592f6d375706665aba6d4d22923f1, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/e/1/e1142337dee5b890393a0891acbde51577c2b743, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/a/0/a0ed9227a9413f140445002ce412f8828591e7ec, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/5/b/5b202650802a916f1ec3a1ef36b98706e3747701, /tmp/junit1420314177894412746/out/relay-descriptors/server-descriptor/2016/10/5/a/5a536243bf056cd7177ddfd8eb363fec978f3343] expected:<...-descriptor/2016/10/[e/3/e381ce74a1a592f6d375706665aba6d4d22923f1]> but was:<...-descriptor/2016/10/[c/8/c8c3588019f7c896eb4185cfc1074cfe5eb405ea]>
    [junit] 	at org.torproject.collector.sync.SyncPersistenceTest.testDescWriteOutput(SyncPersistenceTest.java:179)
    [junit] 
    [junit] Testcase: testDescWriteRecent[1] took 0.057 sec
    [junit] Testcase: testOutFileContent[1] took 0.1 sec
    [junit] Testcase: testRecentFileContent[1] took 0.058 sec
    [junit] Testcase: testDescWriteOutput[2] took 1.398 sec
    [junit] Testcase: testDescWriteRecent[2] took 0.505 sec
    [junit] Testcase: testOutFileContent[2] took 0.858 sec
    [junit] Testcase: testRecentFileContent[2] took 0.722 sec
    [junit] Testcase: testDescWriteOutput[3] took 0.434 sec
    [junit] Testcase: testDescWriteRecent[3] took 0.43 sec
    [junit] Testcase: testOutFileContent[3] took 0.673 sec
    [junit] Testcase: testRecentFileContent[3] took 0.67 sec
    [junit] Testcase: testDescWriteOutput[4] took 1.336 sec
    [junit] Testcase: testDescWriteRecent[4] took 0.861 sec
    [junit] Testcase: testOutFileContent[4] took 1.484 sec
    [junit] Testcase: testRecentFileContent[4] took 1.431 sec
    [junit] Testcase: testDescWriteOutput[5] took 0.867 sec
    [junit] Testcase: testDescWriteRecent[5] took 0.8 sec
    [junit] Testcase: testOutFileContent[5] took 1.601 sec
    [junit] Testcase: testRecentFileContent[5] took 1.7 sec
    [junit] Testcase: testDescWriteOutput[6] took 0.19 sec
    [junit] 	FAILED
    [junit] data used: relay-descriptors/extra-info/2016/10/9/a/9a4b819baeeeb6952ba737b752471b8637e75a5c, relay-2016-10-02-08-05-00-extra-infos, resulting list: [/tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/c/a/ca86eb96d22d188bb574b6b329ab21e0d9243516, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/3/1/317586098443ed19b200417556a08ebc42133521, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/3/6/36691feb7cec6a9630b9ecd11a9b5dc61c147c5d, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/4/e/4ef90738e54a403b265120dcbab7b494e0c68d3b, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/8/2/82471deac7b251089a0878d29a228d4e323b823f, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/0/4/04219ada0be922fa7518d36b0d8e66afc55e8603, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/e/b/eb73b59951bc1b0403be81220fb75be464954c31, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/9/a/9a4b819baeeeb6952ba737b752471b8637e75a5c, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/6/a/6a36d4ac36447e645c91ed63633a09197b7ad97e] expected:<.../extra-info/2016/10/[9/a/9a4b819baeeeb6952ba737b752471b8637e75a5c]> but was:<.../extra-info/2016/10/[c/a/ca86eb96d22d188bb574b6b329ab21e0d9243516]>
    [junit] junit.framework.AssertionFailedError: data used: relay-descriptors/extra-info/2016/10/9/a/9a4b819baeeeb6952ba737b752471b8637e75a5c, relay-2016-10-02-08-05-00-extra-infos, resulting list: [/tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/c/a/ca86eb96d22d188bb574b6b329ab21e0d9243516, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/3/1/317586098443ed19b200417556a08ebc42133521, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/3/6/36691feb7cec6a9630b9ecd11a9b5dc61c147c5d, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/4/e/4ef90738e54a403b265120dcbab7b494e0c68d3b, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/8/2/82471deac7b251089a0878d29a228d4e323b823f, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/0/4/04219ada0be922fa7518d36b0d8e66afc55e8603, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/e/b/eb73b59951bc1b0403be81220fb75be464954c31, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/9/a/9a4b819baeeeb6952ba737b752471b8637e75a5c, /tmp/junit4173269913956546333/out/relay-descriptors/extra-info/2016/10/6/a/6a36d4ac36447e645c91ed63633a09197b7ad97e] expected:<.../extra-info/2016/10/[9/a/9a4b819baeeeb6952ba737b752471b8637e75a5c]> but was:<.../extra-info/2016/10/[c/a/ca86eb96d22d188bb574b6b329ab21e0d9243516]>
    [junit] 	at org.torproject.collector.sync.SyncPersistenceTest.testDescWriteOutput(SyncPersistenceTest.java:179)
    [junit] 
    [junit] Testcase: testDescWriteRecent[6] took 0.038 sec
    [junit] Testcase: testOutFileContent[6] took 0.048 sec
    [junit] Testcase: testRecentFileContent[6] took 0.092 sec
    [junit] Testcase: testDescWriteOutput[7] took 0.086 sec
    [junit] 	FAILED
    [junit] data used: bridge-descriptors/2016/10/extra-infos/9/f/9f88a7c2abe6665d204137ba8c2661d42e7c2829, bridge-2016-10-02-08-09-00-extra-infos, resulting list: [/tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/c/a/cab78ea0ffe9a7bc00527fef19f546c47d59f01a, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/3/4/3412a1dccd183a1c0bd1b748f34d88594be6ea52, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/3/1/31bcea576e77ba66150f7903b588c919adad849c, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/1/6/16d2b79fbd0a8567c6afd7585c775ac7745561e3, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/f/a/fabd8f614633ec2d2d405f2554e14381bc33d9cb, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/b/c/bcfcbb38b15e9b500b1a6e9b0bcbbce858660f17, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/e/1/e11c5239494bad2f6f3759f1104a2f6182beab4d, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/e/e/eee0dc51b9a0a71ba73610123b13cea212b5cf83, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/9/f/9f88a7c2abe6665d204137ba8c2661d42e7c2829, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/7/a/7a93ca1edc543e747f1157bc3a557890335311a4] expected:<...2016/10/extra-infos/[9/f/9f88a7c2abe6665d204137ba8c2661d42e7c2829]> but was:<...2016/10/extra-infos/[c/a/cab78ea0ffe9a7bc00527fef19f546c47d59f01a]>
    [junit] junit.framework.AssertionFailedError: data used: bridge-descriptors/2016/10/extra-infos/9/f/9f88a7c2abe6665d204137ba8c2661d42e7c2829, bridge-2016-10-02-08-09-00-extra-infos, resulting list: [/tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/c/a/cab78ea0ffe9a7bc00527fef19f546c47d59f01a, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/3/4/3412a1dccd183a1c0bd1b748f34d88594be6ea52, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/3/1/31bcea576e77ba66150f7903b588c919adad849c, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/1/6/16d2b79fbd0a8567c6afd7585c775ac7745561e3, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/f/a/fabd8f614633ec2d2d405f2554e14381bc33d9cb, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/b/c/bcfcbb38b15e9b500b1a6e9b0bcbbce858660f17, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/e/1/e11c5239494bad2f6f3759f1104a2f6182beab4d, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/e/e/eee0dc51b9a0a71ba73610123b13cea212b5cf83, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/9/f/9f88a7c2abe6665d204137ba8c2661d42e7c2829, /tmp/junit4453404715943167674/out/bridge-descriptors/2016/10/extra-infos/7/a/7a93ca1edc543e747f1157bc3a557890335311a4] expected:<...2016/10/extra-infos/[9/f/9f88a7c2abe6665d204137ba8c2661d42e7c2829]> but was:<...2016/10/extra-infos/[c/a/cab78ea0ffe9a7bc00527fef19f546c47d59f01a]>
    [junit] 	at org.torproject.collector.sync.SyncPersistenceTest.testDescWriteOutput(SyncPersistenceTest.java:179)
    [junit] 
    [junit] Testcase: testDescWriteRecent[7] took 0.027 sec
    [junit] Testcase: testOutFileContent[7] took 0.071 sec
    [junit] Testcase: testRecentFileContent[7] took 0.054 sec
    [junit] Testcase: testDescWriteOutput[8] took 0.07 sec
    [junit] 	FAILED
    [junit] data used: bridge-descriptors/2016/10/server-descriptors/a/8/a8a5509ad1393c8f36abd2d8f0de1bb751926872, bridge-2016-10-02-16-09-00-server-descriptors, resulting list: [/tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/c/3/c32acc52826b37e5319f1bac2f8812b33a772540, /tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/0/7/07d952e9020cb68a63d9156653a2e41af4af4d44, /tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/0/a/0a65c636a20631bd5deb2f10dc664d2032303c46, /tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/e/5/e5d46e88cb52d4fc0524398cfb7a6754394bc5e9, /tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/6/1/614414898ee133ee9bf6b10a9898cab518c5453e, /tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/7/8/784d6f9e82426072fbfc7a42f8f7133ba6ec1453, /tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/a/8/a8a5509ad1393c8f36abd2d8f0de1bb751926872, /tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/5/b/5b20316b03afc98a165a219044b6fa6ca34c58ab] expected:<.../server-descriptors/[a/8/a8a5509ad1393c8f36abd2d8f0de1bb751926872]> but was:<.../server-descriptors/[c/3/c32acc52826b37e5319f1bac2f8812b33a772540]>
    [junit] junit.framework.AssertionFailedError: data used: bridge-descriptors/2016/10/server-descriptors/a/8/a8a5509ad1393c8f36abd2d8f0de1bb751926872, bridge-2016-10-02-16-09-00-server-descriptors, resulting list: [/tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/c/3/c32acc52826b37e5319f1bac2f8812b33a772540, /tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/0/7/07d952e9020cb68a63d9156653a2e41af4af4d44, /tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/0/a/0a65c636a20631bd5deb2f10dc664d2032303c46, /tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/e/5/e5d46e88cb52d4fc0524398cfb7a6754394bc5e9, /tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/6/1/614414898ee133ee9bf6b10a9898cab518c5453e, /tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/7/8/784d6f9e82426072fbfc7a42f8f7133ba6ec1453, /tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/a/8/a8a5509ad1393c8f36abd2d8f0de1bb751926872, /tmp/junit6983894424845948870/out/bridge-descriptors/2016/10/server-descriptors/5/b/5b20316b03afc98a165a219044b6fa6ca34c58ab] expected:<.../server-descriptors/[a/8/a8a5509ad1393c8f36abd2d8f0de1bb751926872]> but was:<.../server-descriptors/[c/3/c32acc52826b37e5319f1bac2f8812b33a772540]>
    [junit] 	at org.torproject.collector.sync.SyncPersistenceTest.testDescWriteOutput(SyncPersistenceTest.java:179)
    [junit] 
    [junit] Testcase: testDescWriteRecent[8] took 0.025 sec
    [junit] Testcase: testOutFileContent[8] took 0.036 sec
    [junit] Testcase: testRecentFileContent[8] took 0.036 sec
    [junit] Testcase: testDescWriteOutput[9] took 0.192 sec
    [junit] Testcase: testDescWriteRecent[9] took 0.086 sec
    [junit] Testcase: testOutFileContent[9] took 0.1 sec
    [junit] Testcase: testRecentFileContent[9] took 0.109 sec

comment:68 Changed 2 years ago by iwakeh

I'll post a fix for the tests soon.

comment:69 Changed 2 years ago by iwakeh

Please review the extended tests here.

The reason was that at the beginning I simply relied on receiving the same list order and just verifying the first path in the out list. The breaking tests reminded me of that.
The expanded tests verify all paths created for 'out'.

(Actually, I did clone your branch anew (new folder even) before comment:62, but only now received the commits that broke the tests. Slow update? Hmm.)

comment:70 Changed 2 years ago by karsten

Great, that fixes the tests. Cherry-picked, amended to remove unused import, pushed to my task-18910-4 branch.

Next steps: test some more on a server, then squash commits, then merge to master?

comment:71 Changed 2 years ago by karsten

Please find branch task-18910-5 in my repository which contains the exact same code as task-18910-4 but with squashed commits.

comment:72 Changed 2 years ago by iwakeh

Thanks!
Could you rebase on master to also catch the changes for #20162 and other merged things for testing and operator manual review?

comment:73 Changed 2 years ago by iwakeh

Here is a rebased branch 5, which can be used for testing.

comment:74 Changed 2 years ago by karsten

Sorry, just wanted to look into this, but you were faster. :)

comment:75 Changed 2 years ago by iwakeh

There's an additional commit on top of my rebased branch replacing a character that made SanitizedBridgesWriterTest fail (in an en_US.UTF-8 environment).

comment:76 in reply to:  75 Changed 2 years ago by karsten

Replying to iwakeh:

There's an additional commit on top of my rebased branch replacing a character that made SanitizedBridgesWriterTest fail (in an en_US.UTF-8 environment).

Good catch!

How's the testing going? Should I merge to master?

comment:77 Changed 2 years ago by iwakeh

All is working fine on the mirror :-)

And, the availability of descriptors already increased tremendously even though sync has been running less than three days yet.
I'll add a wiki page with numbers and findings later.

comment:78 Changed 2 years ago by karsten

Cherry-picked all new commits from your branch and pushed to master. Yay!

comment:79 Changed 2 years ago by karsten

First bug report: configuration property "ExitlistSources" is disregarded. I believe we're looking at the non-existent property "ExitlistsSources" (plural).

comment:80 Changed 2 years ago by karsten

Second bug report: I believe that we're disregarding StandardOpenOption.CREATE_NEW in PersistenceUtils.storeToFileSystem() since the switch to temporary files. We're checking that the temporary file does not exist, writing to it, and later renaming to the original file, thus replacing it. What we should really do, when storing a file with StandardOpenOption.CREATE_NEW, is check whether the file exists and return false.

comment:81 in reply to:  79 ; Changed 2 years ago by iwakeh

Replying to karsten:

First bug report: configuration property "ExitlistSources" is disregarded. I believe we're looking at the non-existent property "ExitlistsSources" (plural).

Good catch! I'll work on a patch and a test preventing this in future.

comment:82 Changed 2 years ago by karsten

Here's another one. The path for microdescriptor consensuses is wrong, which leads to us not syncing any of those:

diff --git a/src/main/java/org/torproject/collector/relaydescs/ArchiveWriter.java b/src/main/java/org/torproject/collector/relaydescs/ArchiveWriter.java
index 30b6569..20d0663 100644
--- a/src/main/java/org/torproject/collector/relaydescs/ArchiveWriter.java
+++ b/src/main/java/org/torproject/collector/relaydescs/ArchiveWriter.java
@@ -99,7 +99,7 @@ public class ArchiveWriter extends CollecTorMain {
     this.mapPathDescriptors.put("recent/relay-descriptors/consensuses",
         RelayNetworkStatusConsensus.class);
     this.mapPathDescriptors.put(
-        "relay-descriptors/microdescs/consensus-microdesc",
+        "recent/relay-descriptors/microdescs/consensus-microdesc",
         RelayNetworkStatusConsensus.class);
     this.mapPathDescriptors.put("recent/relay-descriptors/server-descriptors",
         RelayServerDescriptor.class);

Related to this, I believe that we'd use sync/sync-history-collector.torproject.org-Relay-RelayNetworkStatusConsensus for the parse history, which would be the exact same file as for unflavored consensuses. We should use a different parse history file here.

comment:83 in reply to:  81 ; Changed 2 years ago by iwakeh

Replying to iwakeh:

Replying to karsten:

First bug report: configuration property "ExitlistSources" is disregarded. I believe we're looking at the non-existent property "ExitlistsSources" (plural).

Good catch! I'll work on a patch and a test preventing this in future.

Here is the branch with a patch and test. Please review.

comment:84 in reply to:  83 Changed 2 years ago by karsten

Replying to iwakeh:

Replying to iwakeh:

Replying to karsten:

First bug report: configuration property "ExitlistSources" is disregarded. I believe we're looking at the non-existent property "ExitlistsSources" (plural).

Good catch! I'll work on a patch and a test preventing this in future.

Here is the branch with a patch and test. Please review.

Looks good. Merged to master!

comment:85 Changed 2 years ago by iwakeh

Please find four more commits on the above branch addressing

  • the microcons-path
  • the sync history files
  • and the storage.

For the latter I added a PersistenceUtils test class and corrected the behavior you described in an earlier comment:80 (will the comment count reach 100 here ;-)

Thanks for looking so carefully through all the code!

comment:86 Changed 2 years ago by iwakeh

Addendum to the hist-file topic:
The added string from the path is just a work-around. There should be different interfaces for consensus flavors (#17861 and #19640) before there are too many flavors.

comment:87 in reply to:  85 Changed 2 years ago by karsten

Replying to iwakeh:

Please find four more commits on the above branch addressing

  • the microcons-path
  • the sync history files
  • and the storage.

Looks good, merged to master!

For the latter I added a PersistenceUtils test class and corrected the behavior you described in an earlier comment:80 (will the comment count reach 100 here ;-)

Thanks for looking so carefully through all the code!

Sure! I'll start a new test and let it run tonight.

comment:88 Changed 2 years ago by karsten

Here's a fresh bug report: synchronizing relay descriptors takes a few hours on average on my laptop, and I think I know why. Whenever we append to an existing file in recent/, we copy the entire file content to a temporary file, append the new content there, and then rename. And we do that for every single server descriptor or extra-info descriptor we receive. That is bad for two reasons. First reason is that it obviously doesn't scale, with some sync runs taking 4 hours or longer. Second reason is that files in recent/ shouldn't change after they're included in index.json (with the inglorious exception of Torperf measurements), which is why we should append all descriptors to a temporary file and rename when we're sure we're done. See also ArchiveWriter.cleanUpRsyncDirectory() where we're renaming temporary files to destination files at the very end of the update run.

comment:89 Changed 2 years ago by iwakeh

  1. Valid points about file-io. My local runs identified the network as the bottleneck rather than file-io when doing an initial sync to an empty CollecTor instance. Subsequent runs were way shorter. The mirror as a running Collector instance needed only 20 min on the first sync-run and now way less (3 to 1 min). But anyway, it's true that some of the copying could and should be reduced.
  1. Ideally index.json should be a true picture of 'recent', but actually, it'll always only be a snapshot, even if it's updated with each change, b/c then the syncing instance cannot update index.json continuously. So, CollecTor's sync should accommodate the possible differences, which it does currently, I think.

How to proceed?
Do you think this is a halt to the release?
I think it can be released as is, because the current set-up increases descriptor availability a lot and is tested.
I'm wary of tuning it now without a release delay. And, regarding both writing and parsing there are duplicate and trip-licate implementations in the code-base, which should be streamlined and can be tuned in that process.

I'd suggest to release and have new tickets (which will be part of the other tickets for planning the streamlining, modularization, and other improvements):

  1. streamline writing all over the code-base with an emphasis on reducing file-io for CollecTor;
  2. Make index.json as close to the current state as necessary and feasible, which includes pondering about how accurate it should be with the given use-cases. Maybe have a clean-up module before index-run.

Is that an ok plan?

comment:90 Changed 2 years ago by karsten

Well, the sync runs don't take just 3 or 20 minutes here, but many hours. I could imagine that it's the backup daemon trying to capture all file system changes for the next backup, or something. I could imagine that similar things can happen on servers. The issue is that the update run cannot start if the sync run keeps running, and that means we'll lose data. I'm afraid we can't ship this yet.

I don't understand your reasoning about index.json not being a true picture of recent/. We're skipping *.tmp files when creating that file, and we always append to .tmp and only rename to the destination file when we're sure the file won't change anymore. Where does that get inaccurate?

I'd rather want to change the sync code to do the same as the update code. I can give that a try if you don't want to touch that code anymore.

By the way, we'll need to merge #20380 before putting out the release. And I'd want to start a test run over night before releasing, so the release cannot happen today anyway. :(

comment:91 in reply to:  90 ; Changed 2 years ago by iwakeh

Replying to karsten:

Well, the sync runs don't take just 3 or 20 minutes here, but many hours. I could imagine that it's the backup daemon trying to capture all file system changes for the next backup, or something. I could imagine that similar things can happen on servers.

Well, please investigate, if the backup is the reason. A backup shouldn't hamper the productive system.

Just to make sure we're talking about the same use case:

  • A fresh installation without previous data is not what sync is for. Here the archived data of the last three days should be provided and one or two regular download runs. After that, sync can be turned on.
  • A running instance like a mirror can be enhanced with the sync.


I don't understand your reasoning about index.json not being a true picture of recent/. We're skipping *.tmp files when creating that file, and we always append to .tmp and only rename to the destination file when we're sure the file won't change anymore. Where does that get inaccurate?

Now I see we're talking about different accuracies:

You're referring to the single file assembled in one download (or hopefully soon sync-run). Thus, the index.json of a syncing instance could become inaccurate in this case.

I'm referring to the accuracy lost by regular operation after creation or download of index.json. For example ticket #20430, the syncing instance retrieves index.json, shortly after that the main instance has a clean-up run, and thus files listed in index.json don't exist anymore.

...

By the way, we'll need to merge #20380 before putting out the release. And I'd want to start a test run over night before releasing, so the release cannot happen today anyway. :(

That is vital information.
When postponing the release there is time for changing the code and testing, that's what I said before.
I'll take a look.

comment:92 in reply to:  91 Changed 2 years ago by karsten

Replying to iwakeh:

Replying to karsten:

Well, the sync runs don't take just 3 or 20 minutes here, but many hours. I could imagine that it's the backup daemon trying to capture all file system changes for the next backup, or something. I could imagine that similar things can happen on servers.

Well, please investigate, if the backup is the reason. A backup shouldn't hamper the productive system.

This was just a guess. I didn't further investigate after finding out what the code was doing. And even if backup daemons were not the reason for the slowness here, we shouldn't ship this code, now that we know about its inefficiency.

Just to make sure we're talking about the same use case:

  • A fresh installation without previous data is not what sync is for. Here the archived data of the last three days should be provided and one or two regular download runs. After that, sync can be turned on.
  • A running instance like a mirror can be enhanced with the sync.

Why would we discourage turning on sync right from the start? In this case, it was not the first run that was slow, but later sync runs were even slower.

I don't understand your reasoning about index.json not being a true picture of recent/. We're skipping *.tmp files when creating that file, and we always append to .tmp and only rename to the destination file when we're sure the file won't change anymore. Where does that get inaccurate?

Now I see we're talking about different accuracies:

You're referring to the single file assembled in one download (or hopefully soon sync-run). Thus, the index.json of a syncing instance could become inaccurate in this case.

Yes, that's what I was referring to.

I'm referring to the accuracy lost by regular operation after creation or download of index.json. For example ticket #20430, the syncing instance retrieves index.json, shortly after that the main instance has a clean-up run, and thus files listed in index.json don't exist anymore.

True, that's unrelated to what I meant above.

By the way, we'll need to merge #20380 before putting out the release. And I'd want to start a test run over night before releasing, so the release cannot happen today anyway. :(

That is vital information.
When postponing the release there is time for changing the code and testing, that's what I said before.
I'll take a look.

If you have some code for me today, I'll run it over night, and maybe we can release tomorrow! :)

comment:93 Changed 2 years ago by iwakeh

Please find another commit on this branch.

comment:94 in reply to:  93 Changed 2 years ago by karsten

Replying to iwakeh:

Please find another commit on this branch.

Looks good, merged to master and starting a new test run now. Thanks!

comment:95 Changed 2 years ago by karsten

New bug, though I'm not certain yet where to fix it: the DescriptorIndexCollector instance of one module deletes the synchronized descriptors stored by the instance of another module. Look at the following (new) code:

    DescriptorCollector dc =
        DescriptorSourceFactory.createDescriptorCollector();
    dc.collectDescriptors("https://collector.torproject.org",
        new String[] { "/recent/torperf/" }, 0L, new File("sync"), true);
    dc.collectDescriptors("https://collector.torproject.org",
        new String[] { "/recent/exit-lists/" }, 0L, new File("sync"), true);

What would you expect that code to do? Populate two local directories sync/recent/torperf/ and sync/recent/exit-lists/? Not quite, the last line clears all descriptors in sync/recent/torperf/. Uhmmmmm. And I'm not even sure what DescriptorCollectorImpl would do, possibly the same thing.

We can either work around this in CollecTor by using separate subdirectories for the modules, like sync/Relay/collector.torproject.org/recent/relay-descriptors/..., or we can decide this is a (usability) bug in metrics-lib, put out a 1.5.1, and switch.

comment:96 Changed 2 years ago by iwakeh

Please review this commit.
This changes the top-level name to marker and host, i.e. 'Relay-collector.torproject.org/...'

Last edited 2 years ago by iwakeh (previous) (diff)

comment:97 in reply to:  96 ; Changed 2 years ago by karsten

Replying to iwakeh:

Please review this commit.
This changes the top-level name to marker and host, i.e. 'Relay-collector.torproject.org/...'

Looks good, merged, and resumed testing.

comment:98 in reply to:  97 Changed 2 years ago by karsten

Replying to karsten:

Replying to iwakeh:

Please review this commit.
This changes the top-level name to marker and host, i.e. 'Relay-collector.torproject.org/...'

Looks good, merged, and resumed testing.

FYI, pushed another commit as fixup of that last commit. (The path had to be changed in two places.) Testing more over night.

comment:99 Changed 2 years ago by iwakeh

Thanks for fixing this!

I added #20489 in order to also create tests for these issues and any others that do not have tests yet. The path-stuff will surely be touched again in future and then we'll be able catch problems in the unit-tests.

comment:100 in reply to:  99 Changed 2 years ago by karsten

Replying to iwakeh:

Thanks for fixing this!

Sure. Also, tests looked good, I'd say we're ready to release as soon as #20380 is in.

I added #20489 in order to also create tests for these issues and any others that do not have tests yet. The path-stuff will surely be touched again in future and then we'll be able catch problems in the unit-tests.

Sounds good, thanks!

comment:101 Changed 2 years ago by iwakeh

With the 101st comment:

Closing.

Thanks!

comment:102 Changed 2 years ago by iwakeh

Resolution: implemented
Status: needs_reviewclosed
Note: See TracTickets for help on using tickets.