All storing of files should be done by the persist-mechanism, i.e.,
a o.t.c.persist.*Persistence classes should also be used for storing descriptors initially (i.e., when fetched from authorities).
Goal:
We want to identify a common approach for storing descriptors during sync-runs and imports. The various modules should diverge from it as little as possible. The resulting changes (once the approach is defined) will be implemented in the various modules and the implementation of these changes will make use of the persistence classes in order to avoid code multiplication for file storing/writing/renaming/etc.
Currently, there is quite a variety of handling storage that is in many cases historically grown but not due necessity (cf. comment:1).
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
Here's an overview (cf. #21759 (moved) comment:12 following):
bridge-desc (all types): after sanitation the descriptor is written; if one descriptor cannot be sanitized, it is skipped
relay-desc (all types): descriptors written one by one skipping problematic ones
exitlists: always stored as a single file.
onionperf: currently implemented using an implicit transaction, i.e., all descriptors in one downloaded descriptor file are only stored, if all were valid. This is different from the sync-approach where invalid/unparseable descriptors are ignored, but valid ones stored no matter if they came in one file.
Setting to high as this discussion halts simplification of most modules and it could also impact #22428 (moved).
We need to decide which persistence approach to use for all modules (see comment:1).
Here's an overview (cf. #21759 (moved) comment:12 following):
Some quick thoughts:
bridge-desc (all types): after sanitation the descriptor is written; if one descriptor cannot be sanitized, it is skipped
Sounds reasonable. The decision whether to skip a descriptor or not should remain outside of the persistence module. We could easily remove synchronization functionality, because a CollecTor instance either sanitizes bridge descriptors or copies them over from another instance, but not both. But I don't mind keeping it as long as it doesn't get in the way by making designs more complex than they have to be.
relay-desc (all types): descriptors written one by one skipping problematic ones
I wonder if we should take out the part where we're skipping problematic descriptors, so that we handle descriptors coming from directory authorities and from other CollecTor instances the same. We only need basic things like publication time, descriptor digest, etc. to determine file names. But maybe it shouldn't be on us to decide about rejecting a descriptor. Needs discussion.
exitlists: always stored as a single file.
Yup. Nothing special here, I think.
onionperf: currently implemented using an implicit transaction, i.e., all descriptors in one downloaded descriptor file are only stored, if all were valid. This is different from the sync-approach where invalid/unparseable descriptors are ignored, but valid ones stored no matter if they came in one file.
The implicit transaction thing was a mistake that we should fix. But similar to relaydescs, we should think about accepting all measurements containing just enough data to put them into the right directories. We can still warn about validation errors, just like we do when downloading relaydescs, but we'd store them anyway.
Regarding webstats: Maybe we can take a similar approach like we have for bridgedescs where the sanitizer decides what files or lines go through and the persistence layer just stores what it gets as long as it has the necessary metadata. We don't really need synchronization here, for the same reason as we don't need it for bridgedescs.
I added some text to the description of this ticket, because it was a little short. Please, refer to the description before continuing.
(I tried to take your comment:6 into account in the following.)
What about the following general rules for all modules:
During descriptor imports:
After parsing (and possibly sanitation) the descriptor is written; if one descriptor cannot be parsed (and possibly sanitized), it is skipped with a debug log message, unless it contains at least enough information to sort it into out&recent (for relays and bridges and onionperf and ?).
During sync-runs:
Descriptors that cannot be parsed should be dropped. Or, do we want to sync invalid descriptors?
Imo, syncs should only take valid stuff and there should be sync-runs for all descriptors (just not necessarily enabled on every instance).
All storing of files should be done by the persist-mechanism, i.e.,
a o.t.c.persist.*Persistence classes should also be used for storing descriptors initially (i.e., when fetched from authorities).
All storing of files should be done by the persist-mechanism, i.e.,
a o.t.c.persist.*Persistence classes should also be used for storing descriptors initially (i.e., when fetched from authorities).
Goal:
We want to identify a common approach for storing descriptors during sync-runs and imports. The various modules should diverge from it as little as possible. The resulting changes (once the approach is defined) will be implemented in the various modules and the implementation of these changes will make use of the persistence classes in order to avoid code multiplication for file storing/writing/renaming/etc.
Currently, there is quite a variety of handling storage that is in many cases historically grown but not due necessity (cf. comment:1).
Let's assume for a moment that the answer is: yes.
Right now, the main purpose of having synchronization is to make relay descriptor collection more robust by having two separate collectors and an easy way to synchronize missing descriptors from one to the other.
In a way, synchronization is yet another descriptor source, just like reading from cached-descriptors files or downloading from directory authorities.
So, maybe we can turn the question around: do we want to handle descriptors coming from one source differently than those coming from another source?
If we put the bar higher, I'm worried that we're missing data.
Imagine that the main instance had a temporary failure and didn't fetch a descriptor, but the backup instance fetched it, found it invalid, and stored it anyway. If the main instance discards that descriptor during synchronization we'll never add it to the archive. And what if we only thought it's invalid because of an implementation bug?
To be honest, I'm only 90% certain that we should treat descriptors from those different sources the same, and I'm curious what you think based on my comment here.
Regardless of this specific aspect, I'm all for simplifying things! Thanks for pushing this forward!
The thought that invalid descriptors are mainly due to CollecTor's parsing mechanism not recognizing them as valid is a good point in favor of storing and syncing invalid descriptors.
There might be invalid descriptors - mangled or not complying to the spec - but even these will be useful for analysis and troubleshooting.
As we only sync between highly trusted instances the possibility of maliciously malformed descriptors can be ruled out (well, if that happens there is another bigger problem to deal with).
So, given that syncing only takes place between trusted instances and data loss is the main evil to prevent the sync&store-all approach is fine:
Only during import of sensitive data descriptors that cannot be sanitized are skipped, other than that all descriptors are stored.
Possible next steps (if we agree on the above):
Make webstats module use the above approach from the beginning, if it seems easier, also immediately change the over all sync-process.
Unless the change was made for all in step one, make the entire sync-process keep all descriptors.
Change and adapt all other CollecTor modules accordingly using persistence classes throughout.
The thought that invalid descriptors are mainly due to CollecTor's parsing mechanism not recognizing them as valid is a good point in favor of storing and syncing invalid descriptors.
There might be invalid descriptors - mangled or not complying to the spec - but even these will be useful for analysis and troubleshooting.
As we only sync between highly trusted instances the possibility of maliciously malformed descriptors can be ruled out (well, if that happens there is another bigger problem to deal with).
So, given that syncing only takes place between trusted instances and data loss is the main evil to prevent the sync&store-all approach is fine:
Only during import of sensitive data descriptors that cannot be sanitized are skipped, other than that all descriptors are stored.
Makes sense to me.
Possible next steps (if we agree on the above):
Make webstats module use the above approach from the beginning, if it seems easier, also immediately change the over all sync-process.
Unless the change was made for all in step one, make the entire sync-process keep all descriptors.
Change and adapt all other CollecTor modules accordingly using persistence classes throughout.
How about we don't enable syncing for the new webstats module at all? Let's face it, it's not a use case we're planning to support, so why should we write or keep the necessary code to do it?
And going even one step further (out of scope for this ticket), how about we disable syncing for all other modules except for the relaydescs module where we turn it into yet another data source like downloading from the authorities or reading from cached descriptors files? We could still keep the code in a form that we can add re-use it in other modules in the future, but only as long as that doesn't make the overall code more complex than it has to be. But: new ticket. Just writing this here to discuss the general direction.
...
How about we don't enable syncing for the new webstats module at all? Let's face it, it's not a use case we're planning to support, so why should we write or keep the necessary code to do it?
I think CollecTor should provide syncing for all modules. This might not be necessary for the operation of tp.o instances, but is a valid use case for maybe research or other non-TorMetrics uses.
The sync-functionality is designed in a way that it hardly needs any code to add a new module, integrating a module to sync is very easy and cheap in terms of development time.
And going even one step further (out of scope for this ticket), how about we disable syncing for all other modules except for the relaydescs module where we turn it into yet another data source like downloading from the authorities or reading from cached descriptors files? We could still keep the code in a form that we can add re-use it in other modules in the future, but only as long as that doesn't make the overall code more complex than it has to be. But: new ticket. Just writing this here to discuss the general direction.
Again, adding sync is hardly any coding. Making sync available for all descriptor types is very easy and a benefit in general. Whether it should be turned on for the various tp.o CollecTor instances is a different question.
This discussion should probably take place elsewhere. Some quick thoughts anyway:
I'd want us to integrate synchronization closer into the relaydescs module as it is possible while keeping it general-purpose.
For example, I'd like to do it before downloading missing descriptors from directory authorities, because it might save us a few requests there.
Further, right now the sync run produces its own file in the recent/ directory, but ideally there should only be a single such file per execution. That's mostly a cosmetic issue, though. The previous one is more serious.
Researchers or other non-Tor-Metrics users don't need synchronization, they need download functionality. They could simply use DescriptorCollector, which performs basic synchronization functionality like skipping files that exist remotely and deleting files that don't exist remotely anymore.
Even if we don't have to add much code now, we might be keeping a lot of code that we don't need, and that produces maintenance effort. This doesn't make it super urgent to get rid of that code. But it's worth thinking about removing that code in the future if we don't need it.
If we need to think about what to do with synchronization whenever we add or change a module, that creates overhead, too.
We should keep code if we believe that it either has a benefit now or will be useful in the future. Let's think about possible uses of synchronization outside of the relaydescs module (where it has turned out to be tremendously useful!). I don't see how we're using it elsewhere. I could be overlooking it. But if it turns out to be very unlikely that we're deploying synchronization for another module in the next 24 months, let's make plans to remove it and integrate it further into the relaydescs module.
However, I think we agree on the actual topic which is persistence. So, let's move forward with that and table the (somewhat related) synchronization discussion.