Opened 2 years ago

Last modified 16 months ago

#23421 assigned enhancement

Use persistence functionality throughout all modules — at Version 7

Reported by: iwakeh Owned by: metrics-team
Priority: Medium Milestone:
Component: Metrics/CollecTor Version:
Severity: Normal Keywords: metrics-2018
Cc: metrics-team Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by iwakeh)

As discussed in #21272 comment:18 and before.

All storing of files should be done by the persist-mechanism, i.e.,
a o.t.c.persist.*Persistence classes should also be used for storing descriptors initially (i.e., when fetched from authorities).

Goal:
We want to identify a common approach for storing descriptors during sync-runs and imports. The various modules should diverge from it as little as possible. The resulting changes (once the approach is defined) will be implemented in the various modules and the implementation of these changes will make use of the persistence classes in order to avoid code multiplication for file storing/writing/renaming/etc.
Currently, there is quite a variety of handling storage that is in many cases historically grown but not due necessity (cf. comment:1).

Child Tickets

Change History (7)

comment:1 Changed 2 years ago by iwakeh

Status: newneeds_information

Here's an overview (cf. #21759 comment:12 following):

  • bridge-desc (all types): after sanitation the descriptor is written; if one descriptor cannot be sanitized, it is skipped
  • relay-desc (all types): descriptors written one by one skipping problematic ones
  • exitlists: always stored as a single file.
  • onionperf: currently implemented using an implicit transaction, i.e., all descriptors in one downloaded descriptor file are only stored, if all were valid. This is different from the sync-approach where invalid/unparseable descriptors are ignored, but valid ones stored no matter if they came in one file.
Last edited 2 years ago by iwakeh (previous) (diff)

comment:2 Changed 2 years ago by karsten

Summary: use persistence functionality throughout all CollecTor modulesUse persistence functionality throughout all modules

Tweak summary.

comment:3 Changed 2 years ago by karsten

Keywords: metrics-2018 added

comment:4 Changed 2 years ago by karsten

Keywords: metrics-2017 added; metrics-2018 removed

comment:5 Changed 2 years ago by iwakeh

Priority: MediumHigh

Setting to high as this discussion halts simplification of most modules and it could also impact #22428.
We need to decide which persistence approach to use for all modules (see comment:1).

comment:6 in reply to:  1 Changed 23 months ago by karsten

Replying to iwakeh:

Here's an overview (cf. #21759 comment:12 following):

Some quick thoughts:

  • bridge-desc (all types): after sanitation the descriptor is written; if one descriptor cannot be sanitized, it is skipped

Sounds reasonable. The decision whether to skip a descriptor or not should remain outside of the persistence module. We could easily remove synchronization functionality, because a CollecTor instance either sanitizes bridge descriptors or copies them over from another instance, but not both. But I don't mind keeping it as long as it doesn't get in the way by making designs more complex than they have to be.

  • relay-desc (all types): descriptors written one by one skipping problematic ones

I wonder if we should take out the part where we're skipping problematic descriptors, so that we handle descriptors coming from directory authorities and from other CollecTor instances the same. We only need basic things like publication time, descriptor digest, etc. to determine file names. But maybe it shouldn't be on us to decide about rejecting a descriptor. Needs discussion.

  • exitlists: always stored as a single file.

Yup. Nothing special here, I think.

  • onionperf: currently implemented using an implicit transaction, i.e., all descriptors in one downloaded descriptor file are only stored, if all were valid. This is different from the sync-approach where invalid/unparseable descriptors are ignored, but valid ones stored no matter if they came in one file.

The implicit transaction thing was a mistake that we should fix. But similar to relaydescs, we should think about accepting all measurements containing just enough data to put them into the right directories. We can still warn about validation errors, just like we do when downloading relaydescs, but we'd store them anyway.

Regarding webstats: Maybe we can take a similar approach like we have for bridgedescs where the sanitizer decides what files or lines go through and the persistence layer just stores what it gets as long as it has the necessary metadata. We don't really need synchronization here, for the same reason as we don't need it for bridgedescs.

Is this the information you were looking for?

comment:7 Changed 23 months ago by iwakeh

Description: modified (diff)

I added some text to the description of this ticket, because it was a little short. Please, refer to the description before continuing.

(I tried to take your comment:6 into account in the following.)

What about the following general rules for all modules:

During descriptor imports:
After parsing (and possibly sanitation) the descriptor is written; if one descriptor cannot be parsed (and possibly sanitized), it is skipped with a debug log message, unless it contains at least enough information to sort it into out&recent (for relays and bridges and onionperf and ?).

During sync-runs:
Descriptors that cannot be parsed should be dropped. Or, do we want to sync invalid descriptors?
Imo, syncs should only take valid stuff and there should be sync-runs for all descriptors (just not necessarily enabled on every instance).

Note: See TracTickets for help on using tickets.