Opened 2 years ago

Last modified 9 months ago

#20228 assigned enhancement

Append all votes with same valid-after time to a single file in `recent/`

Reported by: karsten
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/CollecTor
Version:
Severity: Normal
Keywords:
Cc: iwakeh
Actual Points:
Parent ID: #20519
Points:
Reviewer:
Sponsor:

Description

We're currently creating a new file per vote in recent/relay-descriptors/votes/, which might be excessive. We could easily append all votes with the same valid-after time to a single file there, so instead of:

2016-09-23-14-00-00-vote-EFCBE720AB3A82B99F9E953CD5BF50F7EEFC7B97-927994982CFB4E2F24D22D0B74D693574EC04DE5
2016-09-23-14-00-00-vote-ED03BB616EB2F60BEC80151114BB25CEF515B226-915DB92F8614D94EB6390621EF4ADD65510A6AB7
2016-09-23-14-00-00-vote-D586D18309DED4CD6D57C18FDB97EFA96D330566-C77126E7595C15F242DF13F740528526E71CC063
2016-09-23-14-00-00-vote-49015F787433103580E3B66A1707A00E60F2D15B-1307B591CA002EB4FD55E5B183F8D757A64F0963
2016-09-23-14-00-00-vote-23D15D965BC35114467363C165C4F724B64B4F66-9489196D1A9647F7A0B6CEDD3E48C7CFAECC57F0
2016-09-23-14-00-00-vote-14C131DFC5C6F93646BE72FA1401C02A8DF2E8B4-8C8DEA53F89447781098CFCDA9A59ED8B6987C96
2016-09-23-14-00-00-vote-0232AF901C31A04EE9848595AF9BB7620D4C5B2E-8E11EBEEC56D1DBECC9105192F0036292B35A721

we'd just provide a single file:

2016-09-23-14-00-00-votes

just like we're providing a single file for the consensus:

2016-09-23-14-00-00-consensus

I just looked at the index.json we provide, and of the 1825 files in recent/, 504 are votes (28%). My rough estimate is that we'd cut down the size of index.json to 75% of its current size.
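A rough way to check that estimate is sketched below. It assumes that the size of index.json scales with the number of listed files and that there are seven votes per valid-after time, as in the listing above; the function names are hypothetical and not part of CollecTor.

```python
from collections import defaultdict


def group_votes_by_valid_after(file_names):
    """Group vote file names by their valid-after prefix (e.g. 2016-09-23-14-00-00)."""
    groups = defaultdict(list)
    for name in file_names:
        if "-vote-" in name:
            valid_after = name.split("-vote-")[0]
            groups[valid_after].append(name)
    return dict(groups)


def merged_file_name(valid_after):
    """Name of the single file that would replace all votes of one valid-after time."""
    return f"{valid_after}-votes"


def estimated_index_ratio(total_files, vote_files, votes_per_valid_after):
    """Fraction of index entries left after merging votes (assumes size ~ entry count)."""
    merged_files = -(-vote_files // votes_per_valid_after)  # ceiling division
    return (total_files - vote_files + merged_files) / total_files


ratio = estimated_index_ratio(total_files=1825, vote_files=504, votes_per_valid_after=7)
print(f"{ratio:.0%}")
```

With the numbers above, 504 vote files collapse into 72 merged files, which leaves about 76% of the entries and is in line with the rough estimate.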

The question is whether anybody downloads these votes (manually) and relies on them being contained in separate files.

Note that this change would not affect how votes are stored in tarballs. They can stay in separate files there.

Change History (15)

comment:1 Changed 2 years ago by iwakeh

This is a very useful suggestion!
And, we should make it part of the first protocol version (#20234).
A timely notice about that change should be ok?

comment:2 Changed 2 years ago by karsten

Hmm, here's something we still need to decide: it may happen that we learn about a vote out of band. For example, we may only receive six votes for a given valid-after time and learn about a seventh vote in the next run when synchronizing from other CollecTor instances. We could do two things in that situation: store votes under the time when we stored them (e.g., 2016-09-23-15-05-03-votes), or append to an existing votes file and hence modify its size and last-modified time (e.g., 2016-09-23-15-00-00-votes). We're doing the second variant for server descriptors and other descriptors where we're appending descriptors. That's also my favorite solution, unless there are benefits of the first variant (or a third variant) that I didn't see. Hmm.
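To make the two variants concrete, here is a minimal sketch of how the target file name could be derived in each case. The function names and the timestamp format constant are assumptions for illustration, not existing CollecTor code.

```python
from datetime import datetime

# Assumed timestamp layout, matching the file names quoted in this ticket.
FMT = "%Y-%m-%d-%H-%M-%S"


def votes_file_by_storage_time(stored_at):
    """Variant 1: name the file after the time the vote was stored."""
    return stored_at.strftime(FMT) + "-votes"


def votes_file_by_valid_after(valid_after):
    """Variant 2: append to the file named after the vote's valid-after time."""
    return valid_after.strftime(FMT) + "-votes"


# A vote that arrives late via sync, using the example times from this comment.
late_vote_valid_after = datetime(2016, 9, 23, 15, 0, 0)
late_vote_stored_at = datetime(2016, 9, 23, 15, 5, 3)

print(votes_file_by_storage_time(late_vote_stored_at))   # new file per storage time
print(votes_file_by_valid_after(late_vote_valid_after))  # existing file grows
```

In the first variant a late vote never changes an existing file. In the second variant the existing file's size and last-modified time change whenever a late vote is appended.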

Regarding the notice, how about we send something to tor-dev@ and give two weeks' time (because everyone's busy with the meeting)? Or would that delay other things too much?

comment:3 Changed 2 years ago by iwakeh

  • Regarding grouping by download vs. published time which came up in #20234, too. Let's have the discussion for all descriptors here, if this is ok?
    1. Grouping by published time brings more data consistency between CollecTor instances, as their download times for the same descriptors surely differ often.
    2. Grouping by download time means keeping track of a data item, i.e. download time, that so far is not part of the Tor protocol. Why introduce it for descriptors that provide a published time? Which is the download time after syncing descriptors: the initial download by the supplying CollecTor or the sync-download-time by the receiving one?
    3. Regarding #20234:comment:5: Clients might not be interested in past or future (according to published time) descriptors and just download the file they consider current, if it changed since their last visit.
  • Regarding the notice: I think the two week time frame is fine.

comment:4 in reply to:  3 Changed 2 years ago by karsten

Replying to iwakeh:

  • Regarding grouping by download vs. published time which came up in #20234, too. Let's have the discussion for all descriptors here, if this is ok?
    1. Grouping by published time brings more data consistency between CollecTor instances, as their download times for the same descriptors surely differ often.

Agreed, I guess we can assume that files in the recent/ directories might differ between CollecTor instances. But is that important, as long as the set of contained descriptors with publication time in the past, say, 60 hours is 99.9% the same? I mean, it's still possible and very likely that files by publication hour would contain descriptors in different orders. Do we care?

  2. Grouping by download time means keeping track of a data item, i.e. download time, that so far is not part of the Tor protocol. Why introduce it for descriptors that provide a published time? Which is the download time after syncing descriptors: the initial download by the supplying CollecTor or the sync-download-time by the receiving one?

Right now, a CollecTor instance records the timestamp when starting to download and uses that as file name for the descriptors file where it appends all descriptors it learns about in that run. That would include descriptors found via initial download or via synchronization from other instances. And 72 hours later, when the file gets deleted, the download time will not be relevant anymore.
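As a rough illustration of that behavior (not the actual CollecTor code), the sketch below names a file after the start time of an update run and finds files older than 72 hours. The helper names and the `server-descriptors` suffix are assumptions.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d-%H-%M-%S"        # assumed timestamp layout in file names
RETENTION = timedelta(hours=72)  # files in recent/ are kept for 72 hours


def run_file_name(run_started, kind):
    """File that collects all descriptors of one kind learned in one update run."""
    return f"{run_started.strftime(FMT)}-{kind}"


def expired_files(file_names, now):
    """Return the files whose run start time lies more than 72 hours in the past."""
    expired = []
    for name in file_names:
        started = datetime.strptime(name[:19], FMT)  # the timestamp is 19 characters
        if now - started > RETENTION:
            expired.append(name)
    return expired


name = run_file_name(datetime(2016, 9, 23, 14, 5, 3), "server-descriptors")
print(name)
print(expired_files([name], now=datetime(2016, 9, 27, 0, 0, 0)))
```

Descriptors learned through the initial download and through synchronization would both land in the same per-run file.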

  3. Regarding #20234:comment:5: Clients might not be interested in past or future (according to published time) descriptors and just download the file they consider current, if it changed since their last visit.

Right, this is an important argument for storing descriptors by published hour, so that clients can retrieve them easily. However, the presumption there is that the client knows the publication time of a descriptor before downloading something, and that's not always the case. It might be that the client would have to download several files and search for the descriptor it's looking for.

And the most important argument against storing descriptors by published hour is that clients that just want the new descriptors will have to download about 8 files per hour (due to #20234) rather than 1, where 6 or 7 of these files contain mostly the same descriptors as before.

  • Regarding the notice: I think the two week time frame is fine.

Sounds good. Let's first conclude on something here and then tell the world.

comment:5 Changed 2 years ago by karsten

Priority: Medium → High

I'd like us to move forward here, ideally with descriptors grouped by download time and both of us being fully convinced that it's the best way forward. :)

So, let me give you some background on where the recent/ folder comes from.

A few years back, there was just the archive/ folder with tarballs that were updated every few days. All services like Tor Metrics, ExoneraTor, and Onionoo were running on the same host as CollecTor and using CollecTor's directory structure for importing new descriptors. This was very convenient for running these services, but of course very fragile and very impossible for others to run similar services. That's when I turned CollecTor into its own service.

The new CollecTor service had a local directory called rsync/, the predecessor of recent/, which had just the newest files that other services would download via rsync rather than http. The idea was to provide the latest 72 hours of descriptors, so that services can miss updates for up to 3 days (a weekend) without having to fall back to importing tarballs from the archive/ directory. This fixed the problem of running all services on one machine, but it didn't allow others to run services. We quickly learned that rsyncing thousands or even hundreds of thousands of files did not scale, so we appended many small descriptors into one file per CollecTor update run.

At some point we made that rsync/ directory available via http as recent/ and taught Onionoo et al. to download descriptors from there instead of relying on a local rsync command to magically fetch them. This is when other services could first enter the game. It's also when users started browsing the recent/ directory to have an easy way to download descriptors---but that was mostly coincidence and a nice side effect.

Now we're considering changing the directory structure to make it even more efficient for services to keep up to date. Merging votes into single files reduces the index.json* size while keeping the service exactly as useful for other services. Something that we'll make a bit more difficult is accessibility for humans, because they cannot locate a vote as easily anymore.

Also consider a feature request that people ask for every so often: provide a search for raw descriptors. This is something that folks like directory authority operators or others who debug the network would find really useful. And these folks might be sad that votes are appended to single files and stored by download time rather than valid-after time. But it's again coincidence that votes are easily locatable by valid-after time. On the other hand, if a user searches for something different, like a relay fingerprint or IP address, they'll likely have to download the latest few votes and search locally.

So, we might even go one step further and store all descriptors in the recent/ folder by download time. That would include consensuses, of which there is usually only one per CollecTor update run. The upside would be that it'd become more obvious that all file names contain the download time, not the published or valid-after time.

All in all, I'd like to consider the recent/ folder as an update channel for services rather than something that humans browse. I'm not going to stop them from doing that, but I'm very hesitant to make the original use case of that directory less useful by supporting this new use case. And we would do that by forcing services to download multiple files containing many descriptors they already know.

Somebody should go and write a descriptor database that takes CollecTor's recent/ folder as input and provides a search interface that returns raw descriptors.

I hope this makes sense. Please let me know if it doesn't! And thanks for reading this wall of text. ;)

comment:6 Changed 2 years ago by iwakeh

Thanks for the history here! Parts like rsync to recent could be guessed from the code, but it's good to have it confirmed. And, for other topics like the purpose of 'recent' it is very important and an easy read :-)

Well, so the definition/purpose for/of recent is:

Provide the latest subset of descriptors a CollecTor could acquire, where 'latest' is currently defined as an acquisition time no more than three days old; the acquisition time can be download time or sync-time (once sync is in place).

Following this definition, your second suggestion to store all descriptors in 'recent' bundled according to their acquisition time is a simple consequence.

It just needs to be stated prominently somewhere that recent is not for human-readable browsing.
(The database should not be too difficult to supply soon.)

How to approach the implementation?
I'd like to have this implementation in order to simplify merging and some of the existing CollecTor code.
I could implement it together with the sync ticket (with decent commits for review). That also would be a big step forward in modularizing the CollecTor code and making it testable etc.

comment:7 Changed 2 years ago by karsten

Great! Glad we agree here. :) If you'd like to implement this, I'd be happy to review the resulting patches. Thanks!

comment:8 Changed 2 years ago by iwakeh

Status: new → needs_information

Having microdescs in one file grouped by download time, how can a syncing instance infer the valid-after date-time of the corresponding micro-consensus for storing the descriptors in 'out'?

comment:9 Changed 2 years ago by karsten

Good question! First answer: microdescriptors are already grouped by download time, so this issue already exists and is not something we'd introduce now.

Second answer: I remember this was painful when implementing the code a few years back. I can't provide you with a good solution right now, but I suggest reading up how the current code handles this issue. One possible starting point:

https://gitweb.torproject.org/collector.git/tree/src/main/java/org/torproject/collector/relaydescs/RelayDescriptorDownloader.java#n70

Let's talk more next week if this doesn't make sense. I just wanted to confirm that this is indeed an issue, but also say that it's nothing we're introducing right now.

comment:10 Changed 2 years ago by iwakeh

Well, actually the "validAfter" from the referring microconsensus is passed on. Thus, the new functionality would change something.

In general, there should be a better solution for the micro-desc handling. Couldn't a @valid-after tag be prepended to the microdescs?

Otherwise, the sync-mechanism will have to parse microcons in order to determine the appropriate date.

comment:11 Changed 2 years ago by karsten

Okay, I see what you mean. However, I'd rather avoid adding such a @valid-after tag, because it smells like making things more complicated than they should have to be.

Here's what we could do. We already have a list of missing microdescriptors in place for the downloader. And we already parse incoming microdescriptor consensuses to learn about microdescriptors we'll want to fetch. What we could do is: 1) always sync microdescriptor consensuses before microdescriptors, so that we learn about missing microdescriptors and their valid-after times; 2) look at the same map for sorting incoming microdescriptors into months; 3) discard microdescriptors we receive via sync that we're not missing.
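The three steps could be sketched roughly as follows. This is only an illustration under assumptions: the map of missing microdescriptors is modeled as a plain dictionary from digest to valid-after time, and the function names and data shapes are hypothetical rather than taken from the downloader code.

```python
from collections import defaultdict
from datetime import datetime


def learn_from_microdesc_consensus(missing, valid_after, referenced_digests, known_digests):
    """Step 1: record digests referenced by a synced consensus that we do not have yet."""
    for digest in referenced_digests:
        if digest not in known_digests:
            missing.setdefault(digest, valid_after)


def sort_synced_microdescriptors(missing, incoming):
    """Steps 2 and 3: sort wanted microdescriptors into months, discard the rest."""
    by_month = defaultdict(list)
    discarded = []
    for digest, raw in incoming:
        valid_after = missing.pop(digest, None)  # a digest is only accepted once
        if valid_after is None:
            discarded.append(digest)  # step 3: not in the missing map
            continue
        by_month[valid_after.strftime("%Y-%m")].append((digest, raw))  # step 2
    return dict(by_month), discarded


missing = {}
learn_from_microdesc_consensus(
    missing, datetime(2016, 9, 23, 14, 0, 0), ["d1", "d2"], known_digests={"d2"})
sorted_descs, discarded = sort_synced_microdescriptors(
    missing, [("d1", b"onion-key ..."), ("d3", b"onion-key ...")])
print(sorted_descs, discarded)
```

Because the consensuses are synced first, a microdescriptor only ends up discarded when no synced consensus referenced it, which is the case discussed next.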

1) and 2) seem doable, but let's briefly think about the impact of 3) there.

First, it seems rather unlikely that we'll run into that case very often, because we'd also sync microdescriptor consensuses from the other instance, so we should know all microdescriptors they know.

Second, the value of microdescriptors is limited for most of our use cases, and the main reason for collecting them was to facilitate debugging Tor protocols, not to analyze the Tor network, which is better done with consensuses and server descriptors.

Third and last, let's keep in mind that we're improving descriptor completeness a lot with this sync approach, even if we might still be missing half a dozen microdescriptors per year.

I'd say let's take the best-effort approach with microdescriptors and call it a day. :)

comment:12 Changed 2 years ago by iwakeh

Parent ID: #20519

comment:13 Changed 2 years ago by iwakeh

Currently, metrics-lib cannot deal with the huge files resulting from vote grouping.
Reconsider after implementation of #20395.

comment:14 Changed 14 months ago by karsten

Owner: set to metrics-team
Status: needs_information → assigned

comment:15 Changed 9 months ago by karsten

Priority: High → Medium

Without reading all comments above, I believe that we decided to set priority to high before figuring out that metrics-lib cannot handle large files. Setting priority back to the default (medium). We can change it back later if that issue is resolved.
