Opened 7 months ago

Last modified 5 months ago

#33502 new enhancement

Do not let appended descriptor files grow too large

Reported by: karsten
Owned by: karsten
Priority: Medium
Milestone:
Component: Metrics/CollecTor
Version:
Severity: Normal
Keywords:
Cc: metrics-team
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

I revisited #20395 last week. The issue is that metrics-lib cannot handle large descriptor files, because it first reads the entire file into memory before splitting it into single descriptors and parsing them. While it would be possible to parse large descriptor files after making some major code changes (using FileChannel and doing lazy parsing), I don't think that we have to do that. After all, we're writing these large descriptor files ourselves in CollecTor, and it's up to us to stop doing that.
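To illustrate the difference between reading a whole file and splitting it lazily, here is a minimal sketch of the incremental approach. It is not metrics-lib code: the class name `StreamingDescriptorSplitter` is hypothetical, it assumes every descriptor in the file begins with an `@type ` annotation line, and it uses a plain `BufferedReader` rather than the `FileChannel` approach mentioned above. It also works on text lines, which a real parser would likely avoid in order to preserve the raw descriptor bytes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.function.Consumer;

/** Hypothetical helper (not part of metrics-lib): splits a descriptor file one descriptor at a time. */
final class StreamingDescriptorSplitter {

    /** Assumption: each descriptor starts with an "@type " annotation line. */
    static final String START_PREFIX = "@type ";

    /**
     * Reads line by line and hands each complete descriptor to the sink,
     * so only one descriptor is held in memory at any time.
     * Returns the number of descriptors found.
     */
    static int split(BufferedReader reader, Consumer<String> sink) throws IOException {
        StringBuilder current = new StringBuilder();
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            // A new annotation line closes the descriptor collected so far.
            if (line.startsWith(START_PREFIX) && current.length() > 0) {
                sink.accept(current.toString());
                count++;
                current.setLength(0);
            }
            current.append(line).append('\n');
        }
        // Emit the last descriptor, which has no following annotation line.
        if (current.length() > 0) {
            sink.accept(current.toString());
            count++;
        }
        return count;
    }
}
```

With this shape, memory use depends on the largest single descriptor rather than on the file size, which is the property the current read-everything-first approach lacks.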

Going back in time, the original reason for concatenating multiple descriptors into a single file was that rsyncing many tiny files from one host to another host was just slow. So we appended server descriptors and extra-info descriptors into a single file. This works well with server descriptors or extra-info descriptors published within 1 hour or even 10 hours. It does not work that well anymore with all server descriptors or extra-info descriptors synced from another CollecTor instance when starting a new instance (#20335). It works even less well when importing one or more monthly tarballs containing server descriptors or extra-info descriptors (#27716).

My suggestion is to define a configurable size limit for appended descriptor files of, say, 20 MiB. When storing a descriptor, we would check whether appending it to the existing descriptor file would exceed this limit and, if so, start a new descriptor file.
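A rough sketch of what such a check could look like follows. None of this is existing CollecTor code: the class name `SizeLimitedAppender`, the numeric file suffix used for rollover, and the constructor parameters are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper (not part of CollecTor): appends descriptors, starting a new file at a size limit. */
final class SizeLimitedAppender {
    private final Path directory;
    private final String baseName;
    private final long maxBytes;   // configurable limit, e.g. 20L * 1024 * 1024
    private int index = 0;         // assumed naming scheme: baseName, baseName.1, baseName.2, ...

    SizeLimitedAppender(Path directory, String baseName, long maxBytes) {
        this.directory = directory;
        this.baseName = baseName;
        this.maxBytes = maxBytes;
    }

    private Path currentFile() {
        return directory.resolve(index == 0 ? baseName : baseName + "." + index);
    }

    /** Appends one descriptor and returns the file it was written to. */
    Path append(byte[] descriptor) throws IOException {
        Path file = currentFile();
        // Move on while the candidate file is non-empty and appending would exceed the limit.
        // An empty or missing file always accepts the descriptor, so a single
        // descriptor larger than the limit still ends up in a file of its own.
        while (Files.exists(file) && Files.size(file) > 0
                && Files.size(file) + descriptor.length > maxBytes) {
            index++;
            file = currentFile();
        }
        Files.write(file, descriptor,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return file;
    }
}
```

One of the technical details this sketch glosses over is how the additional files should be named so that clients and other CollecTor instances still find them.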

There are some technical details to work out, but I think they can be solved. I also don't expect this to require a lot of code or any complex code changes. The benefit would be that we could resolve #20395 and #27716 by implementing this.

Thoughts on the general idea?

Change History (3)

comment:1 Changed 7 months ago by karsten

Status: assigned → needs_review

comment:2 Changed 6 months ago by karsten

Here's another option: rather than appending multiple descriptors to a single flat file, we could produce a tarball containing the few hundred or thousand descriptor files. Basically,

https://collector.torproject.org/recent/relay-descriptors/server-descriptors/2020-03-10-14-05-00-server-descriptors

containing 596 descriptors concatenated to a 1.4 MiB file would then be replaced by

https://collector.torproject.org/recent/relay-descriptors/server-descriptors/2020-03-10-14-05-00-server-descriptors.tar

containing 596 descriptor files.

The advantage over the approach sketched out above would be that we would no longer have three output file formats (flat file with 1 descriptor, flat file with >= 1 descriptors, tarball). The disadvantage might be that processing tarballs can be less convenient than processing flat files.

comment:3 Changed 5 months ago by karsten

Status: needs_review → new

irl and I just talked this over and concluded that producing tarballs is the better design here. It solves the large files issue, and it might even fix data integrity/consistency issues that just haven't surfaced yet. I'm going to write a patch for the tarball idea sometime in the next few weeks. Thanks!