Opened 5 weeks ago

Last modified 3 weeks ago

#28320 accepted task

Rewrite CollecTor relaydescs module using Stem/txtorcon

Reported by: karsten
Owned by: irl
Priority: Medium
Milestone:
Component: Metrics/CollecTor
Version:
Severity: Normal
Keywords:
Cc: metrics-team
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor: Sponsor13

Description

The CollecTor service collects and archives data from various nodes and services in the public Tor network. Internally, it consists of several modules that are running in the background following a pre-defined schedule. These modules either download data from other hosts or process data that has been copied from other hosts to the local file system. The processed data is then provided via a locally running static web server.

CollecTor is written in Java. It uses several APIs either provided in the JDK or in third-party libraries. For example, it uses java.util.concurrent for scheduling. However, it does not use a specific framework for batch processing. That is why it has to solve challenges like the following on its own:

  • Scheduling: Make sure modules are running, say, once per hour; avoid overlapping runs.
  • Dependencies: Make sure that module runs don't interfere with each other; one module writes newly obtained files to disk, another tars them up, yet another writes an index file and provides that to external applications.
  • Shutdowns: Handle externally triggered shutdowns gracefully and make sure the service resumes operation after reboot, without missing data.

These are just a few examples, and CollecTor does not resolve all of them in the best way possible. It also feels like somebody must have solved these challenges before. We should find out, and the best way is probably to try it out in practice.
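To make the scheduling and shutdown challenges concrete, here is a minimal sketch in Python of a runner that skips a run while the previous one is still active and that can be stopped between runs. It is only an illustration under stated assumptions, not how CollecTor does it today: the class name NonOverlappingRunner and its methods are hypothetical, and it uses only the Python standard library rather than any batch processing framework.

```python
import threading


class NonOverlappingRunner:
    """Hypothetical helper: runs a task periodically and never overlaps runs."""

    def __init__(self, task, interval_seconds):
        self._task = task
        self._interval = interval_seconds
        self._lock = threading.Lock()    # held while a run is in progress
        self._stop = threading.Event()   # set when a shutdown is requested
        self.skipped = 0                 # number of runs skipped due to overlap

    def run_once(self):
        """Run the task unless a run is already in progress.

        Returns True if the task ran, False if the run was skipped.
        """
        if not self._lock.acquire(blocking=False):
            self.skipped += 1
            return False
        try:
            self._task()
            return True
        finally:
            self._lock.release()

    def run_forever(self):
        """Run the task every interval until stop() is called."""
        while not self._stop.is_set():
            self.run_once()
            # Event.wait() returns early when stop() is called, so a
            # shutdown request does not have to wait out the full interval.
            self._stop.wait(self._interval)

    def stop(self):
        """Request a shutdown; a run that is in progress finishes first."""
        self._stop.set()
```

An hourly module would be started with something like NonOverlappingRunner(fetch_descriptors, 3600).run_forever(), where fetch_descriptors is a placeholder for the module's work. The sketch does not cover dependencies between modules or resuming after a reboot without missing data.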

In Mexico City we decided to evaluate existing batch processing frameworks by rewriting the CollecTor relaydescs module using Python with Stem or txtorcon. It should be sufficient to make it work for at least consensuses and server descriptors as an initial proof of concept. Other descriptor types can follow later if we decide to switch from Java to Python for CollecTor.

The first steps are to write down requirements and possible Python libraries for the batch-processing parts.

We're done with this task when we have a working prototype of CollecTor in Python that fetches consensuses and server descriptors from the directory authorities.

Change History (4)

comment:1 Changed 5 weeks ago by atagar

Hi Karsten. This sounds an awful lot like DocTor (download descriptors on an hourly basis and check a series of characteristics).

That said, honestly I'm unsure you need either Stem or txtorcon for this. If all you want is to download descriptors, won't cron and curl do the trick? Stem's benefit is that it parses descriptors and can download them leveraging directory mirrors. As for txtorcon, I'm unaware of any descriptor capabilities it provides (meejah can correct me if I'm wrong, but it's solely a Twisted control port controller).

Would you mind further describing what you're hoping one of these will provide? If you'd like a simple example of downloading descriptors, our tutorials, demos, and DocTor have examples.
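For reference, the cron-and-curl approach could look roughly like the following in plain Python, with no Stem or txtorcon involved. This is a sketch under assumptions: the helper names descriptor_url and fetch are hypothetical, and the address used in the note below is a documentation placeholder rather than a real directory authority. The two resource paths are the ones the Tor directory protocol defines for the current consensus and for all server descriptors.

```python
import urllib.request

# Resource paths from the Tor directory protocol.
CONSENSUS_PATH = "/tor/status-vote/current/consensus"
SERVER_DESCRIPTORS_PATH = "/tor/server/all"


def descriptor_url(dir_address, path):
    """Build the URL of a directory resource.

    dir_address is "host:port" of a directory authority or mirror.
    """
    return "http://%s%s" % (dir_address, path)


def fetch(dir_address, path, timeout=60):
    """Download a directory resource and return its raw bytes.

    No parsing or validation happens here; that is the part where a
    descriptor library such as Stem would come in.
    """
    url = descriptor_url(dir_address, path)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling fetch("192.0.2.1:80", CONSENSUS_PATH) would return the raw consensus document once the placeholder address is replaced with the DirPort address of a real authority or mirror.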

comment:2 Changed 5 weeks ago by karsten

It's true that Stem/txtorcon might not be of primary interest for this prototype. But if we later want to build upon this prototype to rewrite the rest of CollecTor's relaydescs module, Stem's descriptor parsing and validation capabilities will be quite useful. I'll leave it up to irl to decide whether he wants to use Stem or txtorcon. This discussion will also be more useful as soon as the requirements are written down. Stay tuned!

comment:3 Changed 5 weeks ago by atagar

Gotcha! When you guys know what you want just let me know. I'd be happy to whip up a prototype if you'd like since the asks here thus far sound pretty simple.

comment:4 Changed 3 weeks ago by irl

Owner: changed from metrics-team to irl
Status: changed from new to accepted