Opened 6 years ago

Closed 6 years ago

#4439 closed task (implemented)

Develop a Java/Python API that wraps relay descriptor sources and provides unified access to them

Reported by: karsten
Owned by: atagar
Priority: Medium
Milestone:
Component: Core Tor/Stem
Version:
Severity:
Keywords:
Cc: atagar
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Quite a few metrics tools are processing archived and current relay descriptors to provide aggregate statistics, make descriptor archives searchable, or monitor the Tor network. These tools have a non-trivial amount of code in common that imports relay descriptors from various sources. Copying code is bad. Let's write an API that all these metrics tools can use and that facilitates developing new tools.

Note that this API is different from existing Tor controller APIs, which connect to a Tor's control port and provide descriptors that the Tor process knows about. The new API won't connect to a Tor control port (doing so would be possible, but it's not required); instead it may read the cached descriptors from a Tor's data directory, along with importing relay descriptors from other sources. Of course, the two APIs can be combined, but there's also a reason for the API described here to exist separately: none of the metrics tools needs to control a Tor process.

There are two major sources for relay descriptors:

  • Local directories: We can read relay descriptors from the cached-* files of a local Tor data directory or from the output directory of the directory-archive script or metrics-db. Some of these local directories can grow quite large, so we'll need an efficient way to exclude descriptors that we already know (see the sketch after this list). Also, some files in these directories contain multiple relay descriptors while others don't. We'll want to support an arbitrary number of local directories in the new API.
  • Directory authorities/mirrors: We can download relay descriptors from the directory authorities or directory mirrors via Tor's directory protocol. We should restrict downloads to the minimum and only download missing descriptors. We should also download compressed descriptors if possible. In some cases we're interested in whether a directory authority serves a descriptor (e.g., the consensus-health script). In most cases we want to set a timeout for downloading descriptors.
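As a rough illustration of the first point, a tool could remember which files it has already processed between runs and only look at new or modified ones. This is only a hedged sketch; the function name and the (path, mtime) bookkeeping are illustrative, not part of any existing tool:

import os

def new_descriptor_files(directory, seen):
  # Yield file paths under 'directory' that are new or changed since the
  # last run; 'seen' maps path -> the modification time we last processed.
  for root, _, files in os.walk(directory):
    for name in files:
      path = os.path.join(root, name)
      mtime = os.path.getmtime(path)
      if seen.get(path) != mtime:
        seen[path] = mtime
        yield path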

We should design the new API to be stateless across executions and to have no configuration of its own. A tool that uses the API should first initialize it by creating relay descriptor data sources and then request descriptors to process.

The following tools may use the new API once it's ready: metrics-db, the part of metrics-web that aggregates statistics, the ExoneraTor database, the relay search database, the consensus-health script, the descriptor-health script, and the basic monitoring infrastructure.

Child Tickets

Attachments (1)

descriptor.tar.bz2 (53.2 KB) - added by karsten 6 years ago.
Tarball containing an initial, non-functional version of the API


Change History (19)

comment:1 Changed 6 years ago by atagar

I would be interested in discussing this with you during the winter developer meeting. One thing that I'd like (and maybe it's tangential to this) is a library I can run locally that provides a facade with equivalents to the GETINFO ns/*, desc/*, and microdescriptor options (the last pending #3832). This library could then read from the control port, cached-* files, or the metrics service.

At present we tell scripts that just need consensus information to run a tor client with 'FetchDirInfoEarly', which does the trick, but it seems like we can come up with a better, lighter-weight way to subscribe for the latest consensus.

This would be a start of the metrics service API, with obvious next steps being to fetch historical consensus data or aggregated statistics.

Cheers! -Damian

comment:2 in reply to: 1 Changed 6 years ago by karsten

Replying to atagar:

I would be interested in discussing this with you during the winter developer meeting.

Happy to talk to you earlier than that. I really hope we're done with the first version of this API long before the dev meeting, or I'll have to copy the same code a couple more times.

One thing that I'd like (and maybe it's tangential to this) is a library I can run locally that provides a facade with equivalents to the GETINFO ns/*, desc/*, and microdescriptor options (the last pending ticket 3832).

Not tangential. I think that's pretty much what the API should do.

This library could then read from the control port, cached-* files, or the metrics service.

cached-* files are fine, as well as directory authorities/mirrors that I listed in the ticket description.

What's the advantage of talking to the control port if we can read the cached-* files? Shouldn't be hard to add this as a third data source, though.

By metrics service you mean the metrics database? This will have to wait until we have a good relay descriptor database. See #2922 and #4440 for my thoughts on that. I agree that adding a database as the fourth data source would be a neat extension of the API.

At present we tell scripts that just need consensus information to run a tor client with 'FetchDirInfoEarly' which does the trick, but it seems like we can come up with a better and lighter weight alternative to subscribe for getting the latest consensus.

Yes. That would be downloading from the directory authorities/mirrors. A tricky part will be to explain to the users of this API which data source to use when.

This would be a start of the metrics service API, with obvious next steps being to fetch historical consensus data or aggregated statistics.

Historical consensus data are fine.

But aggregated statistics? Hmmmmmm. That means we'll have to specify somewhere what aggregated statistics the database contains. I like the idea. It's something we'll have to postpone a bit though.

So, great! That's really some good input on the idea of writing such an API. Let's write it. :)

comment:3 Changed 6 years ago by atagar

Not tangential. I think that's pretty much what the API should do.
By metrics service you mean the metrics database?

... yikes. I just realized that I had *completely* misunderstood what you're proposing. Probably for the better though since this sounds related to functionality I was already planning to write.

Stem will need functions and tests for parsing consensus/descriptor/microdescriptor data into developer-friendly objects. This was going to be abstracted into a general Relay class that lazy loads ns/desc information as needed (with a method for triggering eager loads). I'd be more than happy to write functions and integ tests to alternatively fetch them from the cache or authorities/mirrors. However, this work would be a month or two out for me (still busy with shoring up the utils and testing).

Completely ignore the rest of my last comment - for some reason I'd assumed that this would be a service API for querying the metrics db externally.

Cheers! -Damian

comment:4 Changed 6 years ago by karsten

Can you be more precise about which parts you misunderstood? I don't think we're talking about completely distinct functionality, are we? Sure, your plans to parse descriptors into developer-friendly objects and your idea to query the metrics database are not what I'd start with. But that doesn't mean it wouldn't make sense to have that in the API. Actually, I have that code in various metrics tools, and I'd like to move it to a single API, too. It's just a question of where to start.

comment:5 Changed 6 years ago by atagar

Can you be more precise about which parts you misunderstood?

Since this was spawned by the alarming infrastructure ticket I thought in the initial comment that we were talking about an RPC for services (like the alarms) to request information from the metrics hosts, and that this was an API for that. So disregard, it's obvious that has nothing to do with this. :)

I don't think we're talking about completely distinct functionality, are we?

Hmmm... now I think we've just had another misunderstanding. I'm saying that it *is* related functionality and I'd be happy to hack on it as part of stem.

My plan was to have Relay objects which are a composite of three things...

  • fingerprint (constructor arg, always there)
  • consensus data (lazily loaded; throws an exception or returns a default value if it can't be loaded)
  • descriptor data (lazily loaded, with the same behavior)
Relay
|- __init__(fingerprint, raise_exc = False, default = None)
|- load_consensus() - eager fetch for consensus data, returning a boolean for if it succeeds or not
|- load_descriptor() - same for the descriptor
|- fingerprint()
|- exit_policy()
|- contact_info()
+- ... etc, getters for the union of the descriptor and consensus

created() => unix timestamp for when the currently accessible consensus was created
valid_until() => unix timestamp for when this consensus expires
get_relays() => list of all Relay instances
get_relay(fingerprint) => provides Relay instance for the given fingerprint
get_relay_dest(ip_address, port) => provides Relay at the given ip/port
... probably a few other things I haven't thought of yet...
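A minimal runnable sketch of the Relay portion of this outline, with the loading logic stubbed out (the attribute handling here is illustrative only):

class Relay:
  def __init__(self, fingerprint, raise_exc=False, default=None):
    self._fingerprint = fingerprint
    self._raise_exc = raise_exc  # raise instead of returning the default
    self._default = default
    self._consensus = None       # lazily loaded consensus entry
    self._descriptor = None      # lazily loaded server descriptor

  def load_consensus(self):
    # Eager fetch for consensus data, returning a boolean for success.
    # (Stubbed: no data source is wired in here.)
    return self._consensus is not None

  def load_descriptor(self):
    # Same for the descriptor.
    return self._descriptor is not None

  def fingerprint(self):
    return self._fingerprint

  def contact_info(self):
    # The getter pattern shared by the union of descriptor/consensus fields.
    if self._descriptor is None and not self.load_descriptor():
      if self._raise_exc:
        raise ValueError("descriptor unavailable")
      return self._default
    return self._descriptor.get("contact")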

It sounds like we then simply need a factory for the data source that has those methods. Something like...

import os

class ConsensusFetcher:
  """
  Abstract parent for factories that retrieve consensus data.
  """

  def created(self): pass
  def valid_until(self): pass
  def get_relays(self): pass
  def get_relay(self, fingerprint): pass
  def get_relay_dest(self, ip_address, port): pass

class CacheFetcher(ConsensusFetcher):
  def __init__(self, path):
    """
    Retrieves consensus data from the local filesystem, via cached consensus
    files. This raises an IOError if we're unable to read the given data
    directory.
    """

    if not os.path.exists(path): raise IOError("%s doesn't exist" % path)
    # etc for implementation details

class ControlFetcher(ConsensusFetcher):
  def __init__(self, control_connection):
    # ... similar for the control connection. This is important because there
    # could be instances where we don't have read access to tor's data
    # directory, but can access the control socket.
    pass

class DirectoryServerFetcher(ConsensusFetcher):
  def __init__(self, address, port):
    # ... similar for fetching directly from a directory authority or mirror.
    # This one would have a few more options compared to the others...
    pass

  def is_current(self):
    # True if we're working from the most recent consensus, False otherwise.
    pass

  def fetch(self):
    # Retrieves the new consensus, raising an IOError if unable to do so.
    pass
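Hypothetical usage of the sketch above (the data directory path and fingerprint are placeholders):

fetcher = CacheFetcher("/var/lib/tor")  # raises IOError if the path is missing
relay = fetcher.get_relay("0123456789ABCDEF0123456789ABCDEF01234567")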

Does that jibe with what you were thinking of?

comment:6 in reply to: 5 Changed 6 years ago by karsten

Replying to atagar:

Since this was spawned by the alarming infrastructure ticket I thought in the initial comment that we were talking about an RPC for services (like the alarms) to request information from the metrics hosts, and that this was an API for that. So disregard, it's obvious that has nothing to do with this. :)

Ah okay. :)

Hmmm... now I think we've just had another misunderstanding. I'm saying that it *is* related functionality and I'd be happy to hack on it as part of stem.

Makes sense.

My plan was to have Relay objects which are a composite of three things...

  • fingerprint (constructor arg, always there)
  • consensus and descriptor data (lazily loaded, throws an exception or returns a default value if it can't be loaded)

I haven't thought much about an API for the parsed directory objects yet. My main focus was on the different data sources and how we would access them most efficiently. For example, if we have a local Tor data directory, we may only want to learn about new descriptors since we last asked. Or if we download descriptors from the directory authorities, we may only want to download those that we don't already know.

My plan was to start without parsing descriptors at all and simply hand out raw descriptor strings. It does make sense to add the parsing code to the API, too, but that was my step two.

When you say you want to hack on this as part of stem, does that mean that stem will be able to handle non-Tor-control-port data sources? That's what I'm most interested in.

How do we proceed? Should we start with writing a design document describing the scope of the new API? How about we start this in a task-4439 directory in the metrics-tasks Git repository? We can always move it to its own Git repository once it evolves.

comment:7 Changed 6 years ago by atagar

My plan was to start without parsing descriptors at all and simply hand out raw descriptor strings.

Agreed that is a good approach. I was gonna start in the opposite order because I needed parsing code for the control port content, but they're both independent chunks.

When you say you want to hack on this as part of stem, does that mean that stem will be able to handle non-Tor-control-port data sources?

Yes. The plans for stem already go beyond just talking with the control port to include things like torrc templating and system utils for querying the tor process' pid, cwd, etc (actually I'm kinda cheating since I'd already written most of this for arm).

How do we proceed?

Like I said, I'll be busy with the basic functionality of stem for the next couple months. After that I'll be doing the parsing bits, then would be interested in tackling the alternate data sources. If you'd like to do that beforehand in Java then that would certainly be a help (especially for pulling directly from directory authorities and mirrors), since then I could simply translate your implementation.

Should we start with writing a design document describing the scope of the new API?

I'd rather see this grow organically, then use it for some of our own projects to mature the API before declaring it ready for external use. We can start with a design document if you'd like, but it's not necessary in my opinion.

How about we start this in a task-4439 directory in the metrics-tasks Git repository?

You mean for the Java implementation? I suppose it's fine for the Java part of this library to live in metrics and the Python implementation to be in stem.

comment:8 Changed 6 years ago by karsten

I finished an initial, non-functional version of the Java API this weekend. I'd love to hear your thoughts on the design, atagar.

From the README:

DescripTor is a Java API that makes various Tor descriptor types available
for statistical analysis and for building services and applications.

The descriptor types supported by DescripTor include relay and bridge
descriptors which are part of Tor's directory protocol as well as Torperf
data files, GetTor statistics files, and TorDNSEL's exit lists.  Access to
these descriptors is unified to facilitate access to publicly available
data about the Tor network.

This API is designed for Java programs that process Tor descriptors in
batches.  A Java program using this API first sets up a descriptor source
by defining where to find descriptors and which descriptors it considers
relevant.  The descriptor source then makes the descriptors available in a
descriptor store.  The program can then query the descriptor store for the
contained descriptors.  Changes to the descriptor sources after
descriptors are made available in the descriptor store will not be
noticed.  This simple programming model was designed for periodically
running, batch-processing applications and not for continuously running
applications that rely on learning about changes to an underlying
descriptor source.
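A rough sketch of that source/store programming model, in Python pseudocode rather than the actual Java interfaces (all class and method names here are illustrative, not DescripTor's API):

class DescriptorStore:
  # Holds descriptors once a source has made them available; later changes
  # to the source are not reflected (snapshot semantics).

  def __init__(self):
    self._descriptors = []

  def add(self, descriptor):
    self._descriptors.append(descriptor)

  def get_descriptors(self):
    return list(self._descriptors)

class DescriptorSource:
  # Configured with where to find descriptors; loads them into a store.

  def __init__(self, paths):
    self._paths = paths

  def load_into(self, store):
    for path in self._paths:
      with open(path, "rb") as descriptor_file:
        store.add(descriptor_file.read())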

I could imagine that a good order to read the API is to start with the example applications in org.torproject.descriptor.example and look at the relevant interfaces in org.torproject.descriptor whenever they're referenced from the examples.

I'm attaching the code tarball to this ticket. Once there's a Git repository, I'll move the code there. But what's a good name for this API and the repository? I came up with "DescripTor," but maybe there's something better? (I fear the English language may run out of words ending in -tor...)

Changed 6 years ago by karsten

Attachment: descriptor.tar.bz2 added

Tarball containing an initial, non-functional version of the API

comment:9 Changed 6 years ago by atagar

that makes various Tor descriptor types available for statistical analysis and for building services and applications.

The main value of this library is being able to pull consensus information via several methods without a tor instance, yes? This description seems to focus more on the use cases you're planning rather than what the library does. Maybe alternatively phrase this as "that directly fetches consensus information from a variety of sources like cached descriptors and directory authorities/mirrors."

include relay and bridge descriptors

How does a bridge descriptor differ from a relay descriptor? I don't think that I've ever dealt with them.

The descriptor source then makes the descriptors available in a descriptor store.

Ahhh interesting, I hadn't thought of eagerly loaded consensus snapshots. As the last sentence mentions, this works well for batch jobs, being simpler for callers and more faithful to how consensuses are published. However, it also comes with the drawbacks of a lengthy initialization and high memory usage. Can we follow an iteration or callback pattern instead so we can process the descriptors as they come in (and free the memory)?

I had been planning on an API where we lazy load and cache descriptors requested by our caller. That would have been better for some use cases, but certainly not when we want to process the majority of the consensus. It might also not be realistic with how we can fetch consensus information when detached from the control socket (I assume when dealing with authorities and cached consensuses, fetching is more of an all-or-nothing operation).

Cheers! -Damian

comment:10 in reply to: 9 Changed 6 years ago by karsten

Replying to atagar:

The main value of this library is being able to pull consensus information via several methods without a tor instance, yes? This description seems to focus more on the use cases you're planning rather than what the library does. Maybe alternatively phrase this as "that directly fetches consensus information from a variety of sources like cached descriptors and directory authorities/mirrors."

Good idea, added your sentence before my first sentence.

How does a bridge descriptor differ from a relay descriptor? I don't think that I've ever dealt with them.

See https://metrics.torproject.org/formats.html#bridgedesc.

Ahhh interesting, I hadn't thought of eagerly loaded consensus snapshots. As the last sentence mentions this works well for batch jobs, being simpler for callers and more faithful to how consensuses are published. However, it also comes with the drawbacks of a lengthy initialization and high memory usage. Can we follow an iteration or callback pattern instead so we can process the descriptors as they come in (and free the memory)?

I agree that potentially lengthy initialization and high memory usage may be problematic. In fact, I started with a callback pattern, but discarded that because it's more difficult to use for some applications. On second thought that doesn't apply to all applications. For example, the consensus-health checker needs to have all consensuses and votes available before it can do anything; it would essentially have to implement a descriptor store itself. But the metrics-web database importer could easily implement a listener and start importing once the first descriptor arrives.

How about we implement both the descriptor store and a callback pattern?

What would the iteration pattern look like? Do you have an example?

I had been planning on an api where we lazy load and cache descriptors requested by our caller. That would have been better for some use cases, but certainly not when we want to process the majority of the consensus. It might also not be realistic with how we can fetch consensuses information when detached from the control socket (I assume when dealing with authorities and cached consensuses fetching is more of an all-or-nothing operation).

I'm not entirely sure what you mean here. Fetching a consensus is an all-or-nothing operation when downloading via the directory protocol. Also, requests for multiple server descriptors or extra-info descriptors should be combined in a single request to reduce the download overhead.

comment:11 Changed 6 years ago by atagar

For example, the consensus-health checker needs to have all consensuses and votes available before it can do anything

I favor the iterator for that reason - callers that want everything buffered can read everything into a list (simple to do with both Python and Java).

How about we implement both the descriptor store and a callback pattern?

The callback is bad because you're having the handler block reads, and stores are bad for the reasons mentioned earlier. If we went with an iterator then it would be the best of both worlds: unblocked reads, limited memory usage if the handler is faster than reads, and it can be converted into a store too. The only advantage of a callback is that it would guarantee constant memory usage (if your handler's slow then you could consume as much memory as your buffer size, which would probably be unbounded). On second thought that would be likely to come up when reading local cached descriptors... let's do both.

What would the iteration pattern look like? Do you have an example?

Iterator would just be a simple producer/consumer. The producer thread adds descriptors to a buffer as they're read and the consumer pops elements off and provides them to the caller (blocking if there's no input). IIRC this would be handled in both Python and Java by a synchronized queue (I forget the class... java.util.concurrent.BlockingQueue?).
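A minimal sketch of that pattern in Python (the reader argument is a stand-in for whatever actually fetches descriptors; Java's counterpart would be a BlockingQueue as mentioned). The bounded queue also gives the "suspend the producer if the handler is slow" behavior discussed below:

import queue
import threading

_DONE = object()  # sentinel marking the end of input

def iterate_descriptors(read_descriptors, buffer_size=100):
  # Run the reader on a producer thread and yield its items; a bounded
  # queue suspends the producer whenever the consumer falls behind.
  buffered = queue.Queue(maxsize=buffer_size)

  def producer():
    for descriptor in read_descriptors:
      buffered.put(descriptor)  # blocks while the buffer is full
    buffered.put(_DONE)

  threading.Thread(target=producer, daemon=True).start()

  while True:
    descriptor = buffered.get()  # blocks if there's no input yet
    if descriptor is _DONE:
      return
    yield descriptor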

I'm not entirely sure what you mean here.

Requesting descriptors via the control socket can be for individual relays. I was thinking there may be some counterpart for 'give me descriptor for fingerprint X' via directory mirrors and authorities but on second thought tor wouldn't use that so it would be odd if that capability existed. Oh well...

comment:12 in reply to: 11 Changed 6 years ago by karsten

Replying to atagar:

I favor the iterator for that reason - callers that want everything buffered can read everything into a list (simple to do with both python and java).

Right. Makes sense.

The callback is bad because you're having the handler block reads,

Oh, right, haven't thought of that.

and stores are bad for the reasons mentioned earlier. If we went with an iterator then it would be the best of both worlds: unblocked reads, limited memory usage if the handler is faster than reads, and it can be converted into a store too. The only advantage of a callback is that it would guarantee constant memory usage (if your handler's slow then you could consume as much memory as your buffer size, which would probably be unbounded). On second thought that would be likely to come up when reading local cached descriptors... let's do both.

We could even suspend adding new descriptors to the queue if the handler is slow. That would work both for downloads and for reading from disk.

And we could implement descriptor parsing on demand, that is, when a handler runs the first getter of a descriptor they received from the queue. That would save quite some memory, too.
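A hedged sketch of that parse-on-demand idea (the toy line parser and field name are illustrative only):

class LazyDescriptor:
  def __init__(self, raw_bytes):
    self._raw = raw_bytes   # keep only the raw bytes until a getter runs
    self._parsed = None

  def _parse(self):
    if self._parsed is None:
      # Toy parsing: first token of each line is the keyword, rest the value.
      self._parsed = {}
      for line in self._raw.decode("utf-8", "replace").splitlines():
        keyword, _, value = line.partition(" ")
        self._parsed[keyword] = value
    return self._parsed

  def contact_info(self):
    return self._parse().get("contact")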

But! These are ideas to optimize something that's not even there.

I'd like to start with a single pattern. We can always make it more complex later on.

Iterator would just be a simple producer/consumer. The producer thread adds descriptors to a buffer as they're read and the consumer pops elements off and provides them to the caller (blocking if there's no input). IIRC this would be handled in both Python and Java by a synchronized queue (I forget the class... java.util.concurrent.BlockingQueue?).

Cool. I think I like that pattern most. (Let me update the API and example applications, and hopefully I'll still like it afterwards.)

Requesting descriptors via the control socket can be for individual relays. I was thinking there may be some counterpart for 'give me descriptor for fingerprint X' via directory mirrors and authorities but on second thought tor wouldn't use that so it would be odd if that capability existed. Oh well...

Well, you can ask for the descriptor for fingerprint X. But the better approach is to ask by descriptor ID, not by fingerprint. And it's better to ask for more than one descriptor at a time, because it causes less overhead for the directory. When you're bored, look at dir-spec.txt and search for "http" to see what fancy things the directory protocol allows you to do.
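For reference, the directory protocol lets a client batch several descriptors into one request and ask for a compressed reply; a hedged sketch of building such a URL path (per dir-spec's /tor/server/d/ resource):

def server_descriptor_url(digests, compressed=True):
  # Per dir-spec, several server descriptors can be fetched by digest in a
  # single request; a ".z" suffix asks for a compressed reply.
  return "/tor/server/d/" + "+".join(digests) + (".z" if compressed else "")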

Anyway, let's focus on the iterator idea first.

I'll ask Sebastian to create two personal "DescripTor" repositories for us. That way you can make changes to the code or documentation and tell me to pull them, rather than having to describe your suggested changes here. And once we agree on a project name, we can create an official repository.

comment:13 in reply to: 12 Changed 6 years ago by karsten

Replying to karsten:

I'll ask Sebastian to create two personal "DescripTor" repositories for us. That way you can make changes to the code or documentation and tell me to pull them, rather than having to describe your suggested changes here. And once we agree on a project name, we can create an official repository.

And here's my public repository. (Thanks, Sebastian!)

comment:14 Changed 6 years ago by karsten

Status: new → needs_review

The DescripTor code has moved to the official metrics-lib repository. The descriptor-parsing parts for consensuses and votes are tested and quite robust. The downloader and reader are still somewhat fragile and not tested very much, other than by running the consensus-health checker.

I would appreciate some code review here. Any feedback like comments on code style, documentation fixes, refactoring suggestions, potential bugs, real bugs, etc. would be very welcome. The library is not finished, but it's ready for feedback. A patch with comments that I can work off would be perfect.

comment:15 Changed 6 years ago by karsten

Resolution: implemented
Status: needs_review → closed

We now have a library that does what the ticket description states, at least for the Java part. Leaving the ticket open in the hope that someone comes along to review the code hasn't been very successful so far. Closing.

comment:16 Changed 6 years ago by atagar

Component: Metrics Utilities → Stem
Resolution: implemented
Status: closed → reopened

Reopening and assigning to stem to track the Python counterpart for this change. I won't be able to get to this for a while, but it is on the project's todo list...
https://trac.torproject.org/projects/tor/wiki/doc/stem#Projects

comment:17 in reply to: 16 Changed 6 years ago by karsten

Owner: changed from karsten to atagar
Status: reopened → assigned

Replying to atagar:

Reopening and assigning to stem to track the Python counterpart for this change. I won't be able to get to this for a while, but it is on the project's todo list...
https://trac.torproject.org/projects/tor/wiki/doc/stem#Projects

Sounds good. Re-assigning to you.

comment:18 Changed 6 years ago by atagar

Resolution: implemented
Status: assigned → closed

Large merge this last week (bc0e578) progressed this quite a bit, adding classes and tests for the descriptor reader and server descriptors. I'm gonna resolve this ticket since it has grown to a point of being unwieldy; continuing work will be tracked in more narrowly focused tickets.

Cheers! -Damian
