Opened 5 years ago

Closed 4 years ago

Last modified 4 years ago

#8050 closed enhancement (implemented)

Stem's DescriptorReader should provide an option to provide statuses vs. status entries

Reported by: karsten
Owned by: atagar
Priority: Medium
Component: Core Tor/Stem

Description

When I pass a tarball of consensuses to Stem's DescriptorReader, it gives me an iterator over status entries, though I'd expect an iterator over statuses. I see the advantages of returning status entries rather than waiting until a full status is parsed. But for most use cases I'm interested in, I want the status first and only then, maybe, look into its status entries.

For example, I might want to extract supported consensus versions or bandwidth weights over time; no need to look into status entries for that. The alternative, to iterate over status entries and look at every referenced status document to see if I saw that before or not, seems complicated. It probably doesn't even work for bandwidth weights which are parsed after the status entries.

Can we have a parameter in DescriptorReader to specify whether it should provide top-level documents or subdocuments? I'd even argue that top-level documents should be the default, because the DescriptorReader will mostly be used for batch processing where latency and memory consumption are not an issue. But I can see how changing the default might make other people unhappy.

Change History (2)

comment:1 Changed 4 years ago by atagar

  • Resolution set to implemented
  • Status changed from new to closed

Hi Karsten. I just pushed something that should make everyone happy...

https://gitweb.torproject.org/stem.git/commitdiff/ea0b73a5aa221fadafc2ba718a0ef42e151e5ad6

The DescriptorReader and parse_file() now have a 'document_handler' argument that has three options:

  • give me router status entries
  • give me a document with the router status entries
  • give me a document *without* reading the router status entries

https://stem.torproject.org/api/descriptor/descriptor.html#stem.descriptor.__init__.DocumentHandler
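The three options can be pictured with a toy parser. This is only a sketch of the idea, not Stem's implementation: `Handler`, `ToyDocument`, and `toy_parse` are hypothetical names, and the input is a plain list of strings rather than a real consensus.

```python
from enum import Enum


class Handler(Enum):
    """Hypothetical stand-in for the three document_handler options."""
    ENTRIES = "entries"              # yield each router status entry
    DOCUMENT = "document"            # yield one document holding its entries
    BARE_DOCUMENT = "bare_document"  # yield one document, entries left unread


class ToyDocument:
    def __init__(self, version, routers):
        self.version = version
        self.routers = routers


def toy_parse(version, raw_entries, handler=Handler.ENTRIES):
    """Yield entries or a single document, depending on the handler."""
    if handler is Handler.ENTRIES:
        for raw in raw_entries:
            yield raw.upper()  # pretend that upper() is the parsing work
    elif handler is Handler.DOCUMENT:
        yield ToyDocument(version, [raw.upper() for raw in raw_entries])
    elif handler is Handler.BARE_DOCUMENT:
        yield ToyDocument(version, [])  # entries are skipped entirely
    else:
        raise ValueError("unknown handler: %r" % (handler,))


raw = ["relay-a", "relay-b"]  # made-up entry names
entries = list(toy_parse(3, raw))
full = next(toy_parse(3, raw, Handler.DOCUMENT))
bare = next(toy_parse(3, raw, Handler.BARE_DOCUMENT))
```

The first mode streams entries one by one, while the other two always yield a single object per document. In Stem itself the choice is made through the enum linked above.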

To use the new argument, simply provide one of the enum values. For instance...

from stem.descriptor import parse_file, DocumentHandler

# Open in binary mode so the bytes reach the parser unmodified.
with open('/path/to/my/cached-consensus', 'rb') as document_file:
  # parse_file() returns an iterator; with DOCUMENT it holds a single document.
  document = next(parse_file(
    document_file,
    'network-status-consensus-3 1.0',
    document_handler = DocumentHandler.DOCUMENT,
  ))
  print("document version %i, had %i routers" % (document.version, len(document.routers)))

The 'next()' call is because parse_file() gives you an iterator, in this case containing a single value that's a NetworkStatusDocumentV3 instance.
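The point about next() is ordinary Python iterator behaviour and can be tried without Stem. A minimal sketch, where `single_document` is a hypothetical generator standing in for a parse_file() call that yields exactly one value:

```python
def single_document():
    """Hypothetical generator that yields exactly one value."""
    yield {"version": 3}


documents = single_document()
first = next(documents)         # the one and only document
second = next(documents, None)  # a default avoids StopIteration once exhausted
```

Passing a default as the second argument to next() is handy when the input might turn out to be empty.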

Feel free to reopen if this isn't what you wanted.

> The alternative, to iterate over status entries and look at every referenced status document to see if I saw that before or not, seems complicated.

Not really. All entries of a consensus reference the same document object, so you could simply have kept a set...

seen_documents = set()

for entry in my_descriptor_reader:
  # Only handle a document the first time one of its entries shows up.
  if entry.document not in seen_documents:
    seen_documents.add(entry.document)

    # ... do stuff...
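The set-based approach can be run end to end with stand-in objects. This is a sketch under assumptions: `ToyDocument` and `ToyEntry` are hypothetical classes, not Stem's, and the dedup works here because plain Python objects hash by identity and all entries of one document share the same object.

```python
class ToyDocument:
    def __init__(self, valid_after):
        self.valid_after = valid_after


class ToyEntry:
    def __init__(self, nickname, document):
        self.nickname = nickname
        self.document = document  # every entry points at its parent document


doc_a = ToyDocument("2013-01-01 00:00:00")
doc_b = ToyDocument("2013-01-01 01:00:00")
entries = [
    ToyEntry("relay-a", doc_a),
    ToyEntry("relay-b", doc_a),
    ToyEntry("relay-a", doc_b),
]

seen_documents = set()
unique_documents = []

for entry in entries:
    if entry.document not in seen_documents:
        seen_documents.add(entry.document)
        unique_documents.append(entry.document)  # first sighting, handle it once
```

Keeping a list next to the set preserves the order in which documents were first seen, which a set alone does not guarantee.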

> It probably doesn't even work for bandwidth weights which are parsed after the status entries.

As mentioned in our email exchange, this is wrong. The parser reads the header and footer first, *then* the router status entries in the middle.
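The reading order can be illustrated on a heavily simplified text layout. This is a sketch of the idea only, not how Stem is implemented: `split_sections` and `parse_weights` are hypothetical helpers, the keywords merely mimic a consensus, and the values are made up.

```python
TOY_CONSENSUS = """\
network-status-version 3
consensus-method 17
r relay-a
r relay-b
directory-footer
bandwidth-weights Wbd=285 Wbe=0
"""


def split_sections(text):
    """Split the toy text into header, entry, and footer lines."""
    lines = text.splitlines()
    first_entry = next(i for i, line in enumerate(lines) if line.startswith("r "))
    footer_start = lines.index("directory-footer")
    return lines[:first_entry], lines[first_entry:footer_start], lines[footer_start:]


def parse_weights(footer_lines):
    """Return the bandwidth-weights line as a dict of ints."""
    for line in footer_lines:
        if line.startswith("bandwidth-weights "):
            pairs = (item.split("=") for item in line.split()[1:])
            return {key: int(value) for key, value in pairs}
    return {}


header, entry_lines, footer = split_sections(TOY_CONSENSUS)
weights = parse_weights(footer)               # known before any entry is touched
entries = (line[2:] for line in entry_lines)  # lazy: entries are read last
```

Because the footer is located up front, the weights are available even if the entries generator is never consumed.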

Cheers! -Damian

comment:2 Changed 4 years ago by karsten

Looks awesome! I'm mostly interested in the ability to use DescriptorReader with the new document handler. Here's what I did and what worked just fine:

from stem.descriptor import DocumentHandler
from stem.descriptor.reader import DescriptorReader

with DescriptorReader('in/consensuses-2013-01/',
    document_handler=DocumentHandler.DOCUMENT) as reader:
  for document in reader:
    print("document version %i, had %i routers" % (
        document.version, len(document.routers)))

Thanks!
