Opened 5 years ago

Closed 5 years ago

#10725 closed enhancement (duplicate)

Write a Completely Spec-Compliant Bridge Descriptor Parser

Reported by: sysrqb Owned by: isis
Priority: Medium Milestone:
Component: Circumvention/BridgeDB Version:
Severity: Keywords: bridgedb-parsers
Cc: isis, atagar, sysrqb Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

This would be very useful in many contexts (and it's inexistance has been noted recently on numerous occassions). Now, someone should write it and we should decide where it should live. Is it a stand-alone module? Is it integrated into Stem? Do we maintain both so we can appease the small scripts that don't need Stem?

Note: This should support both sanitized and unsanitized descriptors.

Adjust the Component when something is decided.

Child Tickets

Change History (15)

comment:1 Changed 5 years ago by atagar

Dumb question perhaps, but which spec exactly are we talking about? A bridge's server descriptor or something else?

comment:2 Changed 5 years ago by sysrqb

Not a dumb question at all. hmm, it's been a while since I looked at the api but it looks like everything we need is already supported (excellent!), including the non-standard networkstatus - maybe we should write a spec for it so that a parser can be spec compliant :) I was under the impression we needed more. isis, am I missing something? Maybe this is simply a dupe of #9380? Maybe we should see if there are any problems with parsing the unsanitized versions of the three docs and provide patches, if needed.

comment:3 Changed 5 years ago by atagar

maybe we should write a spec for it so that a parser can be spec compliant

Server descriptors already have a spec. Sanitized bridge server descriptors (which is what stem's BridgeDescriptor class represents) is simply that but with a router-digest and method to check for scrubbing issues (metrics bridge descriptor specification).

Again, if you're actually talking about a different descriptor type then let me know.

comment:4 Changed 5 years ago by sysrqb

sorry, I guess I wasn't clear, I was referring to the networkstatus document, because it's not exactly a v2 networkstatus.

comment:5 Changed 5 years ago by sysrqb

As an initial example, Stem currently expects that the 'published' line is the first line of the ns, but it is actually the second line:

@type bridge-network-status 1.0
published 2010-12-27 22:07:03

I'll try to open tickets and provide patches when I have any problems.

comment:6 Changed 5 years ago by atagar

I think this is a mistake. Stem handles the @type annotations. In fact, Karsten added them on my request. :P

If you go through either parse_file() or the descriptor reader (which are the two ways you should be getting descriptors) then the @type annotation is used to determine the document type...

https://gitweb.torproject.org/stem.git/blob/HEAD:/stem/descriptor/__init__.py#l79

comment:8 in reply to:  6 Changed 5 years ago by sysrqb

Replying to atagar:

I think this is a mistake. Stem handles the @type annotations. In fact, Karsten added them on my request. :P

If you go through either parse_file() or the descriptor reader (which are the two ways you should be getting descriptors) then the @type annotation is used to determine the document type...

Except for when this is specifically checked before _parse_file() is called. :(
https://gitweb.torproject.org/stem.git/blob/HEAD:/stem/descriptor/networkstatus.py#l1460

    document_file = io.BytesIO(raw_content)
    published_line = stem.util.str_tools._to_unicode(document_file.readline())

    if published_line.startswith("published "):
      ...
    elif validate:
      raise ValueError("Bridge network status documents must start with a 'published' line:\n%s" % stem.util.str_tools._to_unicode(raw_content))

comment:9 Changed 5 years ago by atagar

I'm not sure what you're trying to say. Exactly what code calling stem with what input is failing?

comment:10 Changed 5 years ago by karsten

I also believe that stem can already parse sanitized bridge network statuses. (Though I didn't try that yet.)

But regarding unsanitized bridge network statuses, please note that they contain neither the @type annotation nor a "published" line. metrics-db adds those published lines to sanitized statuses to make them easier to handle. So, I guess stem shouldn't break if there's no published line, but simply not provide the publication time in the parsed object.

comment:11 in reply to:  9 Changed 5 years ago by sysrqb

Replying to atagar:

I'm not sure what you're trying to say. Exactly what code calling stem with what input is failing?

It's completely possible I'm doing something wrong.

from stem.descriptor.networkstatus import BridgeNetworkStatusDocument
...
    with open(fn_ns, 'r') as fd_ns:
      ns_content = fd_ns.read()
    bridge_ns = BridgeNetworkStatusDocument(ns_content)

Where fn_ns is the filename of the networkstatus in the current directory. This yields:

Traceback (most recent call last):
  File "bridge_attributes.py", line 366, in <module>
    bridge_ns = BridgeNetworkStatusDocument(ns_content)
  File "/home/sysrqb/.virtualenvs/stem/lib/python2.7/site-packages/stem/descriptor/networkstatus.py", line 1466, in __init__
    raise ValueError("Bridge network status documents must start with a 'published' line:\n%s" % stem.util.str_tools._to_unicode(raw_content))
ValueError: Bridge network status documents must start with a 'published' line:
@type bridge-network-status 1.0
published 2014-01-20 19:37:04
r Unnamed ADXqKmHijTlfCArKIkRTlJDnCVA LKE/QQE82+iECLEEYsTTp6L7EUw 2014-01-20 17:05:06 10.243.41.56 443 0
...

(It actually prints the entire document, but I won't paste it)

So what I was trying to say was that it looks like BridgeNetworkStatusDocument only reads the first line of the BytesIO object and fails to validate because the first lines does not start with 'published '. Am I providing incorrect input?

comment:12 Changed 5 years ago by atagar

Am I providing incorrect input?

Yup. The descriptor classes aren't intended for direct use by users. Something like the following should do the trick...

from stem.descriptor import parse_file

bridge_ns = parse_file(fn_ns).next()

The parse_file() function gives you an iterator of the descriptors in the file, hence the next() if it just contains one.

comment:13 in reply to:  12 Changed 5 years ago by sysrqb

Replying to atagar:

Am I providing incorrect input?

Yup. The descriptor classes aren't intended for direct use by users.

Thanks for helping, sorry I misunderstood. parse_file() works great. Now that I know how to use this I'll test the unsanitized docs on ponticum and it it works.

Thanks again.

comment:14 Changed 5 years ago by isis

Cc: sysrqb added
Owner: set to isis
Status: newaccepted

comment:15 Changed 5 years ago by isis

Resolution: duplicate
Status: acceptedclosed

Marking as a duplicate of #9380 and noting it there.

Note: See TracTickets for help on using tickets.