Opened 8 years ago

Closed 7 years ago

#8255 closed enhancement (implemented)

Parse exit list entry

Reported by: atagar Owned by: atagar
Priority: Low Milestone:
Component: Archived/Stem Version:
Severity: Keywords:
Cc: karsten Actual Points:
Parent ID: #8252 Points:
Reviewer: Sponsor:


Published by DNSEL or TorBEL to indicate what ip address exit relay X had at timestamp Y. Karsten mentioned on #7987 that this may be relevant for a metrics-lib replacement.

Child Tickets

Change History (9)

comment:1 Changed 8 years ago by karsten

Cc: karsten added

comment:2 Changed 7 years ago by arma

The current format is a pile of these stanzas:

ExitNode 013D5DCA8348E51616BD2665CA73DEE5120E2D45
Published 2013-07-04 12:38:33
LastStatus 2013-07-04 13:02:45
ExitAddress 2013-07-04 13:10:52

comment:3 Changed 7 years ago by arlolra

Status: newneeds_review

comment:4 Changed 7 years ago by arlolra

Up to 3 commits now, so you're better off just looking at the branch:

comment:5 Changed 7 years ago by atagar

Status: needs_reviewneeds_revision

Hi arlolra, sorry about the delay and thanks for the patch!

This looks fantastic, only a few minor questions and comments...

+Parsing for TorDNSEL files.
+Defined in

Thanks for including the link. However, we probably just need it once and it's already in TorDnsel's pydocs so just only include it there.

+  _read_until_keywords("ExitNode", tordnsel_file, skip = True)
+  while True:
+    contents = _read_until_keywords("ExitAddress", tordnsel_file)
+    contents += _read_until_keywords("ExitNode", tordnsel_file)
+    if contents:
+      yield TorDnsel(bytes.join(b"", contents), validate, **kwargs)
+    else:
+      break  # done parsing file

This will skip anything prior to the first 'ExitNode'. The TorDNSEL spec is a tad less formal than other spec types so this is probably appropriate though in most other cases extra content like this is disallowed.

The spec says entries are...

(ExitNode | Published | LastStatus | ExitAddress*)*

Which indicates seems to indicate that the ExitAddress appears zero or more times. Won't the above have problems if ExitAddress doesn't appear for an entry?

+class TorDnsel(Descriptor):

Mind if we rename this class to TorDNSEL?

+  :var list exit_addresses: list of (str address, datetime date) tuples consisting of the found exit address and the time

This should have an asterisk too since the list will always exist (it has a default value of being empty), and lets also mention that it's IPv4 addresses.

+    for keyword, values in entries.items():
+      value, _ = values[0]

Lets balk with a validation error if we get any block content.

+      elif keyword == "ExitAddress":
+        for value, _ in values:
+          address, date = value.split(" ", 1)
+          if validate and not stem.util.connection.is_valid_ipv4_address(address):
+            raise ValueError("ExitAddress isn't a valid IPv4 address: %s" % address)
+          try:
+            date = datetime.datetime.strptime(date, "%Y-%m-%d %H:%M:%S")
+          except ValueError:
+            if validate:
+              raise ValueError("ExitAddress found time wasn't parsable: %s" % value)
+          self.exit_addresses.append((address, date))

This worries me a little bit because if 'validate' is False then the date we append could be either a datetime or str. Lets continue (skip the entry) when date can't be parsed and validate is False.

+    descriptors = list(_parse_file(io.BytesIO(desc_text)))
+    self.assertEqual(3, len(descriptors))
+    self.assertTrue(isinstance(descriptors[0], TorDnsel))
+    self.assertEqual(3, len(descriptors[1].exit_addresses))

Would you mind some assertions about the parsed values? Also, a test or two with invalid content would be nice. For instance, we should probably have a test that parses content without any ExitAddress considering the issue mentioned earlier.

Cheers! -Damian

comment:6 Changed 7 years ago by arlolra

Status: needs_revisionneeds_review

Thanks for the review.

The spec does indeed say,

(ExitNode | Published | LastStatus | ExitAddress*)*

but it also says, The document consists of an arbitrary number of entries each consisting of *four* or more lines.
and *One* or more test results for this relay consisting of the found exit address and the time, in UTC, when the test was performed.

So, although it uses the regex star, I think it means plus.

(ExitNode | Published | LastStatus | ExitAddress+)*

Could you interpret it that way? Maybe we should get that updated?

The rest of your comments were implemented. Again, see the squashed branch,

comment:7 Changed 7 years ago by atagar

Looks great! TorDNSEL support pushed, thanks for the addition!

Just to confirm, are you fine with this and your future stem contributions being in the public domain? For an explanation of why I need to ask see here.

Could you interpret it that way? Maybe we should get that updated?

Certainly sounds like it. The spec starts with "This document is an attempt to make sense of the exit list format that TorDNSEL publishes without the original TorDNSEL author being around." so I suppose anything it says should be taken with a grain of salt.

comment:8 in reply to:  7 Changed 7 years ago by arlolra

Thanks for merging.

Just to confirm, are you fine with this and your future stem contributions being in the public domain?

Yup, happy to put that work in the public domain.

comment:9 Changed 7 years ago by atagar

Resolution: implemented
Status: needs_reviewclosed

Great! Resolving, feel free to reopen if you spot anything that we should tweak.

Note: See TracTickets for help on using tickets.