Opened 2 years ago

Closed 22 months ago

#20994 closed defect (fixed)

invalid first_seen timestamp on bridges

Reported by: cypherpunks Owned by: metrics-team
Priority: Medium Milestone: Onionoo 3.2-1.1.0
Component: Metrics/Onionoo Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

https://lists.torproject.org/pipermail/metrics-team/2016-December/000247.html

David Fifield wrote:
"
Some Onionoo bridges have first_seen="1970-01-01 00:00:00", while
last_seen and last_restarted are set to reasonable dates. For example,

https://onionoo.torproject.org/details?type=bridge&limit=100&fields=hashed_fingerprint,first_seen,last_seen,last_restarted
{"hashed_fingerprint":"29AC664582027DFAF94528F989951D68115252F0","last_seen":"2016-12-14 14:41:03","first_seen":"1970-01-01 00:00:00","last_restarted":"2016-11-15 15:12:39"},
{"hashed_fingerprint":"4E0727172580EE302138D3DD33A04CD23D650F0F","last_seen":"2016-12-10 08:41:03","first_seen":"1970-01-01 00:00:00","last_restarted":"2016-12-09 09:06:38"},

What does this mean? I never saw "1970-01-01 00:00:00" in the other two
date fields, only in first_seen. I would expect, based on the
documentation, that if a bridge had been seen in only one bridge network
status, that first_seen would equal last_seen.

https://onionoo.torproject.org/protocol.html#details

  • last_seen string required UTC timestamp (YYYY-MM-DD hh:mm:ss) when this bridge was last seen in a bridge network status.
  • first_seen string required UTC timestamp (YYYY-MM-DD hh:mm:ss) when this bridge was first seen in a bridge network status.

I looked briefly at the onionoo source code. The only explicit reference
to "1970-01-01 00:00:00" I found was in in
org.torproject.onionoo.docs.UpdateStatus.setFromDocumentString, when a
timestamp fails to parse.
https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/onionoo/docs/UpdateStatus.java?id=onionoo-3.1-1.0.0#n24

public void setFromDocumentString(String documentString) {
try {

this.updatedMillis = Long.parseLong(documentString.trim());

} catch (NumberFormatException e) {

log.error("Could not parse timestamp '" + documentString + "'. "

+ "Setting to 1970-01-01 00:00:00.");

this.updatedMillis = 0L;

}

}

I was able to find one of the hashed_fingerprints above in a recent
bridge status document (published 2016-12-14 09:41:03). It looks okay to
me, so I don't suspect it's a parsing issue.

29AC664582027DFAF94528F989951D68115252F0:
r Unnamed KaxmRYICffr5RSj5iZUdaBFSUvA 0PeiGiVwFHDcaLTK4UYpZQXhaYg 2016-12-14 12:57:51 10.107.79.18 53924 0
s Fast HSDir Running Stable V2Dir Valid
w Bandwidth=60
p reject 1-65535
"

Child Tickets

Change History (17)

comment:1 Changed 23 months ago by karsten

I looked more into this issue and found some interesting facts:

  • Roughly 22% of all bridges are affected.
  • 0.13% of bridges have last_seen and first_seen dates of 1970-01-01.
  • Relays are completely unaffected.
  • This started on 2014-12-09, which is the earliest last_seen date of a bridge with an invalid first_seen date of 1970-01-01.

Staring at code didn't reveal any insights, so I'll set up a new local Onionoo instance to see whether that produces first_seen dates of 1970-01-01 in the initial run or subsequent runs.

comment:2 Changed 23 months ago by karsten

The local run produced four bridges with first_seen date of 1970-01-01.

I looked a bit more, and it seems that this only affects bridges that we have server descriptors (or extra-info descriptors) for but that are not listed in any of the statuses we have.

The following code fixes the problem, but it might be a bad fix:

diff --git a/src/main/java/org/torproject/onionoo/updater/NodeDetailsStatusUpdater.java b/src/main/java/org/torproject/onionoo/updater/NodeDetailsStatusUpdater.java
index a66aba8..a8f5807 100644
--- a/src/main/java/org/torproject/onionoo/updater/NodeDetailsStatusUpdater.java
+++ b/src/main/java/org/torproject/onionoo/updater/NodeDetailsStatusUpdater.java
@@ -335,7 +335,7 @@ public class NodeDetailsStatusUpdater implements DescriptorListener,
     detailsStatus.setAdvertisedBandwidth(advertisedBandwidth);
     detailsStatus.setPlatform(descriptor.getPlatform());
     this.documentStore.store(detailsStatus, fingerprint);
-    this.updatedNodes.add(fingerprint);
+    //this.updatedNodes.add(fingerprint);
   }
 
   private void processBridgeExtraInfoDescriptor(
@@ -352,7 +352,7 @@ public class NodeDetailsStatusUpdater implements DescriptorListener,
           descriptor.getPublishedMillis());
       detailsStatus.setTransports(descriptor.getTransports());
       this.documentStore.store(detailsStatus, fingerprint);
-      this.updatedNodes.add(fingerprint);
+      //this.updatedNodes.add(fingerprint);
     }
   }

It might be that we should leave these lines in and even add similar lines for relays of which we only have server descriptors and extra-info descriptors. And then we make sure that 0L is a special value for first_seen meaning "never seen before", rather than "seen on 1970-01-01". Needs more thoughts and testing.

comment:3 Changed 23 months ago by karsten

Status: newneeds_review

Okay, I'm more confident now that the fix suggested above is sane. Here's why:

The updatedNodes data structure has the sole purpose of creating a set of all fingerprints for updateNodeDetailsStatuses(), which is step 4 of 4 in the update process.

However, it doesn't make any sense to include a fingerprint which was never seen in a consensus or bridge network status and for which we don't have a NodeStatus instance. There's nothing that the non-existent NodeStatus could contribute to the DetailsStatus, and there's no need to create a mostly empty NodeStatus for a node that was never listed in a consensus or bridge network status. In fact, we shouldn't even return the node in a query, because the only proof of its existence comes from self-published descriptors.

Regarding the updatedNodes set, we can as well go through knownNodes. In fact, at one point we're adding all knownNodes to updatedNodes anyway, so they contain the same entries. We can simply kick out updatedNodes and use knownNodes in updateNodeDetailsStatuses().

Here's what I think happened that led to this bug: We learned about a bridge from a bridge server descriptor and created a mostly-empty NodeStatus for it in updateNodeDetailsStatuses(), including a first_seen date of 1970-01-01. At that time we didn't see this bridge in any bridge network status. But some time later we did see it in a status, which is how last_seen was updated. With the fix we wouldn't have a NodeStatus until first seeing the bridge in a bridge network status.

There, the good news is that this very likely fixes the issue. The bad news is that I'll have to set up a temporary Onionoo instance and import all bridge network statuses in order to get correct first_seen dates for all bridges. I'll start doing that now, but it could easily take days or even weeks. Transitioning to the new, correct data will be fun, but there's still time to think about making that transition smoother.

Please review my branch task-20994.

comment:4 Changed 23 months ago by iwakeh

The reasoning in comment:3 makes sense; the code seems to do what is suggested, but test(s) missing.

Couldn't there also be relays affected, b/c of the cached descriptor imports?

comment:5 in reply to:  4 ; Changed 23 months ago by karsten

Replying to iwakeh:

The reasoning in comment:3 makes sense; the code seems to do what is suggested, but test(s) missing.

Cool, thanks for looking! There are indeed no tests for this class. But I'm afraid it's a small project to start writing tests for this class now, and that shouldn't block the bugfix.

Couldn't there also be relays affected, b/c of the cached descriptor imports?

Good question. The answer is that relays are not affected, because processRelayServerDescriptor() did not contain a line this.updatedNodes.add(fingerprint); as it was contained in processBridgeServerDescriptor(). We only put relay fingerprints into that set that we saw in consensuses, and that's how they all got valid first_seen timestamps.

So, good to merge?

comment:6 in reply to:  5 Changed 23 months ago by iwakeh

Status: needs_reviewmerge_ready

Replying to karsten:

Replying to iwakeh:

The reasoning in comment:3 makes sense; the code seems to do what is suggested, but test(s) missing.

Cool, thanks for looking! There are indeed no tests for this class. But I'm afraid it's a small project to start writing tests for this class now, and that shouldn't block the bugfix.

Hmm, I prefer having a coded proof of the fix. It doesn't seem such a huge task from merely looking at the class.

Well, I'll add a ticket for writing this test.

Couldn't there also be relays affected, b/c of the cached descriptor imports?

Good question. The answer is that relays are not affected, because processRelayServerDescriptor() did not contain a line this.updatedNodes.add(fingerprint); as it was contained in processBridgeServerDescriptor(). We only put relay fingerprints into that set that we saw in consensuses, and that's how they all got valid first_seen timestamps.

So, good to merge?

Given that you're currently system-testing this fix (by re-importing) ok.
Ready for merge.

comment:7 Changed 23 months ago by iwakeh

derived ticket for test is #21171

comment:8 Changed 23 months ago by karsten

Great, merged! And thanks for opening that test ticket. Leaving this ticket open until the reprocessing of bridge archives is completed and we have a plan for updating.

comment:9 Changed 22 months ago by karsten

Status: merge_readyneeds_review

Please review the last commit in my task-20994-2 branch. We'll need this change in order to fix existing Onionoo instances. The earlier change (which I tested successfully on an EC2 instance) only prevented this bug in fresh Onionoo instances. But we should be able to fix existing instances by throwing in archived bridge statuses. I'm still testing this branch locally, but I figured we can parallelize testing and review to get this merged sooner. Thanks!

comment:10 Changed 22 months ago by karsten

Okay, local tests look good. I took a backup from the main Onionoo instance and imported archived bridge statuses since 2014-12. I uploaded the status/summary file from before (31M) and after (32M) importing bridge statuses, please take a look if you want. I'll also keep a backup when updating, so that we can roll back the day or two after without any issues. Just in case. But I think we're good here, assuming that the code review does not uncover any major issues.

comment:11 in reply to:  9 Changed 22 months ago by iwakeh

Replying to karsten:

Please review the last commit in my task-20994-2 branch. We'll need this change in order to fix existing Onionoo instances.

Good idea!
Did I overlook the test for this change?

comment:12 in reply to:  10 ; Changed 22 months ago by iwakeh

Replying to karsten:

Okay, local tests look good. I took a backup from the main Onionoo instance and imported archived bridge statuses since 2014-12. I uploaded the status/summary file from before (31M) and after (32M) importing bridge statuses, please take a look if you want. I'll also keep a backup when updating, so that we can roll back the day or two after without any issues. Just in case. But I think we're good here, assuming that the code review does not uncover any major issues.

How did the 37 (mostly 'ki') bridges sneak in? (grep [-c] "1970-01-01" summary-after)

b	ki	03F7D8305D9924E9C26FA29A4DB556AA62605F3E	10.103.149.101;;	2017-01-23	08:41:04	54588	0	V2Dir,Valid	-1	??		-1	null	null	1970-01-01	00:00:00	null	null	null		null	::
b	ki	0B35992DAC9039C0C1B5F3E3277B6D52D0E01D0C	10.86.91.208;;	2017-01-23	08:41:04	59449	0	V2Dir,Valid	-1	??		-1	null	null	1970-01-01	00:00:00	null	null	null		null	::
b	ki	0CE94B36871A2A9311407C477FAC2C5F7F33C473	10.96.43.158;;	2017-01-23	08:41:04	54636	0	V2Dir,Valid	-1	??		-1	null	null	1970-01-01	00:00:00	null	null	null		null	::
b	mullbinde12	0F6EFF59CA04EC8CD2B8ED4920660B624721E24D	10.3.54.27;;	2017-01-23	08:41:04	52317	0	Fast,Running,V2Dir,Valid	-1	??		-1	null	null	1970-01-01	00:00:00	null	null	null		null	::
...

comment:13 in reply to:  12 Changed 22 months ago by karsten

Status: needs_reviewneeds_revision

Replying to iwakeh:

How did the 37 (mostly 'ki') bridges sneak in? (grep [-c] "1970-01-01" summary-after)

Oh, those are still included with date 1970-01-01, because CollecTor's bridge descriptor archives have been created a bit before I created the Onionoo snapshot for this test. Note how the examples you gave all have 2017-01-23 as last_seen date. I guess that most or all of those entries would be fixed when using newly created archives from after taking the snapshot.

Regarding tests, I'll look into writing a test for the NodeStatus change in a bit.

comment:14 Changed 22 months ago by karsten

Status: needs_revisionneeds_review

Please find this new test for the NodeStatus change.

comment:15 Changed 22 months ago by iwakeh

Milestone: Onionoo 3.2-1.1.0

comment:16 Changed 22 months ago by iwakeh

Status: needs_reviewmerge_ready

comment:17 Changed 22 months ago by karsten

Resolution: fixed
Status: merge_readyclosed

Merged, will deploy once there's a release which should happen later today. Closing. Thanks!

Note: See TracTickets for help on using tickets.