Opened 3 years ago

Closed 4 months ago

Last modified 3 months ago

#16513 closed enhancement (fixed)

Make writing of the out/ directory from the status/ directory deterministic

Reported by: karsten Owned by: iwakeh
Priority: Very High Milestone: Onionoo 2.0.0
Component: Metrics/Onionoo Version:
Severity: Normal Keywords: metrics-2018
Cc: Actual Points:
Parent ID: Points:
Reviewer: iwakeh Sponsor:

Description

#13600 introduced three new execution modes with a separate mode for writing files in the out/ directory. However, a newly written out/ directory may differ from a previous out/ directory even if the status/ directory has not changed. The main (or only?) reason is that documents containing graphs only contain the past 3 days, 1 week, etc. of data ending at the current system time when writing documents. We could change that and write the last known 3 days, 1 week, etc. of data to make the process of writing the out/ directory from data in the status/ directory more deterministic. We should try changing that to see if there are other differences. If it's easy, let's do it.

Child Tickets

Change History (49)

comment:1 Changed 3 years ago by karsten

Two related thoughts:

  • Include exit list entries if they are more than 24 hours older than the last network status consensus containing the relay, rather than 24 hours prior to system time.
  • Make first_seen_days and last_seen_days parameters independent of system time by using publication time of latest network status instead of current system time. This has the advantage that results don't suddenly change without relays_published or bridges_published changing at the same time.

comment:2 Changed 10 months ago by iwakeh

Severity: Blocker

This includes comparing the various documents from different instances after deployment in order to find out if all is as 'deterministic' as expected.

comment:3 Changed 10 months ago by iwakeh

Milestone: Onionoo-1.4.0

comment:4 Changed 10 months ago by iwakeh

Milestone: Onionoo-1.4.0Onionoo-1.5.0

comment:5 Changed 9 months ago by iwakeh

Milestone: Onionoo-1.5.0Onionoo-1.6.0

Move to next milestone in order to accommodate interim release.

comment:6 Changed 9 months ago by karsten

Keywords: metrics-2018 added

comment:7 Changed 9 months ago by karsten

Keywords: metrics-2017 added; metrics-2018 removed

comment:8 Changed 9 months ago by karsten

Owner: set to metrics-team
Status: newassigned

comment:9 Changed 9 months ago by iwakeh

Milestone: Onionoo-1.6.0Onionoo-1.7.0

comment:10 Changed 7 months ago by iwakeh

Milestone: Onionoo-1.7.0Onionoo-1.8.0

comment:11 Changed 7 months ago by karsten

Status: assignedneeds_review

I already gave this a try a few months ago, and yesterday I went through that code again and improved it quite a bit. I think it's ready for review now. Please take a careful look at my task-16513 branch.

Note that I did not consider the two related thoughts stated above. Those are not directly related to writing the files in out/. Still worth considering, but as part of another (new) ticket.

comment:12 Changed 7 months ago by iwakeh

Milestone: Onionoo-1.8.0Onionoo-2.0.0

comment:13 Changed 7 months ago by iwakeh

Reviewer: iwakeh

Adding these to my reviewing list.

comment:14 Changed 6 months ago by iwakeh

Keywords: metrics-2018 added; metrics-2017 removed

Will be completed in 2018.

comment:15 Changed 5 months ago by karsten

Status: needs_reviewneeds_revision

I'm going to revise my branch based on an idea I had yesterday related to #24831: let's use the time when we last saw a relay in a consensus (and a bridge in a status) to determine which graphs to produce. This has advantages over other approaches:

  • It does not depend on system time, which is what we're trying to resolve here.
  • It does not rely on statistics reported by the relay or bridge, which breaks if a relay or bridge has a broken clock, and which may lead to different documents (bandwidth, weights, etc.) showing somewhat different time periods.

Branch follows, but I'll have to implement and test the idea first.

comment:16 Changed 5 months ago by iwakeh

Severity: BlockerNormal

'blocker' seems not appropriate here. -> setting to 'normal'

comment:17 Changed 5 months ago by karsten

Priority: MediumHigh
Status: needs_revisionneeds_review

Please review commit 0de5770 in my task-16513-2 branch. Setting priority to high, because I'd like to base my #24831 work on this branch (or a revised version of it), and it would be cool to move forward with that while having a possible plan in mind. (Thanks!)

comment:18 Changed 5 months ago by iwakeh

Status: needs_reviewneeds_revision

Test failure:

    [junit] Testcase: testIgnoreFuture took 0.392 sec
    [junit] 	FAILED
    [junit] expected:<1> but was:<0>
    [junit] junit.framework.AssertionFailedError: expected:<1> but was:<0>
    [junit] 	at org.torproject.onionoo.writer.BandwidthDocumentWriterTest.testIgnoreFuture(BandwidthDocumentWriterTest.java:74)

comment:19 in reply to:  18 Changed 5 months ago by karsten

Status: needs_revisionneeds_review

Replying to iwakeh:

Test failure:

Uhm, yes. Please find the fixup commit that I just pushed to the same branch. (I changed that test a couple of times, and it looks like it ended up being in a mixed state at the end.)

comment:20 Changed 5 months ago by iwakeh

Ah, tests work now; continuing review, thanks.

comment:21 Changed 5 months ago by iwakeh

Status: needs_reviewmerge_ready

All tests and checks pass and code looks ok.

Shouldn't the new method parameter 'long now' (in various methods) be named differently to reflect the meaning of the milliseconds timestamp? Maybe 'long lastMeasurementMillis' or similar?

How will the data be affected when deploying the fix? The stored data timestamps might be more recent than the last seen value used in the new code. Did you run system tests on a local instance?
This should be clarified before deployment, but the code is merge ready.

comment:22 in reply to:  21 Changed 5 months ago by karsten

Replying to iwakeh:

All tests and checks pass and code looks ok.

Awesome!

Shouldn't the new method parameter 'long now' (in various methods) be named differently to reflect the meaning of the milliseconds timestamp? Maybe 'long lastMeasurementMillis' or similar?

Good point. Let's go with lastSeenMillis which is what we use in NodeStatus, too. I'll change that before merging.

How will the data be affected when deploying the fix? The stored data timestamps might be more recent than the last seen value used in the new code. Did you run system tests on a local instance?
This should be clarified before deployment, but the code is merge ready.

I'll run another local test before merging. Once that's done, let's coordinate deployment as usual.

Thanks for the review!

comment:23 Changed 5 months ago by karsten

Status: merge_readyneeds_review

I renamed now to lastSeenMillis in commit 27794f7, and I realized that we'll have to make similar changes to updating status which I did in commit 7e99b95. Please take another look at least at that second commit.

Right now I'm downloading omeiense's status/ and out/ directories and will run a local Onionoo server for a few hours, just to be really sure that it's doing the right thing.

comment:24 Changed 5 months ago by iwakeh

Status: needs_reviewmerge_ready

Replying to karsten:

I renamed now to lastSeenMillis in commit 27794f7, and I realized that we'll have to make similar changes to updating status which I did in commit 7e99b95. Please take another look at least at that second commit.

Good that you thought of this!
When taking another look at the tests, I'd say it's time to start using new DateTime classes. For example:

diff --git a/src/test/java/org/torproject/onionoo/docs/WeightsStatusTest.java b/src/test/java/org/torproject/onionoo/docs/WeightsStatusTest.java
index e28f989..7169aaa 100644
--- a/src/test/java/org/torproject/onionoo/docs/WeightsStatusTest.java
+++ b/src/test/java/org/torproject/onionoo/docs/WeightsStatusTest.java
@@ -9,6 +9,8 @@ import static org.junit.Assert.assertTrue;
 
 import org.junit.Test;
 
+import java.time.ZoneOffset;
+import java.time.LocalDateTime;
 import java.util.Arrays;
 import java.util.Map;
 import java.util.SortedMap;
@@ -101,7 +103,8 @@ public class WeightsStatusTest {
     }
     assertEquals("history: ", correctHistory, mapToString(ws.getHistory()));
     assertEquals(correctLines.length, ws.getHistory().size());
-    ws.compressHistory(1515670935476L);
+    ws.compressHistory(LocalDateTime.of(2017, 01, 01, 7, 15, 0, 0)
+        .toInstant(ZoneOffset.UTC).toEpochMilli());
     assertEquals("found: " + mapToString(ws.getHistory()), 1,
         ws.getHistory().size());
     assertEquals("[1431032400000, 1431043200000] : [-1.0, 1.78279826E-4, "

This is a bit longer, but will make the tests more readable and it will be immediately obvious what timestamp is used in a test. Maybe, start at least with the newly introduced tests here?

Right now I'm downloading omeiense's status/ and out/ directories and will run a local Onionoo server for a few hours, just to be really sure that it's doing the right thing.

Tests and checks pass. Ready for merge when the system test succeeds.

comment:25 Changed 5 months ago by karsten

Hmm, or should we replace all those milliseconds in the code with Java 8 date/time constructs all in a single step? I have to admit, I can't believe that that's the most elegant way of converting seven numbers into one number, yet you may be right. But I'd want to read up on Java 8 things first before changing things. (We do have a ticket for that, right?) Happy to add a ISO string what the long number means as a comment next to that line.

System tests have not even begun. I'm still working on downloading the status/ directory from omeiense...

comment:26 Changed 5 months ago by iwakeh

There is #24870 for Onionoo's date-time changes. Just leave the code here as is and do all changes together.

comment:27 Changed 5 months ago by karsten

Status: merge_readyneeds_review

Would you be able to review the last four commits in my updated task-16513-2 branch? It contains several fixes from local tests with omeiense's status/ directory.

I'm currently running another test for the last commit in that branch, and I'm considering a rewrite of the code that produces a graph history, so that we can reduce code duplication at least a bit. I'll see if that last part can be done within reasonable time, and if not, put it on hold again.

comment:28 in reply to:  27 ; Changed 5 months ago by iwakeh

Replying to karsten:

Would you be able to review the last four commits in my updated task-16513-2 branch? It contains several fixes from local tests with omeiense's status/ directory.

From skimming the commits:
Could you easily add simple tests for the comparison fix and the one-off-bug? These will document clearly what is intended with the current code.

I'm currently running another test for the last commit in that branch, and I'm considering a rewrite of the code that produces a graph history, so that we can reduce code duplication at least a bit. I'll see if that last part can be done within reasonable time, and if not, put it on hold again.

Will take a closer look at all commits soon.

comment:29 in reply to:  28 Changed 5 months ago by karsten

Replying to iwakeh:

Replying to karsten:

Would you be able to review the last four commits in my updated task-16513-2 branch? It contains several fixes from local tests with omeiense's status/ directory.

From skimming the commits:
Could you easily add simple tests for the comparison fix and the one-off-bug? These will document clearly what is intended with the current code.

Not easily, but I'll give it a try.

comment:30 Changed 5 months ago by karsten

Pushed one test.

comment:31 in reply to:  30 Changed 5 months ago by iwakeh

Replying to karsten:

Pushed one test.

Cool, thanks!

comment:32 Changed 5 months ago by iwakeh

Status: needs_reviewmerge_ready

All checks and tests pass; good to start using more java.time classes.
Would it be possible to rename the one-letter variables in BandwidthDocumentWriter; there are other places, too, but let's start slowly.
Merge ready, when all system tests ran/are running fine.

comment:33 Changed 5 months ago by karsten

Status: merge_readyneeds_review

Thanks for checking. I just added one small (but important) fixup commit (c915794) and another big (and also important) refactoring commit (699d826) to my branch. That second commit reduces code duplication by a lot, which I think is an important step towards making this code more maintainable by other people than me in the future.

However, I'm running out of time here. The new code is way better documented than usual, it compiles, checkstyle doesn't have complaints, and all tests pass, including new ones. But I can't spend another day on running system tests, and I do expect the output to be somewhat different from current output. But each difference needs investigation, and maybe the old output was wrong and the new output is right.

iwakeh or maybe irl, can I ask either of you to take over here? Reviewing and testing this code is not a trivial task, and it will almost certainly take longer than a few hours. But I could imagine it's a good way to dive deeper into this part of Onionoo, and at the same time ensure that this code will be more readable.

comment:34 Changed 5 months ago by iwakeh

Owner: changed from metrics-team to iwakeh
Status: needs_reviewaccepted

Sure, will run system tests and provide fixes, if necessary, as part of the review.

comment:35 Changed 5 months ago by karsten

Thank you!

comment:36 Changed 5 months ago by iwakeh

The list of exit_addresses is one more data item in onionoo's details documents that is depending on the current system time. Depending on the current time the addresses are added or omitted to the stored collection (cf. NodeDetailsStatusUpdater). Could the validAfter time of the latest consensus processed be a replacement for the system time?

Last edited 5 months ago by iwakeh (previous) (diff)

comment:37 in reply to:  36 ; Changed 5 months ago by karsten

Replying to iwakeh:

The list of exit_addresses is one more data item in onionoo's details documents that is depending on the current system time. Depending on the current time the addresses are added or omitted to the stored collection (cf. NodeDetailsStatusUpdater). Could the validAfter time of the latest consensus processed be a replacement for the system time?

Yes. We should probably do that. I haven't tried it out yet, nor have I looked around for similar places where we're depending on system time there.

However! That's a different issue. It's about making the update step less dependent on things like system time, though we won't be able to make it fully deterministic. Still a good thing to do. But it's different from making the write step deterministic where we're taking a status/ directory as input and writing files to the out/ directory. Should we collect this idea on a new ticket, like, "Make updating status files independent of system time"?

comment:38 in reply to:  37 Changed 5 months ago by iwakeh

Replying to karsten:

Replying to iwakeh:

The list of exit_addresses is one more data item in onionoo's details documents that is depending on the current system time. Depending on the current time the addresses are added or omitted to the stored collection (cf. NodeDetailsStatusUpdater). Could the validAfter time of the latest consensus processed be a replacement for the system time?

Yes. We should probably do that. I haven't tried it out yet, nor have I looked around for similar places where we're depending on system time there.

However! That's a different issue. It's about making the update step less dependent on things like system time, though we won't be able to make it fully deterministic. Still a good thing to do. But it's different from making the write step deterministic where we're taking a status/ directory as input and writing files to the out/ directory. Should we collect this idea on a new ticket, like, "Make updating status files independent of system time"?

Oh, I took this ticket as meaning 'Make results from Onionoo deterministic', but writing and updating are different concerns. Will open a new ticket (rather with the meaning I perceived) and check for more issues in the update process while continuing the review.

(new ticket #25002)

Last edited 5 months ago by iwakeh (previous) (diff)

comment:39 Changed 5 months ago by karsten

Priority: HighVery High

Raising priority of this ticket even more, because it's currently blocking #24729 which confuses a lot of relay operators and leads to repeated questions.

comment:40 Changed 5 months ago by iwakeh

Status: acceptedmerge_ready

The made changes look ok, although the new graph history compiler code is still not making much use the java.time classes. But, changing it raises some questions about the calculations and would take too long for this ticket. Thus, making a note later on the java8 ticket for Onionoo.

The changes here and the upcoming changes in 25002 will make Onionooo distances independent of their local state as far as possible.

The change here leads to identical 'out' directories when run with the same 'status' directory contents. I have a few minor unrelated commits.

Comparing the 'out' dir resulting from Onionoo-4.4-1.8.0 with the resulting 'out' from the new 5.0-1.9.0 can only be done manually. So far, I think those differences are due to the changes here and thus intended, but that's hard to verify. Maybe, make a copy of 'out' and 'status' before deployment to have a rollback option in the unlikely case there are unwanted differences?

In total: merge ready.

comment:41 Changed 5 months ago by karsten

Thanks for checking!

Those three commits (87177e9, 638d895, and 6308c89) look good.

Let's discuss deployment at today's meeting.

comment:42 Changed 5 months ago by karsten

Here's a possible improvement: how about we use the last consensus valid-after time or bridge network status publication timestamp as timestamp rather than the last time we saw a specific relay or bridge? I think we called that "Tor time" on #25002. This would lead to a smaller out/ directory, because we wouldn't write as detailed history files anymore for relays or bridges that are long gone. Does this make sense?

comment:43 in reply to:  42 Changed 5 months ago by iwakeh

Replying to karsten:

Here's a possible improvement: how about we use the last consensus valid-after time or bridge network status publication timestamp

as timestamp

Which timestamp in particular are you referring to?

rather than the last time we saw a specific relay or bridge? I think we called that "Tor time" on #25002. This would lead to a smaller out/ directory, because we wouldn't write as detailed history files anymore for relays or bridges that are long gone. Does this make sense?

Regardless, 'Tor time' makes sense, b/c it is a value from the network and not 'added' to the data.

comment:44 Changed 5 months ago by karsten

I pushed two commits related to this branch to a branch that is under review for #24729. Those commits and comments there also ask the question above. Feel free to move that part of the discussion here, but I'm also happy to keep it there.

comment:45 Changed 5 months ago by iwakeh

The two commits are fine. Could we use a more telling name for 'retrieveLastSeenMillis', maybe 'retrieveLatestNetworkUpdateMillis' or else? Also throughout replace 'lastSeenMillis' with something that indicates the usage better. I don't really have a good naming idea though.

Leaving this on 'merge ready'.

comment:46 Changed 5 months ago by karsten

How about mostRecentStatusMillis?

comment:47 in reply to:  46 Changed 5 months ago by iwakeh

Replying to karsten:

How about mostRecentStatusMillis?

Perfect!

comment:48 Changed 4 months ago by karsten

Resolution: fixed
Status: merge_readyclosed

Great! Renamed in a separate commit, squashed all fixup/squash commits in that branch, and pushed to master. I'd say this one is safe to be closed now. Closing. Thanks!

comment:49 Changed 3 months ago by karsten

Milestone: Onionoo-2.0.0Onionoo 2.0.0

Milestone renamed

Note: See TracTickets for help on using tickets.