Opened 4 years ago

Closed 4 years ago

#19720 closed enhancement (fixed)

CollecTor should be re-configurable without restart

Reported by: iwakeh Owned by: iwakeh
Priority: High Milestone: CollecTor 1.0.0
Component: Metrics/CollecTor Version:
Severity: Normal Keywords: ctip
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

This is based on #19018:
CollecTor should monitor the given configuration file and apply configuration changes, i.e., changes that are saved to the file, to new module runs.

Child Tickets

Change History (31)

comment:1 Changed 4 years ago by iwakeh

Milestone: CollecTor 1.1.0CollecTor 1.0.0

comment:2 Changed 4 years ago by iwakeh

Status: newassigned

comment:3 Changed 4 years ago by iwakeh

Status: assignedneeds_review

The configuration is now configurable at runtime, except for the scheduler properties.

Changing task scheduling is tricky for several reasons:

  • the scheduler would need to keep track of all tasks and initial properties and
  • find out about the single changes in the latter
  • it should only make adjustments , when the task(s) to be changed are not running

Can all be done, but seems out of scope here.

Please review my branch (two commits) (first commit for the one left "recent" path in bridgedescs).

comment:4 Changed 4 years ago by iwakeh

Just amended some checkstyle corrections to new code.
All still ready for review.

This code is now running on the mirror.

comment:5 Changed 4 years ago by karsten

I read through the changes and would like to read them once more, but before I do that: can you specify exactly what happens when the configuration changes? For example:

  • Do changes affect a currently running module execution, or are they only read at the beginning of the next execution?
  • Does it take up to 1 minute for changes to become effective, or do they become effective immediately?
  • Are changes that enable or disable modules or that change scheduling of modules ignored entirely, or is it harmful to even touch them?

comment:6 in reply to:  5 Changed 4 years ago by iwakeh

Replying to karsten:

I read through the changes and would like to read them once more, but before I do that: can you specify exactly what happens when the configuration changes?

Good point. Below I add a description of the process.

  • Do changes affect a currently running module execution, or are they only read at the beginning of the next execution?

No, the changes will be read at the next execution.

  • Does it take up to 1 minute for changes to become effective, or do they become effective immediately?

When a module starts (see below).

  • Are changes that enable or disable modules or that change scheduling of modules ignored entirely, or is it harmful to even touch them?

They're ignored as the Scheduler only uses the initial properties once.

Design of the Configuration Update

Changes to existing classes:

  • The Configuration class is now an Observable and has Properties.
  • All CollecTorMains use their own copy of the initial configuration and register as Observers.
  • Mainnow calls newConfigAvailable for setting the configuration.

Configuration begins to look at the modified time of the configuration file given as argument of setWatchableSourceAndLoad, i.e. a scheduled thread (currently every minute, which is usefully small, but an arbitrary choice) checks, if the file was modified. If it was modified, the properties held in 'Configuration' are cleared (CollecTorMains have their own copy each, they are not affected by this) and re-loaded from the file. If the loading succeeds, all Observers are informed.

CollecTorMain implements the update method of Observer. When this method is called, newConfigAvailable is set to true. In the run method this variable is checked and the new configuration is copied and used in this run and until a new configuration is available.


An unrelated change: I made Scheduler 'final' as it should not be extended.

comment:7 Changed 4 years ago by iwakeh

Priority: MediumHigh

Increasing priority, so this floats near the top of the stack of reviews.

comment:8 Changed 4 years ago by karsten

Okay, I agree with most design decisions here, except for one that might confuse operators more than necessary: the modified time of the configuration file should not be checked once per minute (or even more often) but rather at the start of a new module run. Imagine a situation where the next execution is less than 1 minute away and the operator makes a quick change to the configuration file. There's no way for them to find out whether that change will make it into the next run or not, because the thread might check in 1 second or in 59 seconds. Can we change that by taking out the thread and having modules check whether the configuration has changed?

A possible downside of this approach is that configuration problems cannot be found until the next module execution. But I think that's also when operators would expect problems to show up, not randomly 0-60 seconds after writing the configuration file.

Ideally, we'd have a shell script that we use to reload the configuration file and possibly to verify it without loading it. But that's for later.

Does this make sense?

comment:9 Changed 4 years ago by iwakeh

First some inline answers and then a summary below.

Replying to karsten:

Okay, I agree with most design decisions here, except for one that might confuse operators more than necessary: the modified time of the configuration file should not be checked once per minute (or even more often) but rather at the start of a new module run. Imagine a situation where the next execution is less than 1 minute away and the operator makes a quick change to the configuration file. There's no way for them to find out whether that change will make it into the next run or not, because the thread might check in 1 second or in 59 seconds.

The Operator can answer this question from the log:

Edit collector.properties and save. The next check runs and reports the change:

2016-08-08 09:15:56,948 INFO o.t.c.c.Configuration:63 Configuration file was changed.

Modules starting after that will receive the new configuration, which is explicitly reported.

2016-08-08 09:16:57,026 INFO o.t.c.c.CollecTorMain:45 Module updateindex received new configuration.
2016-08-08 09:16:57,027 INFO o.t.c.c.CollecTorMain:53 Starting updateindex module of CollecTor.

Of course, at the moment there is a possible waiting time of 60sec. See summary below.

Can we change that by taking out the thread and having modules check whether the configuration has changed?

The modified config file would be read by each module. A potential error will be noticed very late and all modules would have to deal with it. And, all modules now have to keep track of when the configuration changed.

Ideally, we'd have a shell script that we use to reload the configuration file and possibly to verify it without loading it. But that's for later.

That would be two steps:

  1. Edit and save.
  2. Don't forget to inform the running application (not necessarily using a shell script, but it could be a wrapper for that).

Step 2 is easily forgotten especially if there is a stressful situation requiring the change.

Summary

Maybe the following changes will help:

a) Keep the centralized modified-check.
b) Decrease check interval < 5 sec (is this fine for the impatient?).
c) Every module could also log that it received the notice about a config change (in the update method). Currently it logs only that it received the changed configuration before the run that uses it.

Using a) the config change can be detected centrally and a change will be effective when saved w/o additional steps for the operator.
Part b) will make the application's response more immediate. Maybe even log the entire configuration file in TRACE level.
Part c) also gives early feedback that the various modules will use the new configuration in their next run.

In addition to the existing logging this should be sufficient feedback to the operator.
Here an example of the planned log entries:

Edit collector.properties and save; the resulting modified time of the properties file is 09:15:52. The next check runs a few seconds later and reports the change:

2016-08-08 09:15:56,948 INFO o.t.c.c.Configuration:63 Configuration file was changed.
2016-08-08 09:15:56,949 INFO o.t.c.c.Configuration:63 Module updateindex will use new configuration.
2016-08-08 09:15:56,949 INFO o.t.c.c.Configuration:63 Module relaydescs will use new configuration.
...
2016-08-08 09:16:57,026 INFO o.t.c.c.CollecTorMain:45 Module updateindex received new configuration.
2016-08-08 09:16:57,027 INFO o.t.c.c.CollecTorMain:53 Starting updateindex module of CollecTor.
...
2016-08-08 10:09:00,001 INFO o.t.c.c.CollecTorMain:45 Module relaydescs received new configuration.
2016-08-08 10:09:00,002 INFO o.t.c.c.CollecTorMain:53 Starting relaydescs module of CollecTor.

What do you think?

Last edited 4 years ago by iwakeh (previous) (diff)

comment:10 Changed 4 years ago by karsten

The impatient? Nooo, not true. The forgetful and easily confused!

Okay, one question and two suggestions:

  • What happens if the operator breaks the configuration file? Will current module runs be affected, that is, will they be aborted together with the scheduler? If not, will the scheduler make another attempt to re-read the configuration file 60 (or 5) seconds later? And if so, will it warn every 60 (or 5) seconds that the configuration is broken? Okay, that's more than one question, but these seem related.
  • 10 or 15 seconds might work, too, if 5 is too small. Just enough to save the file and tail -f the log file to see how the changes are processed. So, I'd say anything between 5 and 15 would work, please pick your favorite.
  • The behavior of all this should be described at the top of collector.properties, so that operators know exactly what will happen when they edit the file while CollecTor is running rather than having to learn it from the logs after it has happened.

Sorry to make this more difficult, but I'm thinking of all the services I'm running and how to memorize how to do that, and I'm thinking of external folks running our services in the near future and making this as easy as possible for them. Thanks!

comment:11 Changed 4 years ago by karsten

weasel notes on IRC that services that auto-check if their configuration files change are extremely suspicious. He says that a service should require some signal to re-read its configuration file.

Let's think again if we can implement something like that. I briefly looked whether we can catch a SIGHUP signal, but that doesn't seem very platform independent. Another way would be to touch a "reload-config" file of some sort and have the current thread check whether that exists. But there are probably better ways. Hmmmmm.

comment:12 in reply to:  10 Changed 4 years ago by iwakeh

Replying to karsten:

The impatient? Nooo, not true. The forgetful and easily confused!

:-)

Okay, one question and two suggestions:

  • What happens if the operator breaks the configuration file? Will current module runs be affected, that is, will they be aborted together with the scheduler? If not, will the scheduler make another attempt to re-read the configuration file 60 (or 5) seconds later? And if so, will it warn every 60 (or 5) seconds that the configuration is broken? Okay, that's more than one question, but these seem related.

The design should be quite robust concerning editing errors.
First, there are actually two categories of 'breaking' the configuration:

  1. syntactically: the properties syntax is wrong and the file cannot be read by java.util.Properties
  2. semantically: a valid property contains a bad value, e.g. a valid but wrong URL.

Case 1: an error is logged and the modules keep running with their working configuration, i.e. the modules don't hear about a new config.

Case 2: All modules are informed and will use the new configuration with their next run. Depending on the 'wrongness' of the value supplied the module affected will in worst case throw an Exception. All modules with correct properties keep running fine. The affected module will be rescheduled at the scheduled time from the start-up configuration.

In both cases Configuration keeps checking the modified-time, no matter what the edit caused, but it won't re-read unless there was another change. (Looking at the code after working on #19771 I'm going to change the catch clause into catch-all, too.)

  • 10 or 15 seconds might work, too, if 5 is too small. Just enough to save the file and tail -f the log file to see how the changes are processed. So, I'd say anything between 5 and 15 would work, please pick your favorite.

Fine.

  • The behavior of all this should be described at the top of collector.properties, so that operators know exactly what will happen when they edit the file while CollecTor is running rather than having to learn it from the logs after it has happened.

This is a very important point!

Sorry to make this more difficult, but I'm thinking of all the services I'm running and how to memorize how to do that, and I'm thinking of external folks running our services in the near future and making this as easy as possible for them. Thanks!

In the opposite, voicing all the questions and doubts is important!
It surely takes less time to write about things beforehand than troubleshooting an externally running instance later and preparing bugfix-releases.

Thanks!

Are there any more questions/suggestions? Can I start with the proposed changes?

comment:13 in reply to:  11 Changed 4 years ago by iwakeh

Replying to karsten:

weasel notes on IRC that services that auto-check if their configuration files change are extremely suspicious. He says that a service should require some signal to re-read its configuration file.

Let's think again if we can implement something like that. I briefly looked whether we can catch a SIGHUP signal, but that doesn't seem very platform independent. Another way would be to touch a "reload-config" file of some sort and have the current thread check whether that exists. But there are probably better ways. Hmmmmm.

I didn't notice that before answering the previous comment. So:

Well, how much of a 'service' the different modules in CollecTor constitute is surely debatable.

Therefor two questions:

  • What is the use-case that justifies here when having to edit and then signal a change?
  • Isn't it worse in this particular case to forget the second step?

Anyway, if we decide to have the two-level re-config, one can always use other Java ways of 'signalling' the application. But that will be a future release.

comment:14 Changed 4 years ago by karsten

We should probably first come to a conclusion how the configuration reload is triggered. I tried to come up with a few examples how other services handle configuration changes:

  • Apache lets you edit config files and won't bother looking at them until you service apache2 reload. I find that pretty easy to work with.
  • Tomcat checks every few seconds if there are new or updated webapps in its webapps folder and auto-deploys them. I'm not a big fan of that approach because of the random (though short) delay, but I got used to it.
  • visudo lets you edit the sudoers file and makes changes effective immediately upon saving. I'm okay with that, but I don't like how this is bound to a specific editor (which you can change, but which still seems cumbersome).
  • crontab -e lets you update the crontab and makes changes effective immediately. I'm always very careful and afraid of making mistakes when making changes, which I wouldn't be if there was a separate reload step and I'd have the chance to review the configuration before reloading.

So, generalizing from these examples:

  • I prefer using the editor of my choice, without having to look up how to define that editor somewhere, but just by opening the config file, editing, and saving.
  • I prefer that changes become effective immediately when I trigger the update, not at a random time after the change.
  • I prefer being able to review configuration changes, for example by copying the original file before editing and comparing the copy to the updated file after saving.

However, those are only my preferences. I'd like us to find a solution that's least surprising to current and future operators. I wonder if we should prepare an operator's manual with different options (not only for making configuration changes but also for configuring logging and other things unrelated to this ticket) and give that to potential operators to hear what they'd find most usable and least surprising. Hmm.

comment:15 Changed 4 years ago by iwakeh

I would like to separate concerns:

First of all we provide a product that should work, be easy to deploy, and maintainable.
The deployment part is already pretty easy now: one executable jar and one config file.
Similarly maintenance, which includes troubleshooting of remote instances. For helping
an external operator we simply need: their collector.properties, the
revision number of the jar, and some log files.

How operation is actually managed is not our concern. We should strive to make
many things possible. With the setup we chose so far that's the case.

Actually all the three nice-to-haves you mention are already possible:

So, generalizing from these examples:

  • I prefer using the editor of my choice, without having to look up how to define that editor somewhere, but just by opening the config file, editing, and saving.

That's clearly met.

  • I prefer that changes become effective immediately when I trigger the update, not at a random time after the change.

I admit the meaning of immediate is stretchable. At the latest five seconds after saving the change will be acknowledged by CollecTor; and it also informs about when the active modules will use the new config.

  • I prefer being able to review configuration changes, for example by copying the original file before editing and comparing the copy to the updated file after saving.

That is an operational question, which doesn't affect the product design. By adding collector.properties to a git repo you have what you require. Ensuring that the changes
are committed is part of the operation.

Of course, there are more operational issues: who may alter the configuration and the like.
These are not part of the application/product.

Saying that some issues are part of operation doesn't mean they're unimportant.
I think they will be important at a later stage when we want to provide a Debian
package of CollecTor, for example. Then we need to think about how the re-configuration
mechanism has to be wrapped, where to put the configuration, etc.

The way it is now, leaves lots of room to accommodate many operational styles, I think.

So, I really would like to include the re-config into the first release.
And, then we improve things. For a first release it's fine.

comment:16 Changed 4 years ago by iwakeh

No matter, which release this will be included in.
Here are the changes discussed above. So, the useful suggestions from above won't be lost in this long list of comments.

Please review the top commit in addition.

Last edited 4 years ago by iwakeh (previous) (diff)

comment:17 Changed 4 years ago by karsten

Hmmm hmmmm hmmm hmmm hmmmmmmmm. In other words, I'm unclear what's best here. I would like us to, at some point, re-evaluate how we're implementing configuration, logging, and possibly other things closely related to operation. For example, a while ago it took me a moment to accept that CollecTor writes the default collector.properties if that file didn't exist; that was a bit too much magic for me, though I see the benefit. I'm also not entirely happy with the way we're logging to LOGBASE_IS_UNDEFINED/ if the operator doesn't specify a directory, though I don't have a better idea how to do this right now. And now I'm slightly concerned about the configuration file to be re-read without the operator explicitly asking CollecTor to do that (see my earlier comments before weasel chimed in).

But I agree that we'll have to rethink these things anyway when we talk about providing (Debian) packages. And I agree that having a way to reconfigure CollecTor while it's running is useful. So, I suggest that I'll review the changes you posted and then merge them later today or early tomorrow. And we create a ticket for a later milestone where we reconsider these design decisions about configuration, logging, etc. How's that?

comment:18 Changed 4 years ago by iwakeh

That's fine. Ticket for re-evaluation is #19873, currently set for milestone 1.2.0.

comment:19 Changed 4 years ago by karsten

Please review my task-19720 branch which contains your patches, rebased to master. And please take a look at the temp commit that ignores two tests that somehow don't pass. We should fix those tests. Happy to apply a fixup commit and revert that temp commit. (I'll also add more commits to that same branch in preparation of the release. Feel free to ignore those in the context of this ticket.)

comment:20 Changed 4 years ago by iwakeh

That's odd, I checked out your merged branch, removed the Ignore annotations from ConfigurationTest and everything worked fine.

Maybe, let the tests wait longer, i.e. set the lines MainTest.waitSec(6); of the respective tests to MainTest.waitSec(10);.
Could you try if that works?

comment:21 Changed 4 years ago by karsten

Hmm, no, that doesn't fix the tests. Here's what I get after reverting that Ignore temp commit:

    [junit] Testcase: testBoolValues took 0.513 sec
    [junit] Testcase: testPathValueException took 0.006 sec
    [junit] Testcase: testArrayValueException took 0.001 sec
    [junit] Testcase: testConfigUnreadable took 6.015 sec
    [junit] 	FAILED
    [junit] expected:<0> but was:<2>
    [junit] junit.framework.AssertionFailedError: expected:<0> but was:<2>
    [junit] 	at org.torproject.collector.conf.ConfigurationTest.testConfigUnreadable(ConfigurationTest.java:212)
    [junit] 
    [junit] Testcase: testConfiguration took 0 sec
    [junit] Testcase: testIntValueException took 0.001 sec
    [junit] Testcase: testSetWatchableSourceAndLoad took 0.002 sec
    [junit] Testcase: testArrayArrayValueException took 0.001 sec
    [junit] Testcase: testArrayValues took 0.004 sec
    [junit] Testcase: testIntValues took 0.001 sec
    [junit] Testcase: testConfigChange took 6.006 sec
    [junit] 	FAILED
    [junit] Update was not called.
    [junit] junit.framework.AssertionFailedError: Update was not called.
    [junit] 	at org.torproject.collector.conf.ConfigurationTest.testConfigChange(ConfigurationTest.java:188)
    [junit] 
    [junit] Testcase: testFileValues took 0.001 sec
    [junit] Testcase: testBoolValueException took 0.001 sec
    [junit] Testcase: testKeyCount took 0.001 sec
    [junit] Testcase: testArrayArrayValues took 0.001 sec
    [junit] Testcase: testUrlValueException took 0.001 sec

comment:22 Changed 4 years ago by iwakeh

What file system are you testing on?

Does the following make the second test work?

diff --git a/src/test/java/org/torproject/collector/conf/ConfigurationTest.java b/src/test/java/org/torproject/collector/conf/ConfigurationTest.java
index 6013d09..24ab17d 100644
--- a/src/test/java/org/torproject/collector/conf/ConfigurationTest.java
+++ b/src/test/java/org/torproject/collector/conf/ConfigurationTest.java
@@ -173,7 +173,6 @@ public class ConfigurationTest {
     conf.setWatchableSourceAndLoad(Paths.get("/tmp/phantom.path"));
   }
 
-  @Ignore("Test fails: Update was not called.")
   @Test()
   public void testConfigChange() throws Exception {
     Configuration conf = new Configuration();
@@ -193,7 +192,6 @@ public class ConfigurationTest {
     assertFalse("Update was called.", called.get());
   }
 
-  @Ignore("Test fails: expected:<0> but was:<2>")
   @Test()
   public void testConfigUnreadable() throws Exception {
     Configuration conf = new Configuration();
@@ -206,17 +204,19 @@ public class ConfigurationTest {
     File confFile = tmpf.newFile("empty");
     conf.setWatchableSourceAndLoad(confFile.toPath());
     confFile.delete();
-    conf.setProperty(Key.CompressRelayDescriptorDownloads.name(), "false");
-    conf.setProperty(Key.ImportDirectoryArchives.name(), "false");
     Dummy dummy = new Dummy(conf);
+    Field confField = CollecTorMain.class.getDeclaredField("config");
+    confField.setAccessible(true);
+    Configuration dummiesConf = (Configuration)(confField.get(dummy));
+    dummiesConf.setProperty(Key.CompressRelayDescriptorDownloads.name(),
+        "false");
+    dummiesConf.setProperty(Key.ImportDirectoryArchives.name(), "false");
+    assertEquals(2, dummiesConf.size());
     tmpf.newFolder("empty");
     MainTest.waitSec(6);
     assertFalse("Update was called.", called.get());
     assertEquals(0, conf.size());
-    Field confField = CollecTorMain.class.getDeclaredField("config");
-    confField.setAccessible(true);
-    int size = ((Configuration)(confField.get(dummy))).size();
-    assertEquals(2, size);
+    assertEquals(2, dummiesConf.size());
   }
 
 }

comment:23 Changed 4 years ago by iwakeh

A change for testConfigChange; now really writing to the file.

@@ -19,6 +19,7 @@ import org.junit.rules.TemporaryFolder;
 import java.io.ByteArrayInputStream;
 import java.io.File;
 import java.lang.reflect.Field;
+import java.nio.file.Files;
 import java.nio.file.Paths;
 import java.util.Arrays;
 import java.util.Observable;
@@ -173,7 +174,6 @@ public class ConfigurationTest {
     conf.setWatchableSourceAndLoad(Paths.get("/tmp/phantom.path"));
   }
 
-  @Ignore("Test fails: Update was not called.")
   @Test()
   public void testConfigChange() throws Exception {
     Configuration conf = new Configuration();
@@ -183,17 +183,17 @@ public class ConfigurationTest {
           called.set(true);
         }
       });
-    File confFile = tmpf.newFile("empty");
+    File confFile = tmpf.newFile("emptyConf");
     conf.setWatchableSourceAndLoad(confFile.toPath());
     confFile.setLastModified(System.currentTimeMillis());
-    MainTest.waitSec(6);
+    Files.write(confFile.toPath(), "##".getBytes());
+    MainTest.waitSec(10);
     assertTrue("Update was not called.", called.get());
     called.set(false);
-    MainTest.waitSec(6);
+    MainTest.waitSec(10);
     assertFalse("Update was called.", called.get());
   }

comment:24 Changed 4 years ago by karsten

File system is /dev/disk1 on / (hfs, local, journaled) on Mac OS X El Capitan 10.11.6. Applying both diffs manually fixes testConfigUnreadable() but still doesn't fix testConfigChange(). Hmm.

comment:25 Changed 4 years ago by iwakeh

Well, as is said in the INSTALL.md you're on your own when not on Debian stable.

Could you verify that the tests w/o the above changes pass on the target server environment?
If that is the case, it's fine.

The modified time is apparently set differently or the implementation between Java/filesystem is different on OSX/hfs than on Linux/ext.

Last edited 4 years ago by iwakeh (previous) (diff)

comment:26 Changed 4 years ago by iwakeh

Here the api doc from Oracle.

comment:27 Changed 4 years ago by iwakeh

Suspicion confirmed MacOSX hfs always reports creation time as modified time. Thus, the update is not triggered and the tests fail correctly.

Verified on OS X Lion hfs using ugly System.err to print Files.getLastModifiedTime(confFile.toPath()) inside the tests.

comment:28 Changed 4 years ago by karsten

Hmm. I didn't have time to look into the issue yet, but is there a workaround that fixes this issue on OS X? It's one thing to say that operating instructions are written for a specific operating system. But it's another thing to knowingly break, or not fix, the development environment on a previously working operating system.

comment:29 Changed 4 years ago by karsten

Okay, after looking very briefly into the issue, I don't quite understand your statement that "hfs always reports creation time as modified time". I'm using File.lastModified() in other places without issues where it gives me the modified time, not the creation time. Can you give a minimal example that works on Linux and doesn't work on OS X?

comment:30 in reply to:  29 Changed 4 years ago by iwakeh

More precisely:
The java implementations differ and the one on osx reports the same values for the test cases, where the linux implementation gives different values. So, the time resolution for changes on linux/java seems finer than on osx/java.

  • Linux (time printed immediately before and after setting the time in testConfigChange)
       [junit] c 1 2016-08-10T18:42:22.57207Z
       [junit] c 2 2016-08-10T18:42:22.576Z
    
  • OSX gives two times the same value (not pasted).

This branch makes the tests work on OS X, too.

Last edited 4 years ago by iwakeh (previous) (diff)

comment:31 Changed 4 years ago by karsten

Resolution: fixed
Status: needs_reviewclosed

Great, it works! Cherry-picked and pushed to master, together with the other commits in my task-19720 branch. Closing. Thanks!

Note: See TracTickets for help on using tickets.