CollecTor should be re-configurable without restart

changed milestone to %CollecTor 1.0.0

added component::metrics/collector ctip milestone::CollecTor 1.0.0 owner::iwakeh priority::high resolution::fixed severity::normal status::closed type::enhancement labels

Trac:
Milestone: CollecTor 1.1.0 to CollecTor 1.0.0

Trac:
Status: new to assigned

The configuration is now configurable at runtime, except for the scheduler properties.

Changing task scheduling is tricky for several reasons:

the scheduler would need to keep track of all tasks and initial properties and
find out about the single changes in the latter
it should only make adjustments , when the task(s) to be changed are not running Can all be done, but seems out of scope here.

Please review my branch (two commits) (first commit for the one left "recent" path in bridgedescs).

Trac:
Status: assigned to needs_review

Just amended some checkstyle corrections to new code. All still ready for review.

This code is now running on the mirror.

I read through the changes and would like to read them once more, but before I do that: can you specify exactly what happens when the configuration changes? For example:

Do changes affect a currently running module execution, or are they only read at the beginning of the next execution?
Does it take up to 1 minute for changes to become effective, or do they become effective immediately?
Are changes that enable or disable modules or that change scheduling of modules ignored entirely, or is it harmful to even touch them?

Replying to karsten:

I read through the changes and would like to read them once more, but before I do that: can you specify exactly what happens when the configuration changes?

Good point. Below I add a description of the process.

Do changes affect a currently running module execution, or are they only read at the beginning of the next execution?

No, the changes will be read at the next execution.

Does it take up to 1 minute for changes to become effective, or do they become effective immediately?

When a module starts (see below).

Are changes that enable or disable modules or that change scheduling of modules ignored entirely, or is it harmful to even touch them?

They're ignored as the Scheduler only uses the initial properties once.

== Design of the Configuration Update

Changes to existing classes:

The Configuration class is now an Observable and has Properties.
All CollecTorMains use their own copy of the initial configuration and register as Observers.
Mainnow calls newConfigAvailable for setting the configuration.

Configuration begins to look at the modified time of the configuration file given as argument of setWatchableSourceAndLoad, i.e. a scheduled thread (currently every minute, which is usefully small, but an arbitrary choice) checks, if the file was modified. If it was modified, the properties held in 'Configuration' are cleared (CollecTorMains have their own copy each, they are not affected by this) and re-loaded from the file. If the loading succeeds, all Observers are informed.

CollecTorMain implements the update method of Observer. When this method is called, newConfigAvailable is set to true. In the run method this variable is checked and the new configuration is copied and used in this run and until a new configuration is available.

An unrelated change: I made Scheduler 'final' as it should not be extended.

Increasing priority, so this floats near the top of the stack of reviews.

Trac:
Priority: Medium to High

Okay, I agree with most design decisions here, except for one that might confuse operators more than necessary: the modified time of the configuration file should not be checked once per minute (or even more often) but rather at the start of a new module run. Imagine a situation where the next execution is less than 1 minute away and the operator makes a quick change to the configuration file. There's no way for them to find out whether that change will make it into the next run or not, because the thread might check in 1 second or in 59 seconds. Can we change that by taking out the thread and having modules check whether the configuration has changed?

A possible downside of this approach is that configuration problems cannot be found until the next module execution. But I think that's also when operators would expect problems to show up, not randomly 0-60 seconds after writing the configuration file.

Ideally, we'd have a shell script that we use to reload the configuration file and possibly to verify it without loading it. But that's for later.

Does this make sense?

First some inline answers and then a summary below.

Replying to karsten:

Okay, I agree with most design decisions here, except for one that might confuse operators more than necessary: the modified time of the configuration file should not be checked once per minute (or even more often) but rather at the start of a new module run. Imagine a situation where the next execution is less than 1 minute away and the operator makes a quick change to the configuration file. There's no way for them to find out whether that change will make it into the next run or not, because the thread might check in 1 second or in 59 seconds.

The Operator can answer this question from the log:

Edit collector.properties and save. The next check runs and reports the change:

2016-08-08 09:15:56,948 INFO o.t.c.c.Configuration:63 Configuration file was changed.

Modules starting after that will receive the new configuration, which is explicitly reported.

2016-08-08 09:16:57,026 INFO o.t.c.c.CollecTorMain:45 Module updateindex received new configuration.
2016-08-08 09:16:57,027 INFO o.t.c.c.CollecTorMain:53 Starting updateindex module of CollecTor.

Of course, at the moment there is a possible waiting time of 60sec. See summary below.

Can we change that by taking out the thread and having modules check whether the configuration has changed?

The modified config file would be read by each module. A potential error will be noticed very late and all modules would have to deal with it. And, all modules now have to keep track of when the configuration changed.

Ideally, we'd have a shell script that we use to reload the configuration file and possibly to verify it without loading it. But that's for later.

That would be two steps:

Edit and save.
Don't forget to inform the running application (not necessarily using a shell script, but it could be a wrapper for that).

Step 2 is easily forgotten especially if there is a stressful situation requiring the change.

=== Summary Maybe the following changes will help:

a) Keep the centralized modified-check. b) Decrease check interval < 5 sec (is this fine for the impatient?). c) Every module could also log that it received the notice about a config change (in the update method). Currently it logs only that it received the changed configuration before the run that uses it.

Using a) the config change can be detected centrally and a change will be effective when saved w/o additional steps for the operator. Part b) will make the application's response more immediate. Maybe even log the entire configuration file in TRACE level. Part c) also gives early feedback that the various modules will use the new configuration in their next run.

In addition to the existing logging this should be sufficient feedback to the operator. Here an example of the planned log entries:

Edit collector.properties and save; the resulting modified time of the properties file is 09:15:52. The next check runs a few seconds later and reports the change:

2016-08-08 09:15:56,948 INFO o.t.c.c.Configuration:63 Configuration file was changed.
2016-08-08 09:15:56,949 INFO o.t.c.c.Configuration:63 Module updateindex will use new configuration.
2016-08-08 09:15:56,949 INFO o.t.c.c.Configuration:63 Module relaydescs will use new configuration.
...
2016-08-08 09:16:57,026 INFO o.t.c.c.CollecTorMain:45 Module updateindex received new configuration.
2016-08-08 09:16:57,027 INFO o.t.c.c.CollecTorMain:53 Starting updateindex module of CollecTor.
...
2016-08-08 10:09:00,001 INFO o.t.c.c.CollecTorMain:45 Module relaydescs received new configuration.
2016-08-08 10:09:00,002 INFO o.t.c.c.CollecTorMain:53 Starting relaydescs module of CollecTor.

What do you think?

The impatient? Nooo, not true. The forgetful and easily confused!

Okay, one question and two suggestions:

What happens if the operator breaks the configuration file? Will current module runs be affected, that is, will they be aborted together with the scheduler? If not, will the scheduler make another attempt to re-read the configuration file 60 (or 5) seconds later? And if so, will it warn every 60 (or 5) seconds that the configuration is broken? Okay, that's more than one question, but these seem related.
10 or 15 seconds might work, too, if 5 is too small. Just enough to save the file and tail -f the log file to see how the changes are processed. So, I'd say anything between 5 and 15 would work, please pick your favorite.
The behavior of all this should be described at the top of collector.properties, so that operators know exactly what will happen when they edit the file while CollecTor is running rather than having to learn it from the logs after it has happened.

Sorry to make this more difficult, but I'm thinking of all the services I'm running and how to memorize how to do that, and I'm thinking of external folks running our services in the near future and making this as easy as possible for them. Thanks!

weasel notes on IRC that services that auto-check if their configuration files change are extremely suspicious. He says that a service should require some signal to re-read its configuration file.

Let's think again if we can implement something like that. I briefly looked whether we can catch a SIGHUP signal, but that doesn't seem very platform independent. Another way would be to touch a "reload-config" file of some sort and have the current thread check whether that exists. But there are probably better ways. Hmmmmm.

Replying to karsten:

The impatient? Nooo, not true. The forgetful and easily confused!

:-)

Okay, one question and two suggestions:

What happens if the operator breaks the configuration file? Will current module runs be affected, that is, will they be aborted together with the scheduler? If not, will the scheduler make another attempt to re-read the configuration file 60 (or 5) seconds later? And if so, will it warn every 60 (or 5) seconds that the configuration is broken? Okay, that's more than one question, but these seem related.

The design should be quite robust concerning editing errors. First, there are actually two categories of 'breaking' the configuration:

syntactically: the properties syntax is wrong and the file cannot be read by java.util.Properties
semantically: a valid property contains a bad value, e.g. a valid but wrong URL.

Case 1: an error is logged and the modules keep running with their working configuration, i.e. the modules don't hear about a new config.

Case 2: All modules are informed and will use the new configuration with their next run. Depending on the 'wrongness' of the value supplied the module affected will in worst case throw an Exception. All modules with correct properties keep running fine. The affected module will be rescheduled at the scheduled time from the start-up configuration.

In both cases Configuration keeps checking the modified-time, no matter what the edit caused, but it won't re-read unless there was another change. (Looking at the code after working on #19771 (moved) I'm going to change the catch clause into catch-all, too.)

10 or 15 seconds might work, too, if 5 is too small. Just enough to save the file and tail -f the log file to see how the changes are processed. So, I'd say anything between 5 and 15 would work, please pick your favorite.

Fine.

The behavior of all this should be described at the top of collector.properties, so that operators know exactly what will happen when they edit the file while CollecTor is running rather than having to learn it from the logs after it has happened.

This is a very important point!

Sorry to make this more difficult, but I'm thinking of all the services I'm running and how to memorize how to do that, and I'm thinking of external folks running our services in the near future and making this as easy as possible for them. Thanks!

In the opposite, voicing all the questions and doubts is important! It surely takes less time to write about things beforehand than troubleshooting an externally running instance later and preparing bugfix-releases.

Thanks!

Are there any more questions/suggestions? Can I start with the proposed changes?

Replying to karsten:

weasel notes on IRC that services that auto-check if their configuration files change are extremely suspicious. He says that a service should require some signal to re-read its configuration file.

Let's think again if we can implement something like that. I briefly looked whether we can catch a SIGHUP signal, but that doesn't seem very platform independent. Another way would be to touch a "reload-config" file of some sort and have the current thread check whether that exists. But there are probably better ways. Hmmmmm.

I didn't notice that before answering the previous comment. So:

Well, how much of a 'service' the different modules in CollecTor constitute is surely debatable.

Therefor two questions:

What is the use-case that justifies here when having to edit and then signal a change?
Isn't it worse in this particular case to forget the second step?

Anyway, if we decide to have the two-level re-config, one can always use other Java ways of 'signalling' the application. But that will be a future release.

We should probably first come to a conclusion how the configuration reload is triggered. I tried to come up with a few examples how other services handle configuration changes:

Apache lets you edit config files and won't bother looking at them until you service apache2 reload. I find that pretty easy to work with.
Tomcat checks every few seconds if there are new or updated webapps in its webapps folder and auto-deploys them. I'm not a big fan of that approach because of the random (though short) delay, but I got used to it.
visudo lets you edit the sudoers file and makes changes effective immediately upon saving. I'm okay with that, but I don't like how this is bound to a specific editor (which you can change, but which still seems cumbersome).
crontab -e lets you update the crontab and makes changes effective immediately. I'm always very careful and afraid of making mistakes when making changes, which I wouldn't be if there was a separate reload step and I'd have the chance to review the configuration before reloading.

So, generalizing from these examples:

I prefer using the editor of my choice, without having to look up how to define that editor somewhere, but just by opening the config file, editing, and saving.
I prefer that changes become effective immediately when I trigger the update, not at a random time after the change.
I prefer being able to review configuration changes, for example by copying the original file before editing and comparing the copy to the updated file after saving.

However, those are only my preferences. I'd like us to find a solution that's least surprising to current and future operators. I wonder if we should prepare an operator's manual with different options (not only for making configuration changes but also for configuring logging and other things unrelated to this ticket) and give that to potential operators to hear what they'd find most usable and least surprising. Hmm.

I would like to separate concerns:

First of all we provide a product that should work, be easy to deploy, and maintainable. The deployment part is already pretty easy now: one executable jar and one config file. Similarly maintenance, which includes troubleshooting of remote instances. For helping an external operator we simply need: their collector.properties, the revision number of the jar, and some log files.

How operation is actually managed is not our concern. We should strive to make many things possible. With the setup we chose so far that's the case.

Actually all the three nice-to-haves you mention are already possible:

So, generalizing from these examples:

I prefer using the editor of my choice, without having to look up how to define that editor somewhere, but just by opening the config file, editing, and saving.

That's clearly met.

I prefer that changes become effective immediately when I trigger the update, not at a random time after the change.

I admit the meaning of immediate is stretchable. At the latest five seconds after saving the change will be acknowledged by CollecTor; and it also informs about when the active modules will use the new config.

I prefer being able to review configuration changes, for example by copying the original file before editing and comparing the copy to the updated file after saving.

That is an operational question, which doesn't affect the product design. By adding collector.properties to a git repo you have what you require. Ensuring that the changes are committed is part of the operation.

Of course, there are more operational issues: who may alter the configuration and the like. These are not part of the application/product.

Saying that some issues are part of operation doesn't mean they're unimportant. I think they will be important at a later stage when we want to provide a Debian package of CollecTor, for example. Then we need to think about how the re-configuration mechanism has to be wrapped, where to put the configuration, etc.

The way it is now, leaves lots of room to accommodate many operational styles, I think.

So, I really would like to include the re-config into the first release. And, then we improve things. For a first release it's fine.

No matter, which release this will be included in. Here are the changes discussed above. So, the useful suggestions from above won't be lost in this long list of comments.

Please review the top commit in addition.

Hmmm hmmmm hmmm hmmm hmmmmmmmm. In other words, I'm unclear what's best here. I would like us to, at some point, re-evaluate how we're implementing configuration, logging, and possibly other things closely related to operation. For example, a while ago it took me a moment to accept that CollecTor writes the default collector.properties if that file didn't exist; that was a bit too much magic for me, though I see the benefit. I'm also not entirely happy with the way we're logging to LOGBASE_IS_UNDEFINED/ if the operator doesn't specify a directory, though I don't have a better idea how to do this right now. And now I'm slightly concerned about the configuration file to be re-read without the operator explicitly asking CollecTor to do that (see my earlier comments before weasel chimed in).

But I agree that we'll have to rethink these things anyway when we talk about providing (Debian) packages. And I agree that having a way to reconfigure CollecTor while it's running is useful. So, I suggest that I'll review the changes you posted and then merge them later today or early tomorrow. And we create a ticket for a later milestone where we reconsider these design decisions about configuration, logging, etc. How's that?

That's fine. Ticket for re-evaluation is #19873 (moved), currently set for milestone 1.2.0.

CollecTor should be re-configurable without restart

Child items ...

Activity